All-in-one serverless DevOps platform.
Full-stack visibility across the entire stack.
Detect and resolve incidents in record time.
Conform to industry best practices.
One of the most common concerns when moving to the cloud is cost. Given that cloud allows you to turn IT costs from CAPEX (long-term investments ex. in hardware equipment and software licenses) into OPEX (day-to-day operating expenses), it’s crucial to choose the right service and estimate it properly. In this article, we’ll look at the common pitfalls and discuss how you can avoid them to truly benefit from the cloud’s elasticity.
The lift and shift approach means that you are moving an exact copy of your workload to the cloud with as few changes as possible. Even though this pattern may be useful if you want to move to the cloud quickly, it may lead to suboptimal usage of your resources. AWS acknowledged that this is a difficult problem by creating services to make this migration easier (CloudEndure Migration and AWS Server Migration Service). Still, for the best possible resource utilization, it’s best to consider rearchitecting your solution for the cloud.
With lift and shift, you are potentially leaving a lot of money on the table, when looking at it long-term. You would also likely miss out on many benefits your cloud provider can offer. For instance, when choosing fully-managed AWS Aurora over a traditional Postgres instance, you can gain (among others) 3x more throughput, storage autoscaling, and low-latency read replicas. This may be the reason why Aurora is currently one of the most popular and fastest-growing services on AWS.
It’s difficult to improve something if you don’t have enough data to make an informed decision about it. If you have no way of tracking how your cloud resources perform and how much costs they incur, it’s difficult to optimize their utilization.
It’s considered a best practice to tag your resources based on projects or organizational units to correctly allocate costs to the corresponding services.
Managing cloud architecture is not a one-off process. It’s a continuous practice of monitoring and evaluating what you use, how you use it, and why. Perhaps your original assumptions about the growth of a specific application turned out to be not entirely right and making a change could significantly lower costs.
For instance, consider an overprovisioned Kubernetes cluster with many more nodes than needed. Perhaps moving to a serverless version (EKS on Fargate) makes more sense in such a scenario.
Leaving “zombie” resources running unmonitored is not as uncommon as you may think. In a larger organization, it can happen that some projects get abandoned and the corresponding resources remain active due to incomplete handover processes.
As software engineers, we may sometimes be tempted to building our own custom solutions and services for everything. A potentially better approach is to first do proper research of what’s already available. Examples:
In general, if there is a serverless or fully-managed solution, it makes sense to at least consider it before investing too much time and effort into your own solution that you would have to maintain entirely yourself.
Often when reading some Reddit or blog posts, I see many engineers who are reluctant to use serverless or container orchestration platforms simply because all they know is EC2 and manually administered servers. They assume that it’s all just a new technology that will “come and go” and there is, therefore, no need to change your ways. This implies that there is no merit in moving to container orchestration platforms, serverless and other cloud services. This seems to be a close-minded approach. It’s better to challenge our assumptions and judge new technologies with clear facts, costs, and performance benchmarks, rather than by skepticism towards what’s new.
If you would create an EC2 instance for every service and tool you manage, you would likely end up in a maintenance nightmare. But if you instead deploy each of your services to a container deployed to a Kubernetes (EKS) or Fargate (ECS) cluster, you can allocate much more resources into a single server instance due to dynamic port mapping and more compact resource utilization of containers (ex. shared layers).
Container orchestration platform will help you ensure that you balance the load between the instances and that your workloads will stay healthy. They take the capacity guesswork, to some extent, out of the picture. You can specify how many container instances should be running at all times and the control plane will ensure that it happens, just as you defined it.
If you can easily load balance your workload across many containers or serverless resources, then you no longer have to guess which EC2 or RDS instance size will be appropriate for your use case.
If you only consider the hardware or service costs, you may end up thinking that many resources can be more cost-effective on-prem. But if you add up the costs of additional maintenance, upgrades, and employees managing those servers, that’s an entirely different story.
If you scale your resources purely based on your current situation, you may fail to take into account how your needs may change in the future. What if your business and data grow much faster? What if it turns out to be the opposite? Is your application still easy to change and adapt to unknown future scenarios? And finally, will you be able to find and retain enough employees that can operate around those needs in the long run?
On the other extreme, if you want to be cautious, you may be tempted to overprovision everything to make sure you are ready for usage spikes. It’s a good strategy provided that you can justify the spikes based on past usage patterns. But it can be a bad strategy if you are doing it out of gut feeling.
Cloud allows elasticity in the sense that you can add nodes to your clusters, load balance the workload across more containers, or increase the number of vCPUs or memory size when you see the need for it. If configured and monitored properly, there is no need to overprovision anything. I’m not saying that right-sizing is easy (far from it), but with good processes and automation in place, it’s doable and can significantly save costs, especially when operating numerous resources at scale.
Sometimes the bottlenecks are not the compute resources, but rather a poorly chosen data store. It’s good to consider:
It’s naturally use-case dependent, but the databases often constitute the main bottleneck of any scalable architecture.
One possible solution to optimize your cloud resource utilization is to leverage automation. For instance, with Dashbird, you can keep track of your under- and overprovisioned resources and get notified about them. When using the well-architected lens dashboard, we can find out that our ECS cluster with EC2 instance type (non-serverless data plane) had a CPU utilization of over 90% within the last hour.
Then, we can drill down into specific time intervals and inspect further why this spike occurred.
At the same time, another containerized service may be overprovisioned, potentially leaving money on the table. Having this information allows you to optimize your resource configuration based on the actual usage patterns.
In this article, we investigated common pitfalls when sizing your cloud resources and discussed how to avoid them to truly benefit from the cloud’s elasticity. By making use of container orchestration platforms, serverless and fully-managed solutions, and by continuously monitoring your usage patterns over time, you can optimize your architecture for performance and costs.
Deploying AWS Lambda with Docker Containers
How Dashbird Innovates Serverless Monitoring
Cutting Step Functions Costs on Enterprise-Scale Workflows
Today we are excited to announce scheduled searches – a new feature on Dashbird that allows you to track any log event across your stack, turn it into time-series metric and also configure alert notifications based on it.
One of the most vital aspects to monitor is the metrics. You should know how your cluster performs and if it can keep up with the traffic. Learn more about monitoring Amazon OpenSearch Service.
Dashbird recently added support for ELB, so now you can keep track of your load balancers in one central place. It comes with all the information you expect from AWS monitoring services and more!
Dashbird was born out of our own need for an enhanced serverless debugging and monitoring tool, and we take pride in being developers.
Dashbird gives us a simple and easy to use tool to have peace of mind and know that all of our Serverless functions are running correctly. We are instantly aware now if there’s a problem. We love the fact that we have enough information in the Slack notification itself to take appropriate action immediately and know exactly where the issue occurred.
Thanks to Dashbird the time to discover the occurrence of an issue reduced from 2-4 hours to a matter of seconds or minutes. It also means that hundreds of dollars are saved every month.
Great onboarding: it takes just a couple of minutes to connect an AWS account to an organization in Dashbird. The UI is clean and gives a good overview of what is happening with the Lambdas and API Gateways in the account.
I mean, it is just extremely time-saving. It’s so efficient! I don’t think it’s an exaggeration or dramatic to say that Dashbird has been a lifesaver for us.
Dashbird provides an easier interface to monitor and debug problems with our Lambdas. Relevant logs are simple to find and view. Dashbird’s support has been good, and they take product suggestions with grace.
Great UI. Easy to navigate through CloudWatch logs. Simple setup.
Dashbird helped us refine the size of our Lambdas, resulting in significantly reduced costs. We have Dashbird alert us in seconds via email when any of our functions behaves abnormally. Their app immediately makes the cause and severity of errors obvious.
End-to-end observability and real-time error tracking for AWS applications.