10 Mistakes to Avoid When Sizing Cloud Resources

Anna Geller

March 29th, 2021

One of the most common concerns when moving to the cloud is cost. Because the cloud lets you turn IT costs from CAPEX (long-term investments, e.g., in hardware and software licenses) into OPEX (day-to-day operating expenses), it’s crucial to choose the right service and size it properly. In this article, we’ll look at the common pitfalls and discuss how you can avoid them to truly benefit from the cloud’s elasticity.

#1 Following the lift and shift approach

The lift and shift approach means moving an exact copy of your workload to the cloud with as few changes as possible. Even though this pattern may be useful if you want to move to the cloud quickly, it may lead to suboptimal usage of your resources. AWS acknowledged that this is a difficult problem by creating services to make this migration easier (CloudEndure Migration and AWS Server Migration Service). Still, to get the best possible resource utilization, consider rearchitecting your solution for the cloud.

With lift and shift, you are potentially leaving a lot of money on the table in the long term. You would also likely miss out on many benefits your cloud provider can offer. For instance, when choosing fully managed Amazon Aurora over a traditional Postgres instance, you can gain (among other things) up to 3x more throughput according to AWS, storage autoscaling, and low-latency read replicas. This may be the reason why Aurora is currently one of the most popular and fastest-growing services on AWS.

#2 Not tagging your resources

It’s difficult to improve something if you don’t have enough data to make an informed decision about it. If you have no way of tracking how your cloud resources perform and how much they cost, it’s difficult to optimize their utilization.

It’s considered a best practice to tag your resources based on projects or organizational units to correctly allocate costs to the corresponding services.
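To illustrate why tags matter for cost allocation, here is a minimal sketch in Python. The line items, the resource names, and the "project" tag key are all made up for illustration; in practice the records would come from your provider's billing export.

```python
from collections import defaultdict

# Hypothetical billing line items (invented for this example).
LINE_ITEMS = [
    {"resource": "ec2-web-1", "cost": 120.0, "tags": {"project": "shop"}},
    {"resource": "rds-orders", "cost": 310.5, "tags": {"project": "shop"}},
    {"resource": "lambda-etl", "cost": 42.0, "tags": {"project": "analytics"}},
    {"resource": "ec2-unknown", "cost": 75.0, "tags": {}},
]


def cost_by_tag(line_items, tag_key="project"):
    """Sum cost per tag value; resources without the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


print(cost_by_tag(LINE_ITEMS))
# {'shop': 430.5, 'analytics': 42.0, 'untagged': 75.0}
```

The "untagged" bucket is the interesting part: it is the share of the bill that nobody can currently be held accountable for.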

#3 Failing to monitor resource usage over time

Managing cloud architecture is not a one-off process. It’s a continuous practice of monitoring and evaluating what you use, how you use it, and why. Perhaps your original assumptions about the growth of a specific application turned out to be not entirely right and making a change could significantly lower costs.

For instance, consider an overprovisioned Kubernetes cluster with many more nodes than needed. Moving to a serverless data plane (EKS on Fargate) may make more sense in such a scenario.

Leaving “zombie” resources running unmonitored is not as uncommon as you may think. In a larger organization, it can happen that some projects get abandoned and the corresponding resources remain active due to incomplete handover processes.
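A simple way to spot zombie candidates is to look for resources whose utilization stays near zero over a longer period. The sketch below assumes you have already exported CPU samples per resource; the resource names, the sample values, and the 5% threshold are hypothetical and should be tuned to your workloads.

```python
from statistics import mean


def find_idle_resources(cpu_samples, threshold_pct=5.0):
    """Return the names of resources whose average CPU is below threshold_pct.

    Resources without any samples are skipped rather than flagged.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) < threshold_pct
    )


# Hypothetical CPU utilization samples in percent (invented for this example).
CPU_SAMPLES = {
    "api-prod": [35.0, 42.5, 38.0],
    "old-poc-server": [0.4, 0.6, 0.5],
    "batch-worker": [1.0, 80.0, 2.0],
}

print(find_idle_resources(CPU_SAMPLES))
# ['old-poc-server']
```

Note that the bursty batch worker is not flagged, because its average is well above the threshold. A low average alone does not prove a resource is abandoned, so treat the result as a list of candidates to review with their owners.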

#4 Always doing everything yourself

As software engineers, we may sometimes be tempted to build our own custom solutions and services for everything. A potentially better approach is to first research what’s already available. Examples:

Perhaps you don’t need this self-hosted database on EC2 and can instead use a fully managed RDS instance, which is much easier to scale and operate?
Or maybe you don’t need this self-managed RabbitMQ instance and can instead use SQS, a battle-tested serverless message queue?

In general, if there is a serverless or fully-managed solution, it makes sense to at least consider it before investing too much time and effort into your own solution that you would have to maintain entirely yourself.

#5 Using only tools you are familiar with

Often when reading Reddit threads or blog posts, I see engineers who are reluctant to use serverless or container orchestration platforms simply because all they know is EC2 and manually administered servers. They assume that these are just new technologies that will “come and go” and that there is, therefore, no need to change their ways. This implies that there is no merit in moving to container orchestration platforms, serverless, and other cloud services, which seems to be a closed-minded approach. It’s better to challenge our assumptions and judge new technologies by clear facts, costs, and performance benchmarks, rather than by skepticism towards what’s new.

#6 Not making use of serverless and container orchestration platforms

If you created an EC2 instance for every service and tool you manage, you would likely end up in a maintenance nightmare. But if you instead deploy each of your services as a container to a Kubernetes (EKS) or Fargate (ECS) cluster, you can pack many more services onto a single server instance thanks to dynamic port mapping and the more compact resource utilization of containers (e.g., shared layers).

A container orchestration platform helps you balance the load between instances and keep your workloads healthy. It takes the capacity guesswork, to some extent, out of the picture. You can specify how many container instances should be running at all times, and the control plane will ensure that this happens, just as you defined it.
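The "desired count" idea can be sketched as a small reconciliation step. This is a deliberately simplified illustration, not how ECS or Kubernetes is actually implemented; the `reconcile` function and the container records are hypothetical.

```python
def reconcile(desired_count, running):
    """Compare the desired state with the observed state.

    Returns a tuple (to_start, to_stop): how many new containers to start
    and which container ids to stop (unhealthy ones plus any surplus).
    """
    healthy = [c for c in running if c["healthy"]]
    unhealthy_ids = [c["id"] for c in running if not c["healthy"]]
    to_start = max(desired_count - len(healthy), 0)
    surplus_ids = [c["id"] for c in healthy[desired_count:]]
    return to_start, unhealthy_ids + surplus_ids


# Hypothetical observed state: two healthy containers and one unhealthy one.
observed = [
    {"id": "a", "healthy": True},
    {"id": "b", "healthy": False},
    {"id": "c", "healthy": True},
]

print(reconcile(3, observed))  # (1, ['b'])
print(reconcile(1, observed))  # (0, ['b', 'c'])
```

A real control plane runs a step like this continuously, which is why you declare the target state once instead of starting and stopping containers by hand.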

If you can easily load balance your workload across many containers or serverless resources, then you no longer have to guess which EC2 or RDS instance size will be appropriate for your use case.

#7 Not taking TCO into account

If you only consider the hardware or service costs, you may end up thinking that many resources can be more cost-effective on-prem. But if you add up the costs of additional maintenance, upgrades, and employees managing those servers, that’s an entirely different story.
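A back-of-the-envelope comparison makes the point concrete. Every number below is invented purely for illustration, and the two formulas are a deliberate simplification (no depreciation, power, data transfer, or discount effects), so plug in your own figures before drawing conclusions.

```python
def on_prem_tco(hardware, yearly_maintenance, yearly_staff, years):
    """Upfront hardware cost plus recurring maintenance and staff costs."""
    return hardware + years * (yearly_maintenance + yearly_staff)


def cloud_tco(monthly_service_fee, yearly_staff, years):
    """Recurring service fees plus the (usually smaller) staff cost."""
    return years * (12 * monthly_service_fee + yearly_staff)


# Hypothetical figures over a three-year horizon.
on_prem = on_prem_tco(hardware=20_000, yearly_maintenance=3_000,
                      yearly_staff=25_000, years=3)
cloud = cloud_tco(monthly_service_fee=1_500, yearly_staff=5_000, years=3)

print(on_prem)  # 104000
print(cloud)    # 69000
```

In this made-up scenario, the service fees alone (54,000) are far higher than the hardware (20,000), yet the total picture flips once maintenance and staff are included.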

#8 Thinking short term

If you scale your resources purely based on your current situation, you may fail to take into account how your needs may change in the future. What if your business and data grow much faster than expected? What if the opposite turns out to be true? Is your application still easy to change and adapt to unknown future scenarios? And finally, will you be able to find and retain enough employees who can operate around those needs in the long run?

#9 Overprovisioning everything “just in case”

At the other extreme, if you want to be cautious, you may be tempted to overprovision everything to make sure you are ready for usage spikes. It’s a good strategy provided that you can justify the spikes based on past usage patterns. But it can be a bad strategy if you are doing it based on gut feeling.

The cloud is elastic in the sense that you can add nodes to your clusters, load balance the workload across more containers, or increase the number of vCPUs or the memory size when you see the need for it. If configured and monitored properly, there is no need to overprovision anything. I’m not saying that right-sizing is easy (far from it), but with good processes and automation in place, it’s doable and can significantly reduce costs, especially when operating numerous resources at scale.
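As an example of what such automation can look like, the sketch below applies the proportional scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of current to target utilization, rounded up. The function name, the default bounds, and the sample numbers are assumptions made for this illustration.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count proportionally to observed utilization.

    The result is rounded up and clamped to [min_replicas, max_replicas].
    """
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU with a 60% target: scale out.
print(desired_replicas(4, 90, 60))   # 6
# 4 replicas at 20% CPU with a 60% target: scale in.
print(desired_replicas(4, 20, 60))   # 2
# A large spike is capped by max_replicas.
print(desired_replicas(4, 300, 60))  # 10
```

With a rule like this in place, capacity follows the measured load in both directions, so there is less reason to keep a large safety margin running all the time.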

[Figure: Overprovisioned prod resources]

#10 Choosing the wrong datastore

Sometimes the bottlenecks are not the compute resources, but rather a poorly chosen data store. It’s good to consider:

whether you need a rich query language (SQL) or whether your application can do just fine with a simple key-value store (e.g., DynamoDB),
whether you need a database in the first place; perhaps a simple S3 data dump is enough.
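The first question often comes down to access patterns. In the rough sketch below, a plain Python dictionary stands in for a key-value store and an in-memory SQLite database stands in for a relational one; the order records are invented for illustration.

```python
import sqlite3

# Hypothetical order records keyed by order id.
ORDERS = {
    "o-1": {"customer": "ann", "total": 30},
    "o-2": {"customer": "bob", "total": 15},
    "o-3": {"customer": "ann", "total": 20},
}

# Access pattern 1: look up a single item by its key.
# A key-value store is enough for this.
print(ORDERS["o-2"])  # {'customer': 'bob', 'total': 15}

# Access pattern 2: an ad hoc aggregation across all items.
# This is where a query language such as SQL pays off.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(oid, o["customer"], o["total"]) for oid, o in ORDERS.items()],
)
totals = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)  # [('ann', 50), ('bob', 15)]
```

If your application only ever needs the first pattern, a key-value store may be the simpler and more scalable choice.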

It’s naturally use-case dependent, but databases often constitute the main bottleneck of a scalable architecture.

How to mitigate the right-sizing problem?

One way to optimize your cloud resource utilization is to leverage automation. For instance, with Dashbird, you can keep track of your under- and overprovisioned resources and get notified about them. In the well-architected lens dashboard, we can see that our ECS cluster running on EC2 instances (a non-serverless data plane) had a CPU utilization of over 90% within the last hour.

[Figure: Well-architected lens dashboard]

Then, we can drill down into specific time intervals and inspect further why this spike occurred.

[Figure: Underprovisioned ECS cluster reaching the CPU capacity limits]

At the same time, another containerized service may be overprovisioned, potentially leaving money on the table. Having this information allows you to optimize your resource configuration based on the actual usage patterns.

[Figure: Overprovisioned ECS service]
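The underlying idea of such checks can be approximated with a simple threshold rule. The sketch below is not Dashbird's actual logic; the 20% and 80% thresholds, the function name, and the service names are assumptions made for illustration.

```python
def classify_utilization(avg_cpu_pct, low=20.0, high=80.0):
    """Label a service based on its average CPU utilization in percent."""
    if avg_cpu_pct >= high:
        return "underprovisioned"
    if avg_cpu_pct <= low:
        return "overprovisioned"
    return "ok"


# Hypothetical average CPU utilization per service over the last hour.
SERVICES = {"ecs-cluster": 92.0, "thumbnail-service": 7.5, "checkout": 55.0}

report = {name: classify_utilization(cpu) for name, cpu in SERVICES.items()}
print(report)
# {'ecs-cluster': 'underprovisioned', 'thumbnail-service': 'overprovisioned', 'checkout': 'ok'}
```

In practice you would combine several metrics (memory, request latency, queue depth) and a longer observation window before resizing anything.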

Conclusion

In this article, we investigated common pitfalls when sizing your cloud resources and discussed how to avoid them to truly benefit from the cloud’s elasticity. By making use of container orchestration platforms, serverless and fully-managed solutions, and by continuously monitoring your usage patterns over time, you can optimize your architecture for performance and costs.
