A scalable system is one that can continue to perform in a reliable manner under variable and often increasing levels of load.
A system’s scalability is rarely a single-variable analysis. It usually involves at least two dimensions: a load metric and time.
- How does the database system scale when IOPS (I/O operations per second) increases from 1,000 to 10,000 over a period of one second?
- How is load time affected when website pageview requests grow from 200 to 5,000 over one minute?
What developers first need to do is express what load means for each of their systems.
Load could mean something different for each type of system. For a website, it can be visitors or pageviews per second. For a database, it could be concurrent queries, number of IO operations, or amount of data getting in and out of the database servers.
How load is described will also depend on the system architecture.
In an e-commerce website, for example, the system may scale to serve 100,000 people shopping at the same time across a thousand-item catalog. But what happens if 20% of those shoppers are looking at a single item?
This kind of concentration happens because of market trends and human behavior. Developers must account for these factors when thinking about load.
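The effect of that concentration can be estimated with simple arithmetic. The sketch below uses the figures from the example (100,000 shoppers, a 1,000-item catalog, 20% on one item). The function name `per_item_load` and the assumption that the remaining shoppers spread evenly over the other items are illustrative only.

```python
def per_item_load(shoppers, catalog_size, hot_share):
    """Return (uniform, hot, rest) shoppers per item.

    uniform: load per item if shoppers spread evenly over the catalog
    hot:     load on the single popular item
    rest:    load per remaining item, assuming an even spread (a simplification)
    """
    uniform = shoppers / catalog_size
    hot = shoppers * hot_share
    rest = (shoppers - hot) / (catalog_size - 1)
    return uniform, hot, rest


uniform, hot, rest = per_item_load(shoppers=100_000, catalog_size=1_000, hot_share=0.20)
print(f"even spread:  {uniform:.0f} shoppers per item")
print(f"hot item:     {hot:.0f} shoppers")
print(f"other items:  {rest:.1f} shoppers per item")
print(f"hot item carries {hot / uniform:.0f}x the evenly spread load")
```

Under these assumptions the popular item receives about 200 times the load it would see if shoppers were spread evenly, even though the total number of shoppers has not changed.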
The more developers strive to anticipate possible challenging load scenarios for the system, the better it will behave in reality.
It is necessary to consider:
- The load profiles and metrics mentioned above
- How much and how fast load can vary
- Which resources are needed to cope with these variations without hurting performance or reliability
Resources can scale:
- Vertically (scale-up): increasing CPU power or RAM, for example
- Horizontally (scale-out): adding more servers to a cluster, for instance
Many healthy architectures mix both approaches. Sometimes, having many small servers is cheaper than a few high-end machines, especially for highly variable loads. Large machines can lead to more over-provisioning and wasted idle resources. In other cases, a single big machine may be both faster and cheaper than a cluster.
It really depends on the case, and developers must try different approaches to find one that suits both the performance requirements and the project budget.
Using serverless systems greatly reduces the responsibility developers have for how systems cope with load. These services abstract away decisions about scaling up or out, for example, and also provide SLAs that the development team can rely on.
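One way to compare the approaches is to estimate idle capacity under a variable load. The sketch below is a rough illustration only: the hourly load figures and per-server capacities are invented, and it ignores price, startup time, and failover. It compares one large machine sized for the peak with a fleet of small servers resized every hour.

```python
import math

# Hypothetical load per hour, in requests per second (illustrative values).
HOURLY_LOAD = [200, 150, 100, 100, 300, 800, 2000, 5000, 3000, 1200, 600, 300]

SMALL_CAPACITY = 500    # assumed req/s handled by one small server
LARGE_CAPACITY = 5000   # assumed req/s handled by one large server


def servers_needed(load, capacity):
    """Smallest number of servers (at least one) that covers the load."""
    return max(1, math.ceil(load / capacity))


def idle_capacity(loads, capacity, fixed_count=None):
    """Sum of unused req/s capacity over all hours.

    With fixed_count, the fleet size never changes (sized once, e.g. for peak).
    Without it, the fleet is resized every hour to fit the load.
    """
    idle = 0
    for load in loads:
        count = fixed_count if fixed_count is not None else servers_needed(load, capacity)
        idle += count * capacity - load
    return idle


peak = max(HOURLY_LOAD)
fixed_large = idle_capacity(HOURLY_LOAD, LARGE_CAPACITY,
                            fixed_count=servers_needed(peak, LARGE_CAPACITY))
elastic_small = idle_capacity(HOURLY_LOAD, SMALL_CAPACITY)

print("idle capacity, one large machine:", fixed_large)
print("idle capacity, resized small fleet:", elastic_small)
```

With these made-up numbers the fixed large machine leaves far more capacity unused. A flatter load profile or a different price per unit of capacity could reverse the conclusion, which is why each case has to be measured.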
Load Metrics and Statistics
Metrics need some sort of aggregation or statistical representation. The average (arithmetic mean) is usually a bad way to represent metrics because it can be misleading. It doesn’t tell how many users actually experienced that level of performance. In reality, no user might have experienced it at all.
Consider the following application load and user base:
The average response time would be 180 ms. But no user actually experienced that response time. In fact, 75% of the users experienced performance that is worse than average. The arithmetic mean is highly sensitive to outliers, and the distribution above contains them.
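The same effect can be reproduced in a few lines. The four response times below are assumed values, chosen only so that they match the figures quoted above (a 180 ms mean, with 75% of requests slower than it). They are not the original data set.

```python
from statistics import mean

# Hypothetical sample in milliseconds, picked to match the quoted figures.
response_times_ms = [30, 200, 240, 250]

average = mean(response_times_ms)

# Requests that were slower than the average.
slower = [t for t in response_times_ms if t > average]
share_slower = len(slower) / len(response_times_ms)

# Did any request actually take exactly the average time?
anyone_saw_average = average in response_times_ms

print(f"mean: {average} ms")
print(f"share of requests slower than the mean: {share_slower:.0%}")
print(f"any request at exactly the mean: {anyone_saw_average}")
```

A single fast outlier (30 ms here) pulls the mean down to a value that none of the requests actually experienced.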
Percentiles describe the distribution more faithfully than the mean. The most common ones are the 95th, 99th, and 99.9th (often referred to as p95, p99, and p999).
A p95 value is the threshold at or below which at least 95% of the response times fall. In the example above, the p95 would be 250 ms. Since we have only a handful of request samples, the same threshold applies to all three of these percentiles. If we were to compute a p75, it would be 240 ms, because 3 out of 4 requests were answered within 240 milliseconds.
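A minimal sketch of that definition, using the nearest-rank method: sort the samples and take the smallest value that at least p% of them do not exceed. The helper name `percentile_nearest_rank` and the four sample values are assumptions made for illustration. The sample was chosen to agree with the thresholds quoted above and is not the original data.

```python
import math


def percentile_nearest_rank(samples, p):
    """Smallest sample value such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank: 1-based position of the threshold in the sorted list.
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
response_times_ms = [30, 200, 240, 250]

for p in (75, 95, 99, 99.9):
    print(f"p{p}: {percentile_nearest_rank(response_times_ms, p)} ms")
```

Other percentile definitions, such as those that interpolate between neighboring samples, can return slightly different values on small samples. With larger samples the differences tend to shrink.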