

A Guide to Cost Optimisation in AWS

by Dr. Manzoor Mohammed




The global public cloud market will hit $178 billion this year, up from $146 billion in 2017 (Forrester Research), and public cloud adoption in the enterprise is expected to exceed 50 percent for the first time this year. This whitepaper refers specifically to AWS, as it is by some way the leading public cloud provider; however, the principles and approaches apply equally to all public cloud solutions.

Innovating at breakneck speed and driving the pace of change through multiple major new releases each year, AWS have a range of products and services spanning not only infrastructure and storage, but also databases, container and serverless technologies, AI/ML, IoT and more. This breadth, depth and pace of offering brings enormous benefit to enterprise organisations in driving forward transformation, but also significant challenges.

Adopters of AWS often need to change ‘hard-wired’ internal processes (forcing organisational change), both in how software is designed, built and run and in how infrastructure is provisioned, managed and paid for. They must adopt different skills and ways of thinking across the organisation (requiring new resource and staff training), and adapt to new procurement and commercial models (i.e. pay as you go, OPEX not CAPEX).

This whitepaper deals specifically with the challenges organisations face in managing and optimising AWS costs. Our aim in writing this is to share best practices for the most efficient and cost-effective ways to run AWS. The principles and methodologies we outline are intended to reinforce and complement AWS’s own best practice guidance, detailed in their Well-Architected Framework.

Key Takeaways

In this whitepaper we will:

  • Record the experiences and insights gained from optimising cloud costs with our clients
  • Provide a set of recommendations and best-practices for cloud cost optimisation

The following are required in order to take advantage of the flexibility of the cloud to enable cost savings:

  • Measurement and understanding of the efficiency of your systems, and knowing what good looks like. N.B. efficiency is more than just rightsizing, autoscaling or other cloud technologies
  • A process to deliver ongoing rightsizing into live without service risk
  • The ability to remove technical constraints to enable rightsizing
  • Ongoing validation of live performance to look for early warning signs of risk by looking beyond just response times and throughput
  • A deep understanding of workloads and their inter-dependencies in complex eco-systems




Drivers of Overspending in AWS

There are four common contributors to overspending in AWS:

  • Oversizing
  • Software Inefficiency
  • Application Inelasticity
  • Sub-Optimal Architecture




Oversizing

This is the most common reason for overspend and the simplest to solve. There are multiple factors that contribute to it, often rooted in the design process:

  • Capacity added even when there is sufficient headroom
  • Inaccurate sizing due to weak performance testing methodologies
  • Inaccurate demand forecasts
  • Excess capacity put in place to compensate for software bottlenecks

The last of these is a particular problem if the ‘temporary fix’ becomes a permanent solution.

Software Inefficiency


Software efficiency is one of Capacitas’ 7 Pillars of Performance. For transient resources, efficiency is defined as the amount of compute resource required per transaction and is a critical lever in the control of your cloud costs. This is particularly pertinent for high-volume systems. There is a similar calculation for persistent resources, e.g. storage.
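As a minimal sketch of the efficiency measure described above, CPU-seconds per transaction can be derived from utilisation, instance size and request volume. All figures here are invented for illustration:

```python
# Sketch: compute-per-transaction efficiency for a transient resource.
# The inputs would come from your monitoring tooling; values are illustrative.

def cpu_seconds_per_transaction(avg_cpu_util: float,
                                vcpus: int,
                                interval_seconds: float,
                                transactions: int) -> float:
    """CPU-seconds consumed per transaction over a measurement interval."""
    cpu_seconds = avg_cpu_util * vcpus * interval_seconds
    return cpu_seconds / transactions

# Example: 4 vCPUs at 35% average utilisation over one hour, 90,000 requests.
eff = cpu_seconds_per_transaction(0.35, 4, 3600, 90_000)
print(f"{eff:.4f} CPU-seconds per transaction")
```

Tracking this figure per release makes efficiency regressions visible before they become cost regressions.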

Where does software inefficiency stem from?

  • Not measuring efficiency
  • Not knowing what ‘good’ efficiency looks like
  • Not having efficiency targets (non-functional requirements)
  • Other priorities on developers’ time

How important is this? In one client engagement, we identified (and worked with their developers to implement) a series of software optimisations which reduced their IT service opex costs from $3.3M to $0.3M per year.

Application Inelasticity

AWS autoscaling is one of many effective ways to control cloud cost by adjusting capacity to meet changing demand.

However, applications which are inefficient and/or require long warm-up times do not autoscale quickly enough and typically cannot scale to use all the available capacity. Components which tend to be inelastic include databases, caches and inefficient applications. This leads organisations to carry extra capacity headroom because they are not confident that their applications can scale up quickly enough to meet demand.

In one example, a customer had an embedded practice to autoscale their systems at 50% CPU utilization. For one particular application, this resulted in $1M per year in unnecessary spend.
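The cost of a scaling target that is set too low can be estimated directly. The sketch below uses hypothetical instance prices and demand figures, not the client's actual numbers:

```python
import math

# Sketch: annual cost of an autoscaling target set lower than necessary.
# Prices and demand are hypothetical illustrations.

def instances_needed(peak_cpu_demand: float, target_util: float) -> int:
    """Instances required so CPU stays at or below the scaling target."""
    return math.ceil(peak_cpu_demand / target_util)

def annual_cost(instances: int, hourly_price: float) -> float:
    return instances * hourly_price * 24 * 365

demand = 20.0   # peak demand expressed in 'whole instances' of CPU
price = 0.40    # hypothetical on-demand $/hour

cost_at_50 = annual_cost(instances_needed(demand, 0.50), price)
cost_at_70 = annual_cost(instances_needed(demand, 0.70), price)
print(f"Extra spend at a 50% target: ${cost_at_50 - cost_at_70:,.0f}/year")
```

The appropriate target depends on how fast the application actually scales, which is why inelasticity forces targets down and costs up.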


Sub-Optimal Architecture

Choosing a sub-optimal architecture for your workload will lead to higher cloud costs. We’ve seen this occur when teams are under time pressure to simply ‘lift and shift’ to the cloud. Typical examples include:

  • Large amounts of unnecessary storage being ported over to the cloud
  • The use of on-demand instances for non-time critical jobs, e.g. batch, which could be run on cheaper compute such as spot instances
  • Large workloads moved to expensive premium or managed cloud services, such as DynamoDB or Cassandra, where the additional performance provided by these solutions is not essential for the business or system requirement
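The on-demand versus spot trade-off for batch work can be sketched with a simple cost comparison. The prices below are hypothetical; real spot prices fluctuate and should be read from the AWS pricing APIs:

```python
# Sketch: savings from moving a non-time-critical batch workload from
# on-demand to spot instances. All prices are illustrative placeholders.

def monthly_compute_cost(hours_per_day: float, instances: int,
                         hourly_price: float, days: int = 30) -> float:
    return hours_per_day * instances * hourly_price * days

on_demand = monthly_compute_cost(6, 10, 0.40)   # nightly 6-hour batch, 10 nodes
spot = monthly_compute_cost(6, 10, 0.12)        # spot discounts of ~70% are common

print(f"On-demand: ${on_demand:,.2f}/month, spot: ${spot:,.2f}/month")
```

Spot interruptions make this unsuitable for time-critical work, which is exactly why the distinction in the bullet above matters.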

How Can We Optimise AWS Costs?

Our high-level process for achieving cost optimisation is shown in Figure 2. The solution below focuses on addressing cost inefficiency challenges 1–4, which account for over 70% of the cost optimisation opportunities in the cloud.


Identify Over-Supply & Software Inefficiency

The first step is a diagnostic to identify two symptoms of high cost: over-supply and software inefficiency.

Over-supply is when the capacity provisioned exceeds demand over the IT service’s demand cycle. The concept of over-supply may be applied to any AWS component.
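A first-pass over-supply check can be expressed as: does utilisation stay below a safe ceiling even at the peak of the demand cycle? The ceiling here is an assumption to be tuned per workload:

```python
# Sketch: flagging over-supply from a utilisation series covering one full
# demand cycle (e.g. daily CPU peaks over a week). The 'safe ceiling' is an
# assumed threshold, not a universal constant.

def is_over_supplied(utilisation: list[float], safe_ceiling: float = 0.6) -> bool:
    """True if even the peak of the demand cycle stays below the ceiling."""
    return max(utilisation) < safe_ceiling

week_of_cpu_peaks = [0.12, 0.18, 0.35, 0.41, 0.29, 0.15, 0.10]
print(is_over_supplied(week_of_cpu_peaks))  # peak 0.41 is below 0.6
```

Covering the whole demand cycle matters: a system sampled only off-peak will always look over-supplied.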

For example, in the case of EC2 instances we would employ standard measures of CPU, memory and disk. For serverless components, such as Lambda, we would use time as a measure of resource consumption; this isn’t a precise measure of efficiency but can be a useful proxy.
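For Lambda, billed duration multiplied by configured memory gives the GB-second figure that drives cost, which is why time works as a proxy. The per-GB-second price below should be checked against the current AWS Lambda pricing page:

```python
# Sketch: using billed duration as a cost/efficiency proxy for Lambda.
# Verify the price against current AWS Lambda pricing before relying on it.

PRICE_PER_GB_SECOND = 0.0000166667

def lambda_cost(invocations: int, avg_billed_ms: float, memory_mb: int) -> float:
    gb_seconds = invocations * (avg_billed_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * PRICE_PER_GB_SECOND

# 5M invocations/month, 120 ms average billed duration, 512 MB configured.
print(f"${lambda_cost(5_000_000, 120, 512):.2f}/month")  # roughly $5 at these figures
```

Note this excludes the per-request charge and is a proxy only: a function can be cheap yet still inefficient per unit of useful work.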

Once over-supply is identified, we need to qualify that capacity can be reduced safely. When capacity is reduced, the performance of the application should not degrade and there must be no service incidents. To make this assessment, we model the expected performance post-downsizing, using our Seven Pillars of Software Performance (below).



This enables us to quantify the performance risk associated with reducing supply capacity and thus prioritise which systems should be addressed.

Software efficiency is defined as the amount of compute resource required to process an application request or transaction, where compute resource includes processor, memory, disk space or I/O (network and disk). How do we decide whether software is efficient or inefficient? The business function of the software will determine its compute requirements. For example, we would expect e-commerce software to have a lower processing footprint (per request) than encryption software.

A big-data analytics platform will have a larger memory footprint (per request) than a document management service, and so on. As enterprise cloud environments typically have tens or hundreds of thousands of servers and components, we use automated software to harvest the following data:

  • Cloud configuration
  • Supply and utilisation data
  • Demand data
  • Cost data

Capacitas uses a library of software efficiency benchmarks, built up over hundreds of customer engagements and dimensioned by software type, to assess whether the measured compute cost is efficient or not. This is likely to be more difficult for non-specialists whose experience of measuring efficiency is limited to a handful of systems.

To get around this, your organisation needs to build its own library of efficiency benchmarks; over a number of years it will become clearer what good looks like (Figure 3). Just remember to keep the older benchmarks up to date as technology changes.
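Such a benchmark library can start very simply: expected resource cost per request, keyed by software type. The figures below are invented placeholders; the point is the comparison, not the numbers:

```python
# Sketch: a minimal in-house efficiency benchmark library, keyed by
# software type. Benchmark values are hypothetical placeholders.

BENCHMARK_CPU_MS_PER_REQ = {
    "e-commerce": 25.0,
    "encryption": 180.0,
    "document-management": 40.0,
}

def efficiency_verdict(software_type: str, measured_cpu_ms: float,
                       tolerance: float = 1.5) -> str:
    """Compare measured cost per request against the benchmark for its type."""
    benchmark = BENCHMARK_CPU_MS_PER_REQ[software_type]
    return "inefficient" if measured_cpu_ms > benchmark * tolerance else "efficient"

print(efficiency_verdict("e-commerce", 90.0))  # well above the benchmark
```

The tolerance factor acknowledges that benchmarks are a guide, not a hard limit; it should tighten as the library matures.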

Cloud cost monitoring tools such as Cloudability and Cloudyn provide great information on where there is over-supply in the estate. However, they cannot identify software inefficiency as a driver of cloud capacity consumption.

As an output from this phase we will have a list of the candidate systems which could be downsized and the potential cost reduction opportunity.








Identify Optimisations and Associated Risk & Cost

The next stage is to quantify how much optimisation we can realistically achieve, given the constraints. What should our $ cost optimisation target be?

The goals of this phase are:

  1. Define what right-sizing is required
  2. Define architectural optimisations
  3. Define what software efficiency improvements are required
  4. Define what configuration change is required to increase application elasticity
  5. Quantify the performance risk of changes [1-4]
  6. Quantify the $ cost optimisation that changes [1-4] will achieve

Once we have quantified the performance risk of the changes, we can build a picture of what changes can be realistically delivered, without impacting the operation and reliability of production services (Figure 5).
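One way to turn quantified risk and saving into a deliverable plan is to filter out high-risk changes and rank the rest by saving. The systems, savings and risk scores below are hypothetical:

```python
# Sketch: ranking candidate optimisations by expected saving, after
# filtering out those whose quantified performance risk is too high.
# All systems, savings and risk scores are hypothetical.

candidates = [
    {"system": "checkout", "annual_saving": 250_000, "risk": 0.8},
    {"system": "reporting", "annual_saving": 120_000, "risk": 0.2},
    {"system": "search", "annual_saving": 400_000, "risk": 0.4},
]

def plan(candidates, max_risk=0.5):
    """Keep changes within risk appetite, largest saving first."""
    deliverable = [c for c in candidates if c["risk"] <= max_risk]
    return sorted(deliverable, key=lambda c: c["annual_saving"], reverse=True)

for c in plan(candidates):
    print(c["system"], c["annual_saving"])
```

High-risk, high-saving items (like the hypothetical checkout system) are not discarded; they become candidates for the technical-constraint removal phase described later.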


Cloud cost monitoring tools provide great information about over-supply in the estate. However, the key question remains: can I safely reduce capacity without impacting the performance of the application? Unfortunately, these tools cannot inform this decision.

Plan Optimisation

In this phase we produce the detailed low-level designs for each type of optimisation:

  • Right-sizing
  • Architectural optimisations
  • Software efficiency improvements
  • Changes to increase application elasticity

Prove Optimisation in Test

Where optimisations carry appreciable performance risk, we should first check performance in a test environment before pushing to production.

We use the 7 Pillars of Performance to design and target performance tests in the right area. For example, downsizing EC2 capacity may present a risk to the throughput and response time of an application when under peak load.

Conversely changing the memory footprint of a database service may present a stability risk.

To scale this capability and avoid false-positive test results, it is critical to use automated performance test analysis. This will speed the route to optimisation and minimise the risk of service-impacting incidents when the optimisations go into live.
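Automated test analysis amounts to checking results against explicit pass/fail criteria rather than eyeballing graphs. The thresholds below are example values, not recommendations:

```python
# Sketch: automated pass/fail analysis of a performance test run against
# explicit criteria. Thresholds are example values to set per system.

def analyse_test(p95_response_ms: float, error_rate: float,
                 max_p95_ms: float = 500.0,
                 max_error_rate: float = 0.01) -> list[str]:
    """Return the list of failed criteria (an empty list means pass)."""
    failures = []
    if p95_response_ms > max_p95_ms:
        failures.append(f"p95 {p95_response_ms}ms exceeds {max_p95_ms}ms")
    if error_rate > max_error_rate:
        failures.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    return failures

print(analyse_test(620.0, 0.002))  # one failure: p95 too high
```

Encoding the criteria this way also documents them, which helps when the same checks are reused for production validation later.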


Implement Optimisation in Live

This is done in close partnership with the DevOps teams responsible for the system. The DevOps engineer typically implements the rightsizing in conjunction with a capacity and performance expert. The DevOps engineer knows the system; the capacity and performance expert knows what good looks like across a wider range of metrics than the DevOps engineer would typically monitor. This combination enables the optimisations to be delivered into live safely.

The important thing to remember is that these optimisations can be backed out almost immediately in a cloud environment. Any early warning signs allow the DevOps and capacity engineers to fine-tune the level of optimisation they implement in live.


Validate Performance in Live

When the optimisation goes live, it is important to ensure the change has been delivered successfully over the wider demand cycle. Success criteria will include:

  • Has the change delivered the expected cost-optimisation?
  • Has the change resulted in the expected performance behaviour – either modelled or from performance test results?

It is important to note that the last point relates to both the system itself and the upstream and downstream systems it is integrated with.

A key tenet of performance engineering is that capacity changes to a system can have adverse performance impact on coupled systems.

Production Validation is the process of measuring and reporting against these success criteria. Production Validation will take as an input multiple data sources, including:

  • Capacity monitoring (AWS CloudWatch, etc.)
  • Cost monitoring (Cloudability, etc.)
  • Application Performance Management tools (New Relic, AppDynamics, etc.)
  • Application integration design
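Production Validation can be sketched as a comparison of observed cost and performance against expectations, with tolerances. The inputs would come from the tools listed above; the numbers here are illustrative:

```python
# Sketch: production validation after a rightsizing change, comparing
# observed cost and performance against expectations. Inputs would come
# from cost monitoring and APM tools; the figures are illustrative.

def validate(expected_monthly_cost: float, observed_monthly_cost: float,
             expected_p95_ms: float, observed_p95_ms: float,
             cost_tolerance: float = 0.10,
             perf_tolerance: float = 0.10) -> dict:
    """Check both success criteria; either failing triggers investigation."""
    cost_ok = observed_monthly_cost <= expected_monthly_cost * (1 + cost_tolerance)
    perf_ok = observed_p95_ms <= expected_p95_ms * (1 + perf_tolerance)
    return {"cost_ok": cost_ok, "performance_ok": perf_ok}

print(validate(8_000, 8_300, 450, 520))  # cost within tolerance, p95 is not
```

Running this over the full demand cycle, not just the first quiet day, is what catches the coupled-system effects described above.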


Remove Technical Constraints

What do we do if the change is unsuccessful and adversely impacts the performance of the system? Most likely the excess capacity provisioned was masking a fundamental bottleneck in the application code. We term these generically ‘technical constraints’.

In this phase we define the technical constraints through a problem definition statement. Targeted performance testing may be required to define the problem.

Next we define a risk mitigation plan to fix the constraints and plan what investment is required to redesign the code. As we know what the cost-reduction opportunity is for each change, we can build a business case for implementing the mitigation plan.



Next Steps

If you find this whitepaper relevant and interesting, you might also find these resources helpful:

How MetaPack Gained Control of their Cloud Costs
Using Auto Scaling to Control AWS Costs

The Seven Pillars of Performance


Capacitas Blog

Dr. Manzoor Mohammed


About the Author

Dr. Manzoor Mohammed has worked in the area of capacity and performance management for over 20 years.  He started his career as a performance engineer at BT Research Laboratories. He co-founded Capacitas, a consulting company which reduces cost and risk in business-critical IT systems through capacity and performance management. 

He has worked on numerous large, complex projects for customers such as BT Global Services, HP, Skype/Microsoft, easyJet and Nokia.

Many of these engagements have resulted in $multi-million savings in datacentre and cloud platform costs, as well as better performing and more stable systems.

Dr. Mohammed leads the R&D function at Capacitas, having developed a ‘shapes-based’ methodology for the automated analysis of performance and capacity issues, which forms the basis of a suite of data analytics tooling.

Bring us Your IT Challenges

If you want to see big boosts to performance, with risk managed and costs controlled, then talk to us now to see how our expertise gets you the most from your IT.

Book a Consultation