Estimated read time: 4 minutes
Author: Dr. Manzoor Mohammed
After following the steps in our previous post, Getting Engineers on Board with Cloud Cost Optimisation (Part 1/3), you'll have built plenty of goodwill with the engineering teams, and they'll now be ready to talk about costs.
This approach will give you the big engineering wins that save you money, make your platform more scalable, improve stability and build a high-performing team. You will look like an even bigger hero to the board and the teams, because you will have achieved what seemed impossible.
These 3 changes will help you achieve this:
- Recognise that tools alone won't solve stability and cost
- Predict what's going to happen tomorrow beyond cost
- Be ambitious, and plan for unit costs to drop by 20% per year!
1. Recognise that tools alone won't solve stability and cost
Rightsizing recommendations from ML-driven tools (e.g. CloudWatch, Opsani) are often wrong. They assume the capacity and performance data is genuine, but it is often inflated by hidden factors. This drives up cost, as you add extra capacity to prevent service incidents, and it makes service stability worse, because the underlying issues aren't fixed.
You need to do something different. Look for CPU/memory usage that is erratic or doesn't follow the profile of business drivers, e.g. active agents. I look at patterns rather than raw numbers to find these hidden factors. Removing them will make the service more stable, and engineers will breathe a sigh of relief. Importantly, you will then get the full savings and benefits of the tools (if you've gone with them).
In our engagement with a leading telecoms business, the search system had CPU utilisation spikes that prevented rightsizing. A deep dive found the spikes were driven by a downstream retry mechanism rather than useful work. Fixing this improved the scalability of the search system and stopped it returning wrong results to users.
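The standard fix for that kind of retry storm is capped exponential backoff with jitter, so retries stop synchronising into CPU spikes. A hedged sketch (function name and parameters are illustrative, not from the engagement described above):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry an operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            # Full jitter: pick a random delay in [0, min(cap, base * 2^n)]
            # so many clients don't retry in lock-step and spike the backend.
            delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The key design choice is the jitter: without it, synchronised retries from many clients arrive in waves, which looks exactly like the "CPU spikes without useful work" pattern described above.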
2. Predict what's going to happen tomorrow beyond cost
The engineers will love you if you can tell them what their systems will look like tomorrow and what needs to be done to make sure they scale safely. At the same time, start talking about systems from a capacity, performance and cost angle simultaneously.
Most teams build cost models only, not combined cost, capacity and performance models. This new type of model is driven by business drivers, e.g. agents. It will tell you how many CPUs, VMs, disk IOPS, etc. you need for different business scenarios, agent types and so on.
More importantly, it will tell you whether your spend is on track with expected business demand and, if it isn't, where it's deviating from normal.
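A toy version of such a model might look like the following. The capacities and prices are invented assumptions for the sketch; a real model would be calibrated from measured performance data:

```python
import math

# Illustrative model constants - all assumed, not measured.
AGENTS_PER_VM = 250         # agents one VM can serve at target latency
VM_HOURLY_COST = 0.40       # $ per VM-hour (assumed on-demand price)
FIXED_MONTHLY_COST = 2_000  # shared services that don't scale with agents

def expected_monthly_cost(active_agents: int, hours: int = 730) -> float:
    """Translate a business driver (agents) into expected monthly spend."""
    vms = math.ceil(active_agents / AGENTS_PER_VM)
    return FIXED_MONTHLY_COST + vms * VM_HOURLY_COST * hours

# Compare the model's expectation with the actual bill to spot deviations.
actual_bill = 6_500
expected = expected_monthly_cost(active_agents=3_000)
deviation = (actual_bill - expected) / expected
print(f"expected ${expected:,.0f}, actual ${actual_bill:,.0f}, "
      f"deviation {deviation:+.0%}")
```

A sustained positive deviation is the prompt to go looking on the ground, since the gap between modelled and actual spend is where the hidden savings live.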
We recently built a cost, capacity and performance model for a leading European SaaS company. In one month, the model identified $400k of savings by flagging discrepancies between what spend should have been and what was happening on the ground.
3. Be ambitious, and plan for unit costs to drop by 20% per year!
What is your cloud unit cost? It is the cost of supporting a single business processing event. This is a great metric to report to the board; companies such as Lyft do exactly that. To get an accurate unit cost, separate fixed from variable costs. This gives a level of insight into team and platform efficiency that wasn't available before.
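For illustration, with invented figures (your "event" could be orders, rides, searches, whatever your platform processes), the calculation looks like this:

```python
# Illustrative unit-cost calculation with fixed and variable spend separated.
monthly_fixed_cost = 10_000      # e.g. control plane, monitoring, reserved base
monthly_variable_cost = 40_000   # capacity that scales with traffic
events_processed = 2_000_000

# Variable unit cost tracks platform efficiency; the blended figure is
# what the board sees per event.
variable_unit_cost = monthly_variable_cost / events_processed
blended_unit_cost = (monthly_fixed_cost + monthly_variable_cost) / events_processed

# An ambitious target: unit cost falling 20% year on year.
target_next_year = blended_unit_cost * 0.8

print(f"variable unit cost: {variable_unit_cost * 100:.1f} cents/event")
print(f"blended unit cost:  {blended_unit_cost * 100:.1f} cents/event")
print(f"next year's target: {target_next_year * 100:.1f} cents/event")
```

Separating the two matters because a growing business dilutes fixed costs automatically; only the variable unit cost tells you whether the platform itself is getting more efficient.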
As you scale, an optimised cloud architecture will see the unit cost fall. In addition, cloud capacity is getting roughly 20% faster and cheaper per year. The more you leverage these two levers (scaling efficiency and the falling cost of cloud capacity), the more value you can add each year.
At a leading Genomics business, we used this reporting metric to help teams improve their cost base and efficiency.
In summary, this approach allows a business of any size to keep growing while reducing costs and improving service stability.