Author: Dr. Manzoor Mohammed
Estimated read time: 3 minutes
Teams are always looking to get better visibility into their systems. The first step on this journey is to invest in APM tools and/or build in-house monitoring tools to gather data. But we often find that, despite significant investment, CIOs and their teams feel they didn’t get the visibility they expected.
How should they fix this? We recommend this simple three-step plan:
- Focus on key metrics
- Move away from thresholds
- Don't explain away
1. Focus on key metrics
One major telco organisation had 250,000 dashboards, yet service incidents continued and costs kept rising. The teams were drowning in data and alerts, making it difficult to see what the problems were and to really understand the system.
The answer is to focus on a small set of key metrics. For a start, many of your metrics will simply be telling you the same story. Why look at hundreds of metrics when one can tell you everything you need to know?
Other metrics may be useful for deep dive root cause analysis, but you don’t need them in order to understand the problem.
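One way to see which metrics are telling the same story is to check how strongly they correlate. The sketch below uses synthetic data with hypothetical metric names; in practice you would feed in your own time series sampled at the same interval.

```python
import numpy as np

# Hypothetical metric series (illustrative synthetic data only).
rng = np.random.default_rng(0)
base = rng.normal(100, 10, 500)  # e.g. requests per second
metrics = {
    "requests_per_sec": base,
    "cpu_pct": 0.5 * base + rng.normal(0, 1, 500),  # tracks load closely
    "disk_free_gb": rng.normal(200, 5, 500),        # unrelated
}

# Flag pairs of metrics that are highly correlated -- they tell the same
# story, so only one of each pair needs to live on the main dashboard.
names = list(metrics)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = np.corrcoef(metrics[names[i]], metrics[names[j]])[0, 1]
        if abs(r) > 0.9:
            print(f"{names[i]} and {names[j]} look redundant (r={r:.2f})")
```

Here `cpu_pct` is flagged as redundant with `requests_per_sec`, while `disk_free_gb` is not; the redundant metric can be kept for deep dives but dropped from the key set.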
2. Move away from thresholds
Thresholds are reactive: they don’t give enough early warning before incidents happen. To counteract this, ops teams lower the thresholds to buy more time to react. The challenge is that the number of alerts then rises, leading to alert overload, as it becomes impossible to tell which alerts really matter.
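The trade-off is easy to demonstrate. In this sketch (illustrative numbers only), latency drifts slowly upwards before an incident; a high threshold fires too late, and lowering it to gain warning time multiplies the alert count:

```python
import random

random.seed(1)
# Hypothetical latency samples in ms: normal noise plus a slow upward
# drift that precedes an incident (illustrative data only).
latencies = [random.gauss(100, 10) + 0.05 * t for t in range(1000)]

def alert_count(threshold_ms):
    """How many samples would have fired a simple threshold alert."""
    return sum(1 for x in latencies if x > threshold_ms)

print(alert_count(160))  # high threshold: fires only close to the incident
print(alert_count(120))  # lowered threshold: fires constantly -- overload
```

Neither setting is satisfactory: the high threshold gives no early warning, while the low one buries the signal in noise.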
Analytic techniques that look for changes from normal behaviour are increasingly available, and many have evolved into AI/ML features; AWS and New Relic, among others, provide them. They are a huge improvement over threshold alerting, but they still suffer from too many alerts, given the sheer number of metrics and the difficulty of tuning alert sensitivity.
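The underlying idea can be shown with a toy stand-in: compare each point against a rolling baseline and flag large deviations. This is a minimal sketch, not how any particular vendor implements it, and the data is synthetic.

```python
import random
import statistics

def zscore_alerts(series, window=50, z=4.0):
    """Flag points that deviate strongly from the recent rolling baseline --
    a toy stand-in for change-from-normal detection."""
    alerts = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.fmean(recent)
        stdev = statistics.pstdev(recent) or 1e-9
        if abs(series[i] - mean) / stdev > z:
            alerts.append(i)
    return alerts

random.seed(2)
series = [random.gauss(100, 2) for _ in range(150)]
series[120] += 25  # inject a sudden deviation from normal behaviour
print(zscore_alerts(series))  # flags the injected deviation
```

Even in this toy version the tuning problem is visible: the `window` and `z` parameters directly trade sensitivity against alert volume, which is exactly where real products struggle at scale.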
3. Don’t explain away
Even with a focus on key metrics and advanced analytic techniques, early warning signs can still be ignored or missed. While working with a major SaaS provider, we saw analytics showing that response times were increasing in variance over time.
The engineers saw the change as only a few milliseconds and put it down to shifts in user behaviour or workload. This seemed to explain the problem, and it was quickly forgotten. Several months later the service suffered an outage and prolonged degradation that lasted nearly a week. The early warning signs had been missed.
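This kind of drift is easy to surface before it is explained away: track the spread of the metric over successive windows, not just its average. A minimal sketch with synthetic response times (the numbers are illustrative):

```python
import random
import statistics

random.seed(3)
# Hypothetical response times (ms): the mean stays flat, but the spread
# slowly widens -- exactly the kind of change that gets explained away.
resp = [random.gauss(200, 5 + 0.02 * t) for t in range(1000)]

def rolling_stdev(series, window=100):
    """Standard deviation of each successive non-overlapping window."""
    return [statistics.pstdev(series[i:i + window])
            for i in range(0, len(series) - window, window)]

spreads = rolling_stdev(resp)
print(spreads[0], spreads[-1])  # the variance trend is the early warning
```

The average of `resp` barely moves, so a mean-based dashboard looks healthy; the widening spread is only visible if you plot it explicitly.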
Observability means interpreting what is happening under the covers. If the behaviour isn’t as expected, don’t jump to conclusions too quickly. Instead, use simple models to test your theory (see next).
Predict – what will happen next?
By this stage you are collecting the right metrics, have moved away from simplistic thresholds, and are no longer explaining away anomalies. But you can’t investigate every anomaly. This is where smart modelling techniques come in: predict what will happen next and assess how serious the anomaly is.
Even a simple M/M/N response-time chart, compared against your observed response-time metrics, can be incredibly powerful in assessing the risk an anomaly carries. If behaviour has deviated far from normal, you can start a deeper dive using more metrics and traces from your APM/monitoring tools.
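The M/M/N comparison can be sketched with the standard Erlang C result for the expected response time of an M/M/N queue. The worker counts and rates below are hypothetical; the point is to get a model baseline to hold your observed latencies against.

```python
from math import factorial

def mmn_response_time(arrival_rate, service_rate, servers):
    """Expected response time of an M/M/N queue (Erlang C formula).
    arrival_rate and service_rate are in requests per second."""
    a = arrival_rate / service_rate  # offered load in Erlangs
    rho = a / servers                # utilisation, must be < 1
    assert rho < 1, "queue is unstable at this load"
    summation = sum(a**k / factorial(k) for k in range(servers))
    p_wait = (a**servers / factorial(servers)) / (
        (1 - rho) * summation + a**servers / factorial(servers))
    queueing = p_wait / (servers * service_rate - arrival_rate)
    return 1 / service_rate + queueing  # service time + queueing time

# Hypothetical: 8 workers, each serving 50 req/s, offered 300 req/s.
baseline_s = mmn_response_time(300, 50, 8)
```

If your observed response times sit far above the model’s prediction for the current load, the anomaly is worth the deeper dive; if they match, the slowdown is simply what queueing theory says that utilisation costs.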
After the deep dive you can start asking insightful questions about how to improve both your systems and your teams’ working practices. You’re now using data to give early warning signs, prevent incidents, and help improve the organisation. You can now see the full picture and have true visibility.