A Guide to Agile Performance

How to Move Fast and Not Break Things

by Dr. Manzoor Mohammed & Thomas Barns

Introduction

This whitepaper is designed for roles at all levels across enterprise IT involved in major change programmes (including digital transformation, re-platforming, cloud migration, datacentre migration) and following, or planning to adopt, agile and devops delivery models.

We cover the core principles and best practice approaches for ensuring good performance, whilst increasing the velocity of delivery.

 

Key Takeaways

Performance is not simply about response times and throughput; that is too simplistic a measure. An all-embracing approach is required, and Capacitas’s 7 Pillars of Performance provide a comprehensive way of measuring performance.

In Agile & Continuous Cycles, there is simply not enough time to test every change. A risk-based approach, using techniques such as Risk Modelling, is required.

Shift Left & Continuous Collaboration. Performance engineers and analysts need to shift from testing at the end to being involved throughout the lifecycle, collaborating with developers to build a better understanding of the software and to refine the conceptual model of the platform.

Smart Design. Smart test designs are needed to expose risks at lower loads and in narrow test windows.

Automation of Testing & Analysis. Automated analysis needs to address not just response times and throughput but all 7 Pillars of Performance.

The "Old World"

Traditionally, software was delivered using the waterfall model, a sequential (non-iterative) design process.

Projects which followed the waterfall methodology had a high risk of slippage, and a risk that the end product wasn’t what the business actually wanted by the time the project was completed. This was because either the requirements did not capture the business needs comprehensively enough, or the business requirements had changed by the time the end product was delivered.

The performance of the software delivered by these projects was typically ensured with long periods of testing near the end of the project lifecycle. These performance tests were often long and complicated, e.g. 24-hour soak tests. They were typically carried out by performance testers who were very much focused on checking that the software met formal Non-Functional Requirements (NFRs).


Businesses Need IT to Deliver Software Faster

In digital markets, competition is intense in every sector.

For example, in the airline sector, easyJet, Ryanair and BA are in competition with each other to introduce new products to increase revenue and attract new customers. In 2007, easyJet launched Speedy Boarding before Ryanair. A few years later they achieved a similar coup with Allocated Seating. This gave them a competitive advantage over Ryanair in increasing the revenue per passenger and also improving customer experience.

In the retail sector, businesses are facing direct competition from Amazon because of its ability to provide an excellent digital experience. Retailers need their IT departments to deliver more functionality that can match the Amazon digital experience, and to do so within short timescales. Within Banking and Financial Services, a rapidly growing ecosystem of Fintech start-ups is threatening to undermine the conventional business models. Again, the established players need to innovate quickly or risk losing market share.

Conventional waterfall methodologies and their associated timescales are not fit for purpose in this new era of fast delivery in highly competitive markets.

But What About Performance?

Software in live is normally distributed over a complex IT environment, across multiple tiers of infrastructure and different technologies. How, in this world of rapid delivery and complex infrastructure and technologies, can we still deliver performance and minimise the risk to user experience and loss of revenue?

Capacitas has defined 7 Pillars of Performance. If any one of these pillars fails, the overall performance and end-user experience are impacted and/or the cost of supporting the service increases substantially.

 
Throughput & Response Time

This is the most widely understood criterion for performance. It measures the speed and successful throughput of the software. The speed can be measured at different points in the platform, based on user experience or on the time taken by the technical platform. Throughput is a measure of the rate at which work is achieved. For example, this could be the number of page views achieved per second or the number of database transactions processed per minute.

Capacity

This is how much capacity you need to support the software. There is a myth that in the world of cloud this is no longer an issue. In fact, the amount of cloud capacity provisioned has a direct bearing on the cost incurred. In addition, cloud capacity is not always instantaneously available.

Efficiency

This is a measure of how much capacity is used to deliver a business function, e.g. the number of CPU seconds used to deliver the search functionality of a digital platform. We often work with applications that are able to meet throughput and response requirements, but, due to their inefficiency, require a large number of server instances to run. This leads to excessive run cost.
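As a rough illustration of the efficiency pillar, the sketch below computes the CPU seconds consumed per request for a business function. The function name, the sample figures and the choice of "requests" as the unit of work are assumptions made for this example, not a Capacitas metric definition.

```python
def cpu_seconds_per_request(cpu_utilisation, cores, interval_seconds, requests):
    """Average CPU seconds consumed per request over a measurement interval.

    cpu_utilisation is the mean utilisation across all cores, as a fraction (0.0-1.0).
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    cpu_seconds_used = cpu_utilisation * cores * interval_seconds
    return cpu_seconds_used / requests


# Hypothetical figures: 8 cores at 50% utilisation for 60 s while serving 12,000 searches.
search_cost = cpu_seconds_per_request(0.5, 8, 60, 12_000)
print(f"{search_cost:.3f} CPU seconds per search")
```

Tracking a figure like this per release makes it possible to see efficiency drifting even while response times still look acceptable.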

Scalability

This is a measure of whether software can scale linearly with increasing load and can use all the available capacity. If it can’t then it will act as a drag on the speed of delivering software change in the future.
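One simple way to check for linear scaling, sketched here under assumed test data, is to compare throughput per unit of load at each load level against a low-load baseline. The 0.9 threshold and the sample figures are illustrative assumptions only.

```python
def scaling_efficiency(measurements):
    """Return (load, efficiency) pairs relative to the first (baseline) measurement.

    measurements: list of (load, throughput) tuples sorted by increasing load.
    An efficiency of 1.0 means throughput grew in proportion to load.
    """
    base_load, base_throughput = measurements[0]
    per_unit_baseline = base_throughput / base_load
    return [
        (load, (throughput / load) / per_unit_baseline)
        for load, throughput in measurements
    ]


# Hypothetical test results: concurrent users vs. requests per second achieved.
results = [(100, 200.0), (200, 400.0), (400, 760.0), (800, 960.0)]
for load, efficiency in scaling_efficiency(results):
    flag = "" if efficiency >= 0.9 else "  <- sub-linear"
    print(f"{load:>4} users: {efficiency:.2f}{flag}")
```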

Stability

This is a measure of how stable performance is over long periods of time and prolonged periods of load.

Resilience

This looks at how software behaves when an internal or external interface slows down or becomes unavailable. We would expect the parts of the software which do not call these interfaces to remain unaffected when they slow down or become unavailable. Usually, software is better at handling an interface being unavailable than an interface slowing down. A term that is sometimes used by our customers is that their software can only handle “happy day scenarios”.

Instrumentation

Instrumentation is critical. Without it we simply can’t understand the 6 pillars mentioned above. APM tools (such as AppDynamics, Dynatrace, New Relic) provide an invaluable source of data; however, they need to be used in conjunction with other sources of data to get a comprehensive view of performance. Following the ITIL Framework, we group our metrics into three categories: Business, Service and Component.

The Holy Grail of Faster Delivery & Good Performance

Although development methodologies and technologies have evolved to deliver software faster, the same innovations for managing performance have not been put in place. The following section details an approach to delivering the holy grail of delivery speed whilst maintaining good performance. How do we ensure that the 7 Pillars of Performance are maintained given that developers must not be slowed down?

Facebook famously had a mantra of moving fast and breaking things. However, they have more recently back-tracked from this, as they found that they were spending more time fixing production bugs than delivering new functionality. Our recommendation is that you can use this approach where your software is not business-critical. However, if your software is business-critical, there are four reasons why this will not work.

 

It does not work if you have large peaks.

Some performance defects only manifest at peak load. You can release software on a normal load day and everything is fine. However, six months later, during a peak day, the defect in that release causes a system outage. At this stage it is very difficult to pinpoint the root cause of the defect.


Some defects only manifest over prolonged periods.

Performance defects such as memory and CPU leaks only manifest over long periods of time. On release all may look well; however, after a period of time an incident will occur. Since multiple releases will have taken place in the intervening period, it is very difficult to unravel the code to find the root cause.

It may not be a leak.

It could also be a fundamental design flaw that only manifests after multiple releases are applied to weak foundations.


It is far more expensive to fix defects in live than it is early in the software lifecycle.

Research suggests that fixing defects in live is 100x more expensive than fixing defects during the design stage. In our customer engagements, we find that complex performance defects take a long time to identify and resolve. This is due to a combination of factors, such as the complexity and distribution of the system that supports the software. A publicly documented example of this is Netflix, which had a performance issue with Node.js that took 16+ days to resolve.


Cost & Efficiency.

This is often overlooked. There will be frequent releases into live in an Agile/Continuous delivery approach. On numerous engagements we have observed a small increase in capacity consumption per release resulting in a cumulatively large increase over a longer time frame. Increases of 1-3% in capacity consumption per release are not uncommon in these frequent delivery models. This would lead to an annual increase in capacity consumption of up to 38% (assuming monthly releases).

In the cloud, this has a direct correlation with run cost.
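The cumulative effect can be checked with a short compounding calculation. This is a sketch that assumes each release's increase compounds on the previous one; the function name is illustrative. Under that assumption, eleven increases of 3% give roughly 38%, and twelve give roughly 43%.

```python
def cumulative_growth(per_release_increase, releases):
    """Compound growth in capacity consumption after a number of releases."""
    return (1 + per_release_increase) ** releases - 1


# Compare the low and high ends of the 1-3% range over 11 and 12 releases.
for rate in (0.01, 0.03):
    for releases in (11, 12):
        growth = cumulative_growth(rate, releases)
        print(f"{rate:.0%} per release x {releases} releases: {growth:.1%}")
```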


 

"I Can't Possibly Build All of this into QA
Within My Available Time & Budget, Right?"

Wrong! You can by working smarter:

  • By taking a Risk-Based approach
  • By Automating execution and analysis

Your approach should reflect five key principles:

1. Identify Performance Risk Comprehensively.

Assuring throughput and response is not enough. Risk across the 7 Pillars of Performance must be addressed.

2. Implement a Lifecycle Risk Management Strategy that Focuses Time & Effort on High-Risk Items.

  • In an Agile/CI delivery approach it is too complex and time-consuming to test everything, and testing is not always the appropriate mitigation. A more efficient approach is to focus on changes where a defect is both likely to be introduced and likely to have a significant impact.
  • For instance, a change to search on an e-commerce website could carry a high likelihood of introducing a defect and, since search is likely to be the most frequent action on the site, a high impact on website quality. A minor change to the user interface would be an example of a low-risk change.
  • Based on the risk, appropriate mitigation can then be put in place. This might involve load testing, but could be some other activity, such as reviewing the implementation, testing at low loads or tracking in production.
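A minimal sketch of this kind of risk scoring is shown below. It assumes a simple likelihood-times-impact score on 1-5 scales; the class, the thresholds and the mapping to mitigations are hypothetical and would need calibrating for a real programme.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    likelihood: int  # 1 (defect unlikely) to 5 (defect very likely)
    impact: int      # 1 (minor impact) to 5 (severe impact)

    @property
    def risk_score(self) -> int:
        return self.likelihood * self.impact


def mitigation_for(change: Change) -> str:
    """Map a risk score to an illustrative mitigation; thresholds are assumptions."""
    if change.risk_score >= 15:
        return "load test"
    if change.risk_score >= 6:
        return "implementation review and low-load test"
    return "track in production"


changes = [
    Change("Rewrite search query logic", likelihood=4, impact=5),
    Change("Minor UI restyle", likelihood=1, impact=2),
]
# List the riskiest changes first so effort goes where it matters most.
for change in sorted(changes, key=lambda c: c.risk_score, reverse=True):
    print(f"{change.name}: score {change.risk_score} -> {mitigation_for(change)}")
```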

3. Collaboration on Performance.

Fast performance engineering only happens when the relevant roles work closely together. This means performance engineers and analysts collaborating with architects, product owners, developers and testers to drive a focus on performance throughout the lifecycle. Bringing together performance and technical domain experts in this way provides the insight needed to spot problems quickly. It also means that the implementation teams gain the performance insight required to build an efficient system, while also providing the performance team with the domain knowledge to construct smart tests.

4. Smart Test Design.

In Continuous and Agile environments, we operate within narrow test windows on non-representative test environments. Smart test design is required to expose risks effectively, across the 7 pillars, within the constraints of time and environments.

5. Automation of Performance Analysis Throughout the Lifecycle.

The aim of automation is not just speed but also identifying early warning signs of risk in non-representative load tests. In the old world, performance testers had long test windows in which to conduct large and complex performance tests to re-create problems. In the new world of Agile/CI, the performance engineer does not have the time to run these large, complex tests. In a CI cycle, you may only have 25 minutes, and it is unlikely that you will recreate incidents in this test window. In order to identify risks, the performance engineer needs to look beyond conventional metrics such as response time and CPU utilisation. This means looking at metrics deep within the system for anomalies in behaviour that could present a risk. This requires both performance engineering expertise and domain expertise.
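As one illustration of an early warning sign that can be picked up in a short window, the sketch below fits a straight line to memory samples from a 25-minute test and extrapolates the trend to 24 hours. It is a simplified example under stated assumptions (steady load, linear growth, invented sample data), not a description of any Capacitas tooling; a real analysis would also need to account for warm-up and garbage collection.

```python
def slope_per_minute(samples):
    """Least-squares slope of (minute, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def projected_growth(samples, horizon_minutes):
    """Extrapolate the fitted trend over a longer horizon (e.g. 24 hours)."""
    return slope_per_minute(samples) * horizon_minutes


# Hypothetical heap usage (MB) sampled once a minute during a 25-minute test under steady load.
heap_mb = [(minute, 512 + 0.4 * minute) for minute in range(25)]
growth_24h = projected_growth(heap_mb, 24 * 60)
print(f"Projected heap growth over 24 hours: {growth_24h:.0f} MB")
```

A small slope that looks harmless over 25 minutes can translate into hundreds of megabytes over a day, which is the kind of signal that would otherwise only appear in a long soak test.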

The Capacitas Solution to Ensure Performance
in an Agile / Devops / CI Environment

This is made up of 6 Modules as shown in the diagram below. There are three key points that need to be understood when viewing this diagram:

  • In an Agile, DevOps or CI context, there may be no clear start and end times of these activities as they are continuous.
  • Not all of these modules will be carried out for every change. The frequency and number of modules carried out depend on the level of risk of each change.
  • This is a collaborative approach with the development teams building a collective understanding of the software and how it works in the live ecosystem and in test environments.


1. Risk Modelling

  • The performance lead ensures that all features and changes undergo a performance risk assessment during elaboration sessions and sprint planning, refining it with business analysts and developers as required.
  • This typically includes conceptual modelling of the system to determine risks based on architectural or design decisions.
  • Following creation of the risk assessment a risk mitigation plan is put in place to ensure appropriate levels of intervention for each risk across the other Performance Engineering modules.
  • The performance lead is responsible for ensuring that all risks are mitigated appropriately and that risk levels are kept up to date and owned throughout the lifecycle.


2. Performance Reviews

  • Performance engineers work with architects, designers and developers to ensure that best practice is built into the application throughout.
  • This is based on architecture and design documents, and engagement with appropriate personnel where necessary.
  • Collaboration with developers takes place as high risk changes are developed.
  • Performance analysts check for performance anti-patterns and work with developers to eliminate issues before code check-in.

 

3. Profiling and Unit Testing

  • For high risk items the analyst works with the developers to create unit test definitions which are included in the risk mitigation plan, defining units to be tested and acceptance criteria.

  • For any areas of poor performance identified through unit testing the analyst collaborates with developers to assist with the identification of hotspots through code profiling.

  • This collaborative process pinpoints inefficiencies in the code during the development lifecycle.

 

4. Early Load Testing

  • Early tests are typically carried out in small, unrepresentative environments. In order to get round this limitation, smart tests need to be designed to expose performance risks by targeting key functionality.

  • At Capacitas, we use our proprietary software accelerator (TNT) to automatically analyse test results. TNT uses a 13-metric model to examine performance across the seven pillars and automatically detect performance pathologies.

  • This activity is fully automated and carried out frequently.

 

5. Integration Load Testing

  • Integration load testing takes place as the system comes together, usually at the end of a cycle.

  • Scaled production-like workload mix tests are run over an integrated test environment.

  • Workload mixes can be altered to target different what-if scenarios of future user load and behaviour.

  • At Capacitas, we use our TNT software accelerator to automatically detect performance pathologies across the seven pillars and deliver rapid feedback to development teams.

 

6. Production Validation

  • After release in live, data is gathered and analysed.

  • A production health check is carried out to identify risks not observed in test. At Capacitas, we use our proprietary ‘Operational Analytics’ software accelerator to automate this analysis.

  • A before/after check on the monitoring data will be used to identify any impacts of the development work in production.

  • Findings of the production validation are fed back into the SDLC and the performance engineering cycle as continual service improvement actions.
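The before/after check can be sketched as a comparison of a metric's median across two monitoring windows. The metric, the sample values and the 5% tolerance below are assumptions for illustration; the comparison is only meaningful when both windows cover comparable load.

```python
from statistics import median


def before_after_change(before, after):
    """Relative change in the median of a metric between two monitoring windows."""
    baseline = median(before)
    return (median(after) - baseline) / baseline


# Hypothetical CPU seconds per 1,000 requests, sampled hourly at comparable load.
before_release = [41.0, 40.0, 42.0, 39.0, 40.0]
after_release = [44.0, 45.0, 43.0, 44.0, 46.0]

change = before_after_change(before_release, after_release)
THRESHOLD = 0.05  # assumed tolerance: flag changes larger than 5%
status = "investigate" if abs(change) > THRESHOLD else "ok"
print(f"Median change: {change:+.1%} ({status})")
```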

 

Summary

In summary, Capacitas believes that delivering change faster while maintaining performance requires the following five paradigm changes to the conventional performance engineering approach.

Performance is not simply about response times and throughput; that is too simplistic a measure. An all-embracing approach is required, and Capacitas’s 7 Pillars of Performance provide a comprehensive way of measuring performance.

In Agile & Continuous Cycles, there is simply not enough time to test every change. A risk-based approach, using techniques such as Risk Modelling, is required.

Shift Left & Continuous Collaboration. Performance engineers and analysts need to shift from testing at the end to being involved throughout the lifecycle, collaborating with developers to build a better understanding of the software and to refine the conceptual model of the platform.

Smart Design. Smart test designs are needed to expose risks at lower loads and in narrow test windows.

Automation of Testing & Analysis. Automated analysis needs to address not just response times and throughput but all 7 Pillars of Performance.

 

References

Integrating Software Assurance Into The Software Development Life Cycle
Journal Of Information Systems Technology And Planning (2010)

Why ‘Move Fast And Break Things’ Doesn’t Work
Thomas Barns, Capacitas

Why Traditional Performance Testing Cannot Survive In An Agile And Devops World
Andy Bolton, Capacitas

Why Testing At The End Doesn’t Work
Prasham Garg, Capacitas

Automating Performance Test Analysis To Speed Up Software Delivery
Ian Donnell, Capacitas

Node.js In Flames
Netflix

 


Next Steps

If you find this whitepaper relevant and interesting, you might also find these resources helpful:

Webinar
easyJet Case Study - Managing Performance Whilst Delivering Faster and Implementing Rapid Technology Change

Infographic
The Seven Pillars of Performance

Capacitas Blog
http://www.capacitas.co.uk/blog

Dr. Manzoor Mohammed

Director

About the Author

Dr. Manzoor Mohammed has worked in the area of capacity and performance management for over 20 years.  He started his career as a performance engineer at BT Research Laboratories. He co-founded Capacitas, a consulting company which reduces cost and risk in business-critical IT systems through capacity and performance management. 

He has worked on numerous large, complex projects for customers such as BT Global Services, HP, Skype/Microsoft, easyJet and Nokia.

Many of these engagements have resulted in multi-million-dollar savings in datacentre and cloud platform costs, as well as better performing and more stable systems.

Dr. Mohammed leads the R&D function at Capacitas, having developed a ‘shapes-based’ methodology for the automated analysis of performance and capacity issues, which forms the basis of a suite of data analytics tooling.

Thomas Barns

Principal Consultant

About the Author

Thomas is Risk Modelling and Performance Engineering Service Lead at Capacitas, responsible for service definition and ensuring consistent best practice across projects.

Over the past 10 years he has worked on large projects providing capacity and performance expertise to clients and owned the roadmap for developing Capacitas’ technical software solutions.

During this time, he has seen a big shift in how software engineering is undertaken and viewed by the business, and has built on this to introduce more effective and efficient performance risk management processes. This has meant shifting focus away from large scale system testing to a full lifecycle approach, alongside research and development in automated data analysis.

Thomas is currently defining and governing Performance Engineering processes and standards for a multi-million-pound multi-vendor programme of work at a FTSE 100 company.
