Why do so many online services repeatedly fail at high load?

Addressing Poor Availability

Poor application availability continues to blight IT organisations. Degraded user experience results in loss of business and revenue and has significant financial and reputational impacts. Typically these incidents occur after releasing new functionality and during peak business periods.

This situation has not changed since commercial IT applications first started to be developed and deployed. There are litanies of war stories with a common theme of poor application performance. A survey of 100 IT directors conducted by Vanson-Bourne in 2006 found that only one in ten firms had any performance testing in place. At that time a third of firms surveyed admitted that the impact of poor application performance cost them over £1 million per annum. For 15% of companies this rose to over £2 million each year.

More recently the situation has not improved; if anything, levels of performance have worsened. This is primarily due to the advent of cloud computing and other complex technical architectures, and to the business appetite to introduce change quickly. However, there are other factors that need to be considered.

The Black Friday event is a recent example of how many commercial websites are not able to accommodate peak traffic flows, becoming slow and unresponsive. Our monitoring of Black Friday 2015 continued to highlight the same shortfalls in performance seen the year before. Apparently organisations are not changing their practices to avoid such problems.

Why does this happen?

Several key factors explain why performance risks continue to manifest through software delivery and into production.

Complex modern architectures are costly to change and difficult to manage. Factors that increase complexity include: use of cloud, integrated technology solutions, extensions to legacy systems, multi-tiering, chatty applications, multiple/intensive workload processing, etc.
Performance requirements for mission critical applications are often not given serious consideration. Applications are rarely designed to meet performance criteria even though performance modelling methodology can be utilized to validate criteria (as in our Risk Modelling service). Invariably the modular approach to design taken for multi-tiered architectures does not consider the overall end-to-end effects of the design. An end-to-end view is often hampered further by having separate teams with responsibilities for discrete components.
Software quality is often poor with little attention to ‘performance efficient’ coding. Developers underutilize their knowledge of performance accelerators and inhibitors (e.g. use of caching, query performance, recursive calls, memory handling, etc.) and fail to relate the change back to performance criteria. Delivery pressures invariably exacerbate this dilemma.
There is a perception that performance testing is the most effective and economical means of mitigating performance risk. However, this invariably exacerbates the ‘shift-right’ paradigm as performance risks tend to emerge late in the lifecycle, and potentially after release.
There is excessive reliance on Application Performance Management (APM) tools as the single solution to isolate and identify the root cause of performance problems. APM tools are rarely deployed well enough to realize their potential, and similarly suffer from the ‘shift-right’ paradigm.

Is the solution to do more Performance Testing?

Whilst more performance testing will help, on its own it is not a complete solution and need not be applied in all situations. It is a certainty that more investment in this area would protect many more firms from adverse consequences. Investment in performance testing needs to be targeted according to the performance risk profile of a given set of changes. This ensures that testing effort is applied economically.

Integrated performance testing requires a stable application and is prone to overruns because of inherent complexities and dependencies. Therefore, the testing conclusions tend to come late in the delivery cycle, and often result in protracted and expensive remediation work. This engenders the shift to the right, often causing project overruns. The pressure to release can result in performance risks not being identified or being downplayed.

Software Performance Engineering - A more complete solution

Risk modelling yields most value when conducted at the beginning of the project and utilized to drive a more comprehensive performance engineering response. This is an ideal starting point to determine the scale and nature of performance risk and empowers projects to make objective decisions on how to proceed, removing any guesswork.

Software performance engineering works most effectively when initiated right at the start of the requirements and design phases. At these stages it is far more economical to establish coherent performance requirements alongside good design with efficient and scalable coding practice. Performance modelling methods should be used to validate performance objectives and identify significant design flaws before any test script is executed. This ensures that a risk-based approach to performance testing completes successfully, delivering the user experience that your customers need.
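To give a flavour of how a performance objective can be checked before any test script exists, here is a minimal sketch that assumes a single-server M/M/1 queue. The choice of model, the function names, the 20 ms service time and the 100 ms target are our illustrative assumptions; real systems usually need richer models.

```python
def mm1_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time in seconds for an M/M/1 queue.

    arrival_rate: requests per second.
    service_time: mean seconds of work per request.
    """
    utilisation = arrival_rate * service_time
    if utilisation >= 1.0:
        # The server cannot keep up: the queue grows without bound.
        return float("inf")
    return service_time / (1.0 - utilisation)


def meets_target(arrival_rate: float, service_time: float, target_seconds: float) -> bool:
    """True if the modelled mean response time is within the target."""
    return mm1_response_time(arrival_rate, service_time) <= target_seconds


if __name__ == "__main__":
    service_time = 0.02   # hypothetical: 20 ms of work per request
    target = 0.1          # hypothetical objective: 100 ms mean response time
    for label, rate in [("normal", 20.0), ("peak", 45.0), ("overload", 60.0)]:
        response = mm1_response_time(rate, service_time)
        verdict = "meets" if meets_target(rate, service_time, target) else "misses"
        print(f"{label}: {rate:.0f} req/s -> {response:.3f} s ({verdict} target)")
```

Even this crude model shows response time rising sharply as utilisation approaches 100%, which is the kind of design-stage warning that would otherwise surface only during late load testing or in production at peak.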

