Why Apache Benchmark Is Not Enough

First published on Tuesday, 5 September 2017

This blog post was first published on the Qafoo blog and is duplicated here since I wrote it or participated in writing it.

Warning: This blog post is more than 8 years old – read and use with care.

You have been working for months on a new web application or e-commerce system, and usually only a few weeks or even days before the launch a sufficiently complete feature set is running on a production-like system, so that you can run a realistic load test. Hopefully, this test provides you with an accurate picture of the performance of your future system.

To save time and effort, you probably opt for the simple solution and use the widespread Apache Benchmark (ab) or siege command-line tools to set up a load test. Both allow you to generate load on a given URL and collect performance metrics. With a little more effort, siege even allows you to provide a list of URLs and to log in if necessary. The results from both tools are simple numbers that are easy to communicate: requests / second and average, minimum and maximum response time.

We often advise resisting the temptation of simple tools and numbers: they are suited for benchmarking, but you cannot trust the results of ab or siege to be a realistic real-world load simulation of your system.

There are a number of reasons for this:

Both ab and siege only allow you to test a single, hardcoded path through your application. You can increase the number of simultaneous users that simulate this use-case, but they all do exactly the same thing. If you are load-testing a shop system, this means visiting the exact same category page and product page and searching for the same search terms.
The problem with this approach is that you will probably have a much higher cache hit ratio in all parts of your stack (Reverse Proxy, MySQL, Memcache, Opcode Cache, Kernel) than under real-world traffic.
You need more randomness in bot users following different paths, starting at and visiting different pages to simulate the real amount of cache misses.
The simulated click-path of your users provides just one usage scenario. The problem with this is that you have no fine-grained control over the realistic usage share of your features. In a shop you will have a much higher share of users just viewing products than users actually checking out and buying a product.
In more complex load-testing setups, you would define several different scenarios such as anonymous user, search-engine bot, logged in user, random traffic user, buying user and so on. Then you would configure your test to run different shares of each scenario to provide a more realistic model of your real world traffic.
ab and siege cannot be used to define complex use-cases with form submission or multi-step processes (for example a checkout). Code triggered by complex use-cases usually has a higher resource usage that can affect the performance of the other endpoints. Example: If your homepage can handle 100 users / second without traffic on any other pages, then maybe it's only 20 users / second as soon as other, heavier pages are requested at the same time.
The requests / second and average response time metrics are simple to understand and communicate, but are misleading in the end.
First, with the requests / second metric, all you can actually say is "When our site is used with this unrealistic traffic pattern, then we can handle so many users per second." Not very reassuring, given the three previous arguments about how far off we are from real-world usage.
Second, you should never use the average to analyse response times. The average is only a meaningful summary when the data roughly follows a normal distribution. But response time data is almost always either log-normally distributed or has several peaks due to caching. This means the percentiles at 50%, 75%, 90%, 95% and 99% provide you with much better insights.
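
The difference between the average and the percentiles is easy to see with a small Python sketch. The sample data is made up for illustration (90 fast cache hits and 10 slow cache misses), and the helper names `percentile` and `summarize` are hypothetical, not part of ab, siege or any other tool. The nearest-rank method used here is only one of several common percentile definitions.

```python
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it.
    """
    if not sorted_values:
        raise ValueError("need at least one sample")
    # Integer ceiling of p * n / 100, avoids floating point rounding.
    rank = (p * len(sorted_values) + 99) // 100
    return sorted_values[rank - 1]


def summarize(response_times_ms):
    """Return the mean and a few percentiles of response times."""
    ordered = sorted(response_times_ms)
    summary = {"mean": statistics.fmean(ordered)}
    for p in (50, 75, 90, 95, 99):
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


# Made-up sample: 90 requests served from cache, 10 cache misses.
samples = [20.0] * 90 + [800.0] * 10

for name, value in summarize(samples).items():
    print(f"{name}: {value:.1f} ms")
```

In this made-up sample the mean is 98 ms, a response time that not a single request actually had. The percentiles show the two peaks instead: up to the 90th percentile requests take 20 ms, while the 95th and 99th percentile are at 800 ms.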

Are there reasons to use ab or siege? Yes: when you work locally on a specific page and try to optimize it using a benchmark, you can quickly get a relative comparison of the performance before and after a change under similar traffic conditions.

But if you want a realistic estimate of the traffic your production system can handle, then you should use specialized tools such as Apache JMeter.

In addition to a UI, where you can click together complex use-cases and scenarios using different load-generating strategies, JMeter is also fully programmable for every possible use-case. It does take time to learn JMeter, but in return you are much more flexible in running different detailed scenarios, and you get detailed data from every single request that you can analyze.
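
Independent of any particular tool, the idea of weighted scenarios with randomized click-paths can be sketched in a few lines of Python. This is only an illustration and not how JMeter is configured: the scenario names, shares, URL patterns and ID ranges are all assumptions made up for this example, and real shares should come from your own traffic analysis.

```python
import random
from collections import Counter

# Hypothetical scenarios: name -> (share of traffic, click-path templates).
SCENARIOS = {
    "anonymous_user": (60, ["/", "/category/{category}", "/product/{product}"]),
    "search_engine_bot": (15, ["/product/{product}"]),
    "logged_in_user": (15, ["/login", "/category/{category}", "/product/{product}"]),
    "buying_user": (10, ["/product/{product}", "/cart", "/checkout"]),
}


def build_session(rng, categories=50, products=5000):
    """Pick a scenario by its share and fill in random page IDs.

    Random IDs make simulated users hit different pages, so caches are
    not warmed by everyone requesting the same URL.
    """
    names = list(SCENARIOS)
    weights = [SCENARIOS[name][0] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    urls = [
        template.format(
            category=rng.randint(1, categories),
            product=rng.randint(1, products),
        )
        for template in SCENARIOS[name][1]
    ]
    return name, urls


rng = random.Random(42)  # fixed seed to make a test run repeatable
sessions = [build_session(rng) for _ in range(1000)]

# Show how often each scenario was chosen and one example click-path.
print(Counter(name for name, _ in sessions))
print(sessions[0])
```

A load generator would then request the URLs of each session in order. Multi-step scenarios such as the checkout additionally need form submissions and session handling, which is where a dedicated tool pays off.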
