Common Bottlenecks in Performance Tests

First published on Tuesday, 19 April 2016

This blog post was first published on the Qafoo blog and is duplicated here since I wrote it or participated in writing it.

Warning: This blog post is more than 8 years old – read and use with care.


Most developers have by now internalized that we should not invest time in optimizations before we know exactly what is happening. Or, as Donald Knuth wrote:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. [1]

This is true for optimizations in your PHP code, but also for optimizations regarding your infrastructure: we should measure before we optimize, or we waste time. When it comes to assumed performance problems in a system architecture, most people guess that the root cause will be the database. This might be true, but in most projects we put under load it proved to be false.

So, how can we figure out where the problems are located in our stack?

You need to test the performance, as always. For optimizing your PHP scripts you would use Xdebug and KCachegrind, or something like Tideways: you trigger a single request, see what is slow in there, and optimize that part.
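
As a minimal sketch of that workflow – assuming the Xdebug 2.x settings that were current when this post was written – the profiler can be enabled in php.ini so that every request writes a profile KCachegrind can open:

    ; php.ini – Xdebug 2.x profiler settings
    ; each request writes a cachegrind.out.* file to the output directory
    xdebug.profiler_enable = 1
    xdebug.profiler_output_dir = /tmp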

System Test Setup

It is slightly more complicated to test your full stack. In the optimal case you simulate real user behaviour; it is definitely not sufficient to just run ab or siege against your front page (see the example after the following list). For an online shop, typical user tasks could be:

  • Browsing the catalogue (random browser)

  • Product searches

  • User sign up & login

  • Order checkout
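
To illustrate the contrast: the naive approach dismissed above boils down to hammering a single URL, which exercises neither search, sign-up, nor checkout. A hedged sketch – host and numbers are placeholders:

    # 50 concurrent users hitting only the front page for 5 minutes
    siege -c 50 -t 5M https://shop.example.com/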

We usually discuss the common tasks on a website with the owner of the website. Then we discuss, for each of those task groups, the numbers we should simulate to mimic a certain scenario. With this information a JMeter test can be authored that simulates the real behaviour of your users. After a test run you can compare the access logs of the test with common access logs from your application to verify you simulated the right thing – if those logs are available.
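
Once such a test plan exists, the load test itself is usually run in JMeter's non-GUI mode so the test machine is not slowed down by the GUI; the plan file name below is hypothetical:

    # run the test plan headless and log the results for later analysis
    jmeter -n -t shop-load-test.jmx -l results.jtl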

Stack Analysis

Once you have written sufficiently generic tests, you will be able to simulate larger numbers of users accessing your website – some may call it DDoSing your own page. While running these tests you can watch all metrics of your system closely, and you will be able to identify the performance problems in your stack.

There are a couple of tools which might help you here, but the list is far from exhaustive and depends a lot on the concrete application (example invocations follow the list):

  • vmstat watches CPU usage, free memory and other system metrics

  • iftop shows if network performance is an issue

  • Tideways or XHProf for live analysis of your application servers
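
For example, during a test run you might keep the first two running on every machine involved; the interface name is an assumption:

    # print CPU, memory and I/O statistics every 5 seconds
    vmstat 5

    # show per-connection bandwidth on the public network interface
    iftop -i eth0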

On top of that you should definitely watch the error logs on all involved systems.

It Is Not Always the Database

For most of our test runs, the performance problems of the websites we put under test were not rooted in the database:

  • Varnish & Edge Side Includes (ESI)

    We tested a large online shop which used ESI so extensively that the application servers running PHP were actually the problem. The framework in use had a high bootstrap time, which made it the most urgent performance impediment. The database wasn't even sweating.

  • Network File System (NFS) locking issues

    Once you put high load on a server, subsystems will behave differently. NFS, for example, implements locking for a distributed file system. When multiple servers are accessing the same set of files, this can stall your application entirely – something you will almost never hit during development, but will in load tests or later in production (see the sketch after this list).
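
As a rough illustration of that failure mode, consider a minimal PHP sketch – the shared file path is hypothetical, and on an NFS mount the flock() call has to be coordinated across all web servers:

    <?php
    // Hypothetical example: every request locks the same shared file,
    // e.g. a cache file that lives on an NFS mount.
    $handle = fopen('/mnt/nfs/shared/cache.data', 'c+');

    // On local disk this lock is cheap; on NFS it is negotiated with the
    // file server, so under load requests queue up here and the page stalls.
    if (flock($handle, LOCK_EX)) {
        $data = stream_get_contents($handle);
        // ... read or update the shared data ...
        flock($handle, LOCK_UN);
    }

    fclose($handle);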

There are even configuration issues which only occur under load and degrade your performance more than any slow query would.

  • Broken server configurations

    In one case a web hoster who claimed to be specialized in cluster setups provided the setup we ran the tests on. The cluster allowed a lot more FPM children to spawn than the database accepted connections. Once put under load, the MySQL server rejected most of the incoming connections, which meant the application failed hard (see the sketch after this list).

  • Opcode cache failures

    A wrong opcode cache (APC, eAccelerator, …) configuration can even degrade PHP performance. This is also something you will not notice without putting the system under load – you will only see it when many customers try to access your website, or during a load test.
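
A hedged sketch of the broken configuration described above – the numbers are invented, but the directives are the standard php-fpm.conf and my.cnf settings:

    ; php-fpm.conf on each of, say, 10 web servers:
    ; 10 servers * 100 children = up to 1000 concurrent PHP processes,
    ; each of which may open a database connection
    pm.max_children = 100

    # my.cnf on the single MySQL server:
    # far fewer connection slots than the cluster can open under load –
    # once they are all taken, new requests die with "Too many connections"
    max_connections = 150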

Summary

If you want to ensure the application performs even under load, you should simulate this load before going live. The problems will usually not be where you thought they would be. Test, measure and analyze. The time invested in a load test pays off far better than random optimizations based on guesswork.

Kore will give a talk about load testing, "Packt Mein Shop das?" ("Will My Shop Manage This?"), at Oxid Commons in Freiburg, Germany on 02.06.2016.

[1] Donald E. Knuth: "Structured Programming with go to Statements", Computing Surveys, Vol. 6, No. 4, December 1974.
