Techniques for Performance Bottleneck Analysis

What is a performance bottleneck?

Before we go into detail about the analysis techniques, it is important to know what a performance bottleneck is. A performance bottleneck can be anything that impedes system performance. Think of a software system as a 3-lane highway. During off-peak hours, vehicles can go as fast as possible since almost all the lanes are empty. What about peak hours, when all the lanes are fully occupied? Of course, cars need to slow down and may even go bumper to bumper. Why is that? Once the highway reaches the limit of vehicles it can handle, it forces the vehicles to slow down due to queuing. So the capacity of the highway is the bottleneck here. It is pretty much the same scenario with software. The highway capacity is similar to the CPU capacity in software: when the CPU is fully busy serving requests, it cannot take any more requests and everything else will slow down.

Let us assume that there is road work happening on the highway and one lane is closed. Obviously, the highway will reach its capacity earlier than expected. What is the reason? Vehicles are not able to utilize one of the lanes, so the issue here is the blocked lane. What does that mean in software? The bottleneck need not be the CPU itself but can be other resources that determine how much work reaches the CPU. The best example is software threads. If the system is designed to execute 50 threads but 10 threads are blocked due to an issue, the system is left with only the remaining 40 threads. So the bottleneck here is the blockage in those 10 threads.
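The effect of blocked threads on capacity can be sketched with some simple arithmetic. This is a hypothetical illustration; the pool size and per-thread rate below are invented, not taken from any real system.

```python
# Hypothetical sketch: effect of blocked threads on pool capacity.
# Pool size and per-thread rate are illustrative numbers only.

def effective_throughput(pool_size, blocked, per_thread_rps):
    """Requests/second the pool can serve when some threads are blocked."""
    usable = max(pool_size - blocked, 0)
    return usable * per_thread_rps

# A 50-thread pool where each thread serves 2 requests/second:
print(effective_throughput(50, 0, 2))   # 100 rps with a healthy pool
print(effective_throughput(50, 10, 2))  # 80 rps -- the blockage, not the CPU, is the bottleneck
```

The point of the sketch is that capacity dropped 20% even though the CPU itself is untouched; the blocked threads are the bottleneck.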

Now, we have talked about 2 different scenarios affecting performance; think about another scenario where the highway is free and there is no road work. What will happen if you have mechanical issues with your car and can only drive very slowly? You definitely have to give way to faster-moving vehicles and are going to slow down many others in your lane. Well, that is exactly the same problem you see with the code in your software. If one piece of code is performing poorly, it is going to affect everything around it and, in turn, the whole system. Based on the above, your performance bottlenecks can be any hardware resource like CPU, memory or storage; software resources like threads, connections or process-specific memory such as the Java heap; or even part of the source code itself.

What is performance bottleneck analysis?
We talked about different types of performance bottlenecks. In most cases, you will come to know about the problem only after the fact, and even then you may not know the bottleneck or its root cause. That is where analysis comes into the picture. In the highway example mentioned earlier, how do we know which car is causing the bottleneck out of thousands of cars? Each car will be a different model, with a different engine, size, speed and, obviously, a different driving style based on the driver. Also, the size of the highway will vary from place to place: there can be wide lanes and narrow lanes. So there are many different attributes that drive the performance of a highway, and it is exactly the same with software. Performance bottleneck analysis is the art of understanding the different factors affecting the performance of the system and identifying the root cause of an issue.

What data and when to do the analysis?
As part of any performance testing, we collect lots and lots of data from the client, network and servers. This data can vary from average response time and hits per second all the way up to server CPU utilization %, web container thread pool usage or Java Virtual Machine heap utilization. This huge amount of data needs to be analyzed carefully to validate the performance against a service level agreement (SLA) and identify performance bottlenecks, if any. Validation of performance against an SLA is comparatively easy, provided the requirements are defined clearly and the right tools are in place. When the results show performance issues, they have to be analyzed in detail to identify the root cause. This requires technical knowledge, analytical skills and different techniques to effectively identify the real culprit.

The input for this analysis is the performance metrics in the different layers and tiers of the system. At a very high level, this can be CPU utilization %, physical memory usage, IO rates, thread usage, database connection usage, method time etc. It becomes more granular and complex when we consider different combinations of hardware (IBM, HP etc.), operating system (Windows, Linux, AIX etc.) and software (Apache, WebSphere, Oracle etc.).

Since the analysis is based on performance metrics, it can be done with data collected from either the production system or the test system. It is easier to collect more granular data in a test system than in production. Since the objective of this white paper is the analysis techniques, let us focus on those rather than on testing methodologies or the technologies involved.

Analysis techniques
The techniques discussed here are Trend Analysis, Correlation, Comparison, Elimination, Pattern Matching and Drill Down.

Trend Analysis: This is a technique of looking at the behavior and frequency of a performance issue to determine the bottleneck, or at least the steps to identify it. Performance trends can be of different types: consistent, fluctuating, increasing and even decreasing. Trend analysis can be done on end-user performance metrics (like response time) as well as on server metrics. You need to get answers to some basic questions here. How frequently do the performance issues appear? Is it a one-time issue or does it occur multiple times? Does it show any other trend, such as appearing during a user spike rather than during normal hours? Is it happening at an average load of x users, or only when more than x users are in the system? These answers will help you determine the right trend and, from that, whether it is a resource issue or a code bottleneck.

To understand the analysis better, let us take a look at the graphs in Figure 1 below. They show 4 different types of client-side trends from load test results. In general, they show the "Save Order" transaction response in an Order Management system under user load. For the purpose of discussion, let us consider a response time SLA of 5 seconds.

Trend 1: Consistent. This is an example of a consistent trend. Though the response time fluctuates, over a long duration it is very consistent. The "Save Order" response time is 70 seconds over the test duration, and user load doesn't seem to have any impact on it. What does that mean? The transaction is not within the SLA even without load on the system. This points to a code bottleneck and is very unlikely to be a hardware or software resource issue.

Trend 2: Consistent with spikes. This shows response times of less than 5 seconds even at higher load, except for a few spikes during ramp-up. Is a spike a problem? It depends on what type of system it is and what is acceptable. What could be causing the spikes? They can be due to inadequate resources in the system, such as threads. One suggestion for analyzing this bottleneck is to look at server software resource usage (threads, connections, process memory, cache etc.) over the test duration.

Trend 3: Regular peaks and troughs. In this scenario, the response time is within the SLA at the beginning of the test, but uneven peaks and troughs start from a 10-user load and continue even at a constant load of 120 users. It can be seen that response times at the troughs are always within the SLA while response times at the peaks are not. This trend points to a resource bottleneck. How do we know which resource is the bottleneck here? It doesn't seem to be caused by the user load itself, so some resource that is time bound or user bound under load could be the bottleneck. One possible culprit is a load balancing issue.

Trend 4: Increasing. This shows an increasing response time trend with user load. Under smaller user loads, response times are within the acceptable limit, but they start increasing exponentially at around 70 users in the system. At a constant load of 105 users, the response time fluctuates between 30 and 100 seconds. It can be concluded that the system is able to scale up to 70 users without degrading performance, after which it is not able to scale. This points to a hardware or software capacity issue, and it is advisable to look at resource utilization and queuing at the different application tiers.

These are some of the trends in performance issues, and there can be many more like them. Similar trends can be observed in the server-side metrics as well. The technique here is, instead of looking at one or two data points to analyze the performance, to look at the bigger picture and analyze the trend.
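A first-pass distinction between the consistent and increasing trends above can even be automated with a least-squares slope over the response-time series. This is only a rough sketch; the flat-tolerance threshold is an arbitrary illustration, not an industry value.

```python
# Rough sketch of trend classification via least-squares slope.
# The flat_tolerance threshold is an arbitrary illustrative value.

def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    xbar = (n - 1) / 2
    ybar = sum(values) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(values))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

def classify(resp_times, flat_tolerance=0.05):
    s = slope(resp_times)
    if abs(s) <= flat_tolerance:
        return "consistent"
    return "increasing" if s > 0 else "decreasing"

print(classify([5.0, 5.1, 4.9, 5.0, 5.05]))  # consistent (Trend 1 shape)
print(classify([1.0, 2.5, 4.0, 8.0, 16.0]))  # increasing (Trend 4 shape)
```

Spiky trends like Trend 2 would need an outlier check on top of the slope, since a few spikes barely change the fitted line.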

Correlation: As the name says, this is about establishing the relationship between performance metrics by comparing the data. A software system is always complex, with a multi-tier architecture, different technologies and interfaces to internal or external systems. The easy way to do correlation is to compare end-user performance with server-side metrics. It is like comparing the trend in one set of metrics, such as response time, with another, such as web server CPU utilization. What do we get by correlating the data? When you see a relationship between two sets of data, such as an increase in response time corresponding to an increase in web server CPU utilization, it uncovers an area for further investigation and analysis.

Figure 2 is a good example of the correlation technique.

It shows an association between client-side metrics and server data. The first graph shows 3 major spikes in response time under load. Correlating the response time with the server metrics shows a direct association between the response time spikes and application server queuing.

Analysis shows that the queuing in the application server is due to a shortage of threads, which later resulted in adding more threads to the server. Finally, all of this has a direct association with the increase in user load on the system. So the response time spikes are attributed to a temporary thread shortage in the application server caused by the increase in user load.

In summary, correlation can be done using any type of metrics. The initial trend analysis (single-user issue or scalability issue, static content issue or dynamic web page issue, hardware or software issue) can suggest the right metrics to correlate.
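Eyeballing two graphs can be backed up with a numeric correlation coefficient. The sketch below computes a plain Pearson correlation between response time and application server queue length; the paired samples are invented and assume both metrics were captured at the same timestamps.

```python
# Sketch: Pearson correlation between response time and a server metric.
# The paired samples are invented; both series are assumed to share timestamps.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

response_time = [1.1, 1.3, 2.0, 3.8, 5.2, 5.0]  # seconds
app_queue_len = [2, 3, 8, 20, 31, 30]           # queued requests

r = pearson(response_time, app_queue_len)
print(f"r = {r:.2f}")  # a value near 1.0 says the queue is worth investigating
```

A high coefficient does not prove causation, of course; it only tells you which pair of metrics deserves the drill-down described later.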

Comparison: The term comparison can be interpreted in many ways, and the literal meaning of correlation itself is comparison. In this discussion, however, comparison refers to comparing acceptable performance metrics with unacceptable ones. How does that help in analyzing the bottleneck? Here is the answer. The client and server performance data collected during acceptable performance gives a good baseline for understanding the acceptable thresholds for the system and its resources. The same metrics from an unacceptable run give you an easy way to compare and figure out what is significantly different. The places where the differences appear are the areas to look for bottlenecks.

As the proverb says, "A picture is worth a thousand words", so the comparison picture in Figure 3 explains this better.

Test 1 is an example of acceptable performance, where the response time is below 5 seconds. Test 2 shows a gradual increase in response time, up to 100 seconds under load. For the same 2 tests, the server-side metrics (web container thread queue length, app server CPU utilization and DB server CPU utilization) are compared. In Test 1, the queue length does not show sustained growth, which is in line with the response time. In Test 2, however, the queue length shows significant growth, which is an indicator of a backlog in the application server. So let us look at the app server CPU utilization. In Test 1, CPU utilization was around 100% but the response time was still acceptable; in Test 2, utilization dropped drastically yet there was a significant backlog of work. What does that mean? The bottleneck is somewhere else, causing the app server to do things more slowly. So the outstanding area to look at here is the DB server. The DB server CPU utilization in Test 1 is around 30%; interestingly, in Test 2 the DB CPU is hitting 80-100%. Considering all the metrics together, it can be seen that the DB is doing more work than usual. Further analysis of DB usage (top wait events, slowest running SQL etc.) will help to drill down further.

The comparison technique works best alongside other techniques like trend analysis and correlation. Combining these techniques with your analytical skills will help you figure out which metrics to compare and understand where the bottleneck is.
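The baseline-versus-problem comparison lends itself to a mechanical first pass: flag every metric whose value moved by more than some percentage relative to the baseline. The metric names, values and the 50% threshold below are illustrative, loosely modeled on the Figure 3 discussion.

```python
# Sketch: flag metrics that moved sharply between a good and a bad test run.
# Metric names, values and the 50% threshold are illustrative only.

baseline = {"web_queue_len": 3, "app_cpu_pct": 95, "db_cpu_pct": 30}
problem = {"web_queue_len": 40, "app_cpu_pct": 45, "db_cpu_pct": 90}

def suspects(good, bad, threshold_pct=50):
    """Metrics whose relative change from baseline exceeds threshold_pct."""
    out = {}
    for name, base in good.items():
        change = (bad[name] - base) / base * 100
        if abs(change) >= threshold_pct:
            out[name] = round(change)
    return out

# Queue length explodes, app CPU drops, DB CPU triples: all three stand out.
print(suspects(baseline, problem))
```

Note that a large *drop* (here the app server CPU) is as interesting as a rise: an under-utilized tier sitting behind a growing queue points at a bottleneck downstream.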

Elimination: This is a technique where you can use your cognitive skills and be as innovative as possible. The idea is to remove certain components from the list of culprits and focus on the others to identify the bottleneck. It is like removing hardware resources and focusing on software resource bottlenecks, removing the DB server and focusing on the app server, or removing the server bottlenecks and concentrating on the code.

Let us understand this using an example (Figure 4) from a custom application for EDI translation. The application server has 3 different pools of resources: (File) Translation, OR (Reporting) and Archival. The server CPU utilization is around 95% during core processing, when Translation, OR and Archival threads are all in use. Since high CPU is a concern based on the application hardware sizing, we need to figure out what is causing it. Once the Translation is completed and none of its threads are in use, the CPU is still at 90%. That means OR and Archival contribute 90% of the CPU utilization. Next comes elimination point 2: when all the OR threads are out of use, the CPU comes down to 30%. This clearly shows that OR/Reporting was contributing 60% of the CPU.

Pattern Matching: This is a technique of comparing the performance issues in the system under analysis with commonly seen issues and their causes in other systems. There are many common performance issues: one of the servers not being configured properly in a load-balanced environment, resulting in regular spikes in response time; a huge spike in response time in an otherwise stable system due to web server cache refreshes; or a gradual increase in response time due to memory leaks. So the technique here is to compare the observed performance issues with commonly seen issues and their root causes in the industry. In this way, it is easy to jump in and focus on the commonly implicated components.

To show this technique in detail, let us take a look at a test result from a Java-based Order Management System running at 1500 orders per hour for 48 hours. The Save Order transaction takes sub-second response time at the beginning of the test, but shows an increasing trend and finally reaches a 100-second response time towards the end. This is a clear indication of performance degradation over time. Using the technique mentioned here, the root cause could be a memory shortage over time. Analyzing the Java memory usage clearly indicated that the used memory increased from 200 MB to 1 GB. This was due to a memory leak in the system. See the diagram below for the analysis.
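The memory-leak pattern in that example has a recognizable signature: the used heap measured after each garbage collection keeps climbing instead of returning to a stable floor. The sketch below encodes that signature as a check; the sample series and the growth ratio are illustrative assumptions, not JVM-derived thresholds.

```python
# Sketch of a memory-leak signature check on post-GC heap readings (MB).
# The sample series and growth ratio are illustrative assumptions.

def looks_like_leak(heap_after_gc_mb, min_growth_ratio=2.0):
    """Flag a leak when the post-GC floor keeps rising and ends well above the start."""
    rising = all(b >= a for a, b in zip(heap_after_gc_mb, heap_after_gc_mb[1:]))
    return rising and heap_after_gc_mb[-1] >= min_growth_ratio * heap_after_gc_mb[0]

print(looks_like_leak([200, 350, 520, 700, 1000]))  # True  -- climbing floor, like the 200 MB -> 1 GB case
print(looks_like_leak([200, 210, 205, 198, 207]))   # False -- stable floor, healthy heap
```

Measuring the floor after GC, rather than raw usage, is the important part: raw heap usage in a healthy JVM also sawtooths upward between collections.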

Drill Down: This is a very common technique of narrowing the focus down to a component by drilling down. It is pretty much like starting your investigation at one point and drilling further to see what exactly the problem is. This technique can be applied to any type of component or any metric.
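As a small illustration of drilling down through method timings, the sketch below follows the most expensive child at each level of a nested timing breakdown until it reaches a leaf. The transaction name, call tree and millisecond values are entirely invented for the example.

```python
# Sketch: drill down through a nested timing breakdown (all values invented),
# following the most expensive child at each level until reaching a leaf.

timings = {
    "SaveOrder": {
        "total_ms": 5200,
        "children": {
            "validate": {"total_ms": 120, "children": {}},
            "persist": {"total_ms": 4900, "children": {
                "insert_sql": {"total_ms": 4700, "children": {}},
                "audit_log": {"total_ms": 150, "children": {}},
            }},
            "notify": {"total_ms": 180, "children": {}},
        },
    },
}

def drill(name, node, path=""):
    """Return the path and cost of the hottest leaf under this node."""
    path = f"{path}/{name}"
    children = node["children"]
    if not children:
        return path, node["total_ms"]
    worst = max(children, key=lambda c: children[c]["total_ms"])
    return drill(worst, children[worst], path)

print(drill("SaveOrder", timings["SaveOrder"]))
# -> ('/SaveOrder/persist/insert_sql', 4700)
```

The same narrowing motion applies whether the levels are tiers, components or methods: at each step you keep only the branch that dominates the cost.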

As most of us agree, performance bottleneck analysis is more of an art than a science. So these techniques are just guidelines for doing the analysis, and you need not be limited to them. Combining these techniques and many others with analytical, cognitive and mathematical skills will definitely help you identify bottlenecks. Every performance bottleneck analysis is an exploration task, so in addition to all these skills, focus and perseverance are the keys to success.