Tales from the Lab: Best Practices in Application Performance Testing
by Ellen Friedman
This article discusses best practices for the use of a performance test laboratory throughout the application development life cycle. Thorough functional and stress testing is critical to the successful deployment of any system, whether it is brand new or a modification to an existing set of applications. We stress the importance of adhering to strict policies for performance stress testing throughout the application development life cycle in accordance with the principles of performance engineering, in addition to the more widespread practice of functional testing. The article also outlines a rigorous testing methodology in the lab to ensure accurate results. In addition, it stresses the need for reporting results from lab testing to management.
A companion paper, presented at CMG 2005, presents a case study that illustrates the methodology and testing principles outlined here.
Testing throughout the application life cycle
A critical success factor in developing any new application, or making changes to an existing one, is being able to subject it to comprehensive testing prior to production deployment. Thorough testing of the application prior to deployment serves several purposes:
- Verify that the application conforms to its functional specifications
- Identify application design flaws, bugs, and other errors that were missed by the developers
- Verify that the new application adheres to production standards and guidelines for deployment
- Verify that the new application integrates safely and smoothly with the current production environment
- Understand the hardware and software requirements associated with deploying the application in production
- Ensure that the application meets established service levels for availability and performance
Thorough testing reduces the risk of errors and can minimize downtime in production after deployment. In order to meet these goals, it is necessary to be able to "test drive" the application early and often throughout the project life cycle, based on the best current understanding of expected production conditions.
Accomplishing these goals requires the development of a test environment that accurately simulates the behavior of the current and future production environment. A suitable testing environment allows all members of the application development planning teams, including database, applications, systems, business liaisons, and performance groups, to verify their assumptions, evaluate and optimize the system and application design, and understand the application's hardware and software requirements. Comprehensive testing allows you to quantify the impact of the new application on the business.
The discipline of performance engineering provides the rationale for application performance testing throughout the application development life cycle. Stress testing the application's performance constraints should proceed with increasing rigor as the application progresses through its stages of development. Performance results that are obtained from preliminary versions of the application under development can be used to review system, database, and application design decisions. Just like other design flaws, performance problems that can be identified early in the application development life cycle are much less costly to fix than those that are not discovered until the application is presumed ready to be deployed.
Laboratory testing is required throughout the application development life cycle. Early in the project, the lab can be used to validate critical architectural design decisions. It can also be a place where the designers can experiment safely with new technologies and compare them. The lab environment continues to play a critical role in allowing the design team to understand the impact of key design decisions during later coding and integration phases. Prior to production deployment, the lab can be used for acceptance testing and, ultimately, to help plan for the production deployment. The lab can continue to play a key role in post-production change management.
The test lab is also the place where performance and volume stress testing is performed to assess the scalability of the new or re-designed application. Performance is one more element of quality that a finished application must deliver. The performance test lab is a facility for proactively assessing the satisfactory delivery of service to users prior to system implementation or roll-out. The lab should be thought of as a vehicle that can be used to identify and resolve design and performance problems. Effective testing will minimize the risk to the business of deploying a poor quality application by ensuring that satisfactory performance levels can be achieved consistently, while maintaining good availability.
With new e-business applications accessible across the public Internet, performance considerations are apt to be even more critical to acceptance of the application. If performance is poor, users are likely to abandon the web site and choose a different vendor to complete a purchase. Poor performance equates to lost revenue, and it can also lead to higher overhead costs in both hardware and personnel to support the system in production.
To address these concerns prior to production deployment, new application software must be thoroughly tested in a laboratory environment whose hardware and software approximate the production environment as closely as possible. A well-designed test lab provides a controlled environment to support the range of testing required throughout the application life cycle.
Setting Up the Testing Labs
Testing applications during the course of their life cycle requires re-configuring the lab for multiple tests or creating multiple labs. The lab may be in a single location with hardware that is changed as the application progresses, or there may be several labs with hardware dedicated to the specialized environment that it is supporting at the time.
The locations for testing will be dependent upon various technical, business, or political contexts. The following factors can influence the decisions you make about the number of labs required and how to plan your test environment:
- Personnel who will perform the testing
- Size, location, and structure of your application project teams
- Size of your budget
- Availability of physical space
- Location of testers
- Use of the labs after deployment
Whether the lab is housed in the same facility or across multiple physical facilities, discrete test environments are usually required to support Unit Testing, Systems Integration Testing, Stress Testing, and Systems Certification. The functions supported by each of the laboratory environments are detailed below.
- Application Unit Testing Lab
  - Hardware or software incompatibilities
  - Design flaws
  - Performance issues
- Systems Integration Testing Lab
  - User Acceptance Testing
  - Application compatibility
  - Operational or deployment inefficiencies
  - Windows 2003 features
  - Network infrastructure compatibility
  - Interoperability with other network operating systems
  - Hardware compatibility
  - Tools (OS, third-party, or custom)
- Volume Testing Lab
  - Performance and capacity planning
  - Baseline traffic patterns
  - Review traffic volumes without user activity
- Certification Lab
  - Installation and configuration documentation
  - Administrative procedures and documentation
  - Production rollout (processes, scripts, and files; back-out plans)
The Unit Testing Environment is used to validate that individual features, components, or applications function properly. Unit testing begins when design starts and continues until the design is stable. Unit testing uncovers design and performance issues early on in the project development life cycle.
The Systems Integration Testing environment is used to validate that features and components work together cohesively. While unit tests address the depth of a component, integration tests address the breadth of a system. Inter-system communication and compatibility issues are addressed during this phase. Systems Integration Testing requires a more fully equipped test lab, where testers can carefully control test configurations and conditions.
Typically, full-blown Stress Testing is conducted immediately prior to the production implementation. Stress Testing should be supported on as close to an exact replica of the production environment as possible to avoid any controversy about the interpretation of the results. A valid Stress Test environment may require full production hardware that simulates production as closely as possible.
In contrast, performance testing can be conducted throughout the application life cycle on hardware that is a scaled-down version of the production environment. As discussed above, it is important to assess the performance of the application early and often throughout each of the testing phases. A skilled performance analyst should be involved beginning as early as the unit testing phase to uncover any unforeseen design or performance issues. A Performance Lab that serves this purpose might, for instance, have a limited number of web servers that constitute the front-end cluster servicing application requests. The key is being able to characterize the performance of the application in a way that effectively identifies bottlenecks and problems without having all the hardware that is available for the production system. If you manage the testing process correctly, you should be able to extrapolate reliably from results in the test lab with 3 web servers to the performance of the same application in the production environment with 10-20 web servers.
The same physical laboratory facility can be used to house the different functions. While it is certainly easier to create separate environments, each with its own set of dedicated servers, it may not be financially practical to do so. Regardless of where the testing is conducted, each of the basic functions must be tested.
The Testing Process
Thorough testing of a complex application requires careful planning. To gauge how well the system design performs, you must develop and design a realistic test-bed that well represents the conditions and variations you expect to encounter in production. Personnel responsible for testing need to be wary of scope creep; it is simply not feasible to test everything. Since you cannot test everything, the approach you must take to determine what aspects of the application to test is a heuristic one, not a deterministic one.
One approach is to focus on testing those aspects of the system that can help you learn the most about the range of its expected performance. This principle suggests that you should certainly develop scripts that mimic the most frequently exercised user interactions. When the focus is on performance, however, it is also important to exercise functions that stress system capacity and help you pinpoint the application's performance constraints. For this reason, it is important to test the impact of update transactions that stress the machine's I/O capacity or requests that require the server to return large amounts of data over the network.
Another important principle of stress testing is that you should change only one variable at a time as you iterate through a series of related tests. If you change multiple settings from one run to the next, it is no longer possible to determine the root cause of a particular effect.
Finally, application testing should focus on the areas and events that pose the greatest risk to a successful deployment of the application. These would typically encompass high volume transactions and functions that exercise critical business logic.
It is important to keep your suite of test cases manageable, as each specific test environment is devoted to a different purpose. Generally, the test cases and scripts that are developed at an early phase of testing can be re-used in later stages. Reusing the test suite in subsequent stages also provides a form of regression testing that ensures that application quality is improving at each delivery stage. Ultimately, when you reach a final system certification phase, testing needs to encompass as many functions as possible to ensure that the system will perform in production as expected.
Developing a Test Plan
First, define a test plan that describes your scope, objectives, and methodology. You can then target your environment to focus on the specific tests that are to be conducted. You must also develop a process for prioritizing tests in the lab and define the level of satisfactory completion of a test or series of tests that allows you to proceed to the next initiative.
A project plan should be developed that addresses:
- Number and type of staff required for test execution (e.g., testers, developers, DBAs) and staff required for test analysis (systems and database performance, DBAs)
- Test priorities
- Testing timelines based on requirements for
- Staff resource
- Specific tools or hardware
- Software/Application availability
- Software delivery schedule for production
- Hardware availability
You should design test cases that describe the test scenarios and issues that you need to address. It is extremely important to have a consistent, clear method for running the tests and for evaluating and documenting the results. During the planning phase, you define the "problem" to be solved and the method of solving it, which is to test a specific set of conditions in the lab environment. The target system of hardware, software, and workload is defined along with the analysis methodology.
As part of the analysis methodology, you will define what data to collect and how to capture it. If you have a standard set of tools that are run to collect performance data, make sure that the same set of performance counters is enabled in each test, as system overhead will be affected by the sampling frequency as well as the number of metrics measured. It is also important to have a standard set of procedures to facilitate the analysis. You may need to use ARM (Application Response Measurement) or otherwise instrument the application in order to obtain key timings between application interfaces and to obtain volume information.
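As an illustration of the kind of instrumentation described above, the following Python sketch records elapsed times and call counts around a call that crosses an application interface. It is not the ARM API; the `timed` helper, the `lookup_package` function, and the interface name are hypothetical names invented for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Interface name -> list of elapsed times in seconds (one entry per call).
timings = defaultdict(list)

@contextmanager
def timed(interface_name):
    """Record how long the enclosed block takes under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[interface_name].append(time.perf_counter() - start)

def lookup_package(package_id):
    # Hypothetical stand-in for a call across an application interface.
    time.sleep(0.01)
    return {"id": package_id}

for package_id in range(5):
    with timed("lookup_package"):
        lookup_package(package_id)

# The list length gives volume information; the values give timings.
for name, samples in timings.items():
    mean_s = sum(samples) / len(samples)
    print(f"{name}: calls={len(samples)}, mean={mean_s:.4f}s")
```

Keeping the instrumentation identical from run to run matters for the same reason the counter set does: the measurement overhead then stays constant across tests.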
When designing the test-bed, consider what you are trying to simulate, the metrics that must be captured and summarized, and the workloads of importance. Resource profiles need to be developed for critical applications functions in the transaction mix. In some cases you may need to capture system activities such as backups, utilities, performance collectors, etc. and in other cases these activities may be ignored and only important application processes are profiled and included in the script.
Creating the Baseline
The first step in stress testing an application is creating a baseline workload that accurately reflects enough of the projected production workload so that a proper sizing effort can be performed. First, you need to define the major business functions and map them to the corresponding application and database functions. Then, decide which functions need to be included for realistic testing and which can be stubbed (or managed with placeholders) so that they are excluded from the test suite.
It is critical to establish that the baseline is both a valid and stable representation of the complete system under examination in order for it to be used reliably for prediction. One method for validating the test-bed is to compare the performance results measured in the lab with those obtained from an existing production system. Obviously, this method is not available to you for applications that are brand new without any history of their usage patterns, which makes establishing a valid baseline for them even more challenging. In the case of a brand new application, you can begin with a prototype of the application and later refine the analysis with tests conducted during the systems integration phase. By that point in time, it will be feasible to construct realistic scripts and usage patterns.
A second requirement is that the baseline workload demonstrates stability over time. You need to run the baseline workload in the lab multiple times in the same environment and measure the variability for the key performance metrics, including the response time of the application. Test workload variability can then be computed using standard summary measures of dispersion calculated for the key performance metrics. The standard deviation, the range (maximum-minimum), and the inter-quartile range (difference between the 75th and 25th percentiles) are measures of variability. You may need to understand various load-dependent factors like database or web page caching that can make stability difficult to achieve in the lab over the course of a test of limited duration. For example, the database management system used in the application may support pinned data where specific data tables are fixed in memory and non-pageable.
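The measures of dispersion named above can be computed with Python's standard `statistics` module. This is a minimal sketch: the response times are made-up sample values, and expressing the standard deviation as a percentage of the mean (the coefficient of variation) is an assumption about how a percentage variability figure might be derived, not something the article prescribes.

```python
import statistics

def dispersion_summary(samples):
    """Summarize the run-to-run variability of one performance metric."""
    # quantiles(n=4) returns the 25th, 50th, and 75th percentiles.
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation
    return {
        "mean": mean,
        "stdev": stdev,
        "range": max(samples) - min(samples),
        "iqr": q3 - q1,
        # Assumption: variability as a percentage of the mean.
        "cv_percent": 100.0 * stdev / mean,
    }

# Hypothetical mean response times (seconds) from five baseline runs.
baseline_response_times = [2.10, 2.05, 2.20, 2.15, 2.08]

summary = dispersion_summary(baseline_response_times)
for measure, value in summary.items():
    print(f"{measure}: {value:.3f}")
```

Note that percentile conventions differ slightly between tools, so the inter-quartile range should be computed with the same method for every run being compared.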
The baseline workload you establish is implemented as a series of scripts that are used to simulate the system using load testing software. For each major business function the application performs, you must develop scripts that exercise that function. To develop these scripts, do the following:
- Map the business function to the applications and systems with which it interfaces
- Identify the system processes or transactions associated with the business function
- Review application and system flow charts and modify them to document how/what will be tested.
- Develop user profiles/resource profiles for each operational scenario
- Modify any functionality to represent planned changes.
- Determine whether data and external systems are available for testing with the local application
- Identify the external system interfaces and determine which will be included in the simulation.
Note that instead of trying to interface directly with some external system in real-time during the simulation, you can usually simulate the external interface specification in the context of the testing script. For example, the data returned by calls to the external system interface might be pre-loaded into a flat file in advance for use during the test runs.
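One way to picture this is a stub that answers external-interface calls from pre-loaded data. The sketch below is hypothetical: the `StubbedExternalSystem` class, the column names, and the routing data are invented for illustration, and an in-memory file stands in for the flat file so the example is self-contained.

```python
import csv
import io

class StubbedExternalSystem:
    """Answers lookups from canned responses instead of a live system."""

    def __init__(self, flat_file):
        reader = csv.DictReader(flat_file)
        # Load every canned response up front, before the test run starts.
        self._responses = {row["request_key"]: row["response"] for row in reader}

    def lookup(self, request_key):
        # A KeyError here means the test data does not cover this request.
        return self._responses[request_key]

# Stand-in for a flat file prepared in advance of the test runs.
canned_responses = io.StringIO(
    "request_key,response\n"
    "ZIP-10001,ROUTE-A\n"
    "ZIP-94105,ROUTE-B\n"
)

external = StubbedExternalSystem(canned_responses)
print(external.lookup("ZIP-10001"))  # prints ROUTE-A
```

Loading the file once at start-up keeps file I/O out of the measured interval. A stub like this does not reproduce the latency of the real external system unless a delay is added deliberately.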
For baseline testing you need to establish the volume of application activity that needs to be simulated, using both typical (or average) workloads and expected peak loads. You also need to ensure this workload volume is distributed properly during the test. Most systems, for example, do not experience uniformly distributed arrival rates of work requests. In the real world, arrival rates are "bursty". Varying the load of the test cycle, and not just running a linearly increasing load, is critical to emulating this "burstiness".
Ultimately, the application's business drivers may hold the key to the workload request rate that you need to achieve. In the package shipping example in the case study that complements this article, the number of packages sorted per hour serves as a business driver. In financial systems, the number of checks processed may be the business driver of activity that must be simulated during testing.
Finally, consider what timeframes you need to simulate. Is the processing similar throughout the day, or are there different workflows and business functions that vary with the time of day? If time of day or time of the year matters, you may need to build test data and scripts to represent each of the application's critical timeframes.
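A rough sketch of a bursty arrival schedule follows, in Python. It alternates between a base request rate and a higher burst rate and draws exponentially distributed gaps between requests. The rates, the burst timing, and the function name are assumptions made for the example, and choosing the rate at the start of each gap is a simplification rather than an exact model of production traffic.

```python
import random

def bursty_arrival_times(duration_s, base_rate, burst_rate,
                         burst_every_s, burst_length_s, seed=42):
    """Return request timestamps (seconds) with periodic bursts of load."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    arrivals = []
    t = 0.0
    while True:
        # A burst occupies the first burst_length_s of every cycle.
        in_burst = (t % burst_every_s) < burst_length_s
        rate = burst_rate if in_burst else base_rate
        t += rng.expovariate(rate)  # gap until the next request
        if t >= duration_s:
            break
        arrivals.append(t)
    return arrivals

# Hypothetical 10-minute test: 1 request/s normally, with a 10-second
# burst of 10 requests/s at the start of every minute.
arrivals = bursty_arrival_times(600, base_rate=1.0, burst_rate=10.0,
                                burst_every_s=60, burst_length_s=10)

burst_count = sum(1 for t in arrivals if t % 60 < 10)
quiet_count = len(arrivals) - burst_count
print(f"requests during bursts: {burst_count}, outside bursts: {quiet_count}")
```

Seeding the generator means every run replays the same schedule, which supports the repeatability that baseline testing depends on.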
Building the Lab Hardware and Software Environment
For valid predictions, the lab environment used for stress testing the application needs to simulate the production environment. The following checklist provides an example of the kinds of specific configuration and software settings that need to be checked and verified prior to executing a test in the lab. The specific list of items here is pertinent to the case study presented in the companion paper. Because the application tested in the case study is Windows-based, some of the items included here are unique to Windows and Microsoft SQL Server applications. The example can easily be extended to other operating systems and test conditions. Specific definitions are included in the Glossary of Terms at the end of the article.
- Run Discovery Tools on the Application Servers in Production and Test and verify [Glossary 5]:
  - OS: e.g., Windows 2000, Windows 2003 Enterprise Edition, Windows 2003 Standard Edition
  - System patch levels, application versions, database maintenance and patches, other ancillary system software and monitors
- Develop a checklist for hardware. You can utilize hardware monitors or Discovery Tools to verify hardware in both production and in the lab environment. Some examples of configuration validation include:
  - Verify that the hardware in the lab matches production for all servers, i.e., application server, database server, workstations, and client PCs
  - Verify disk drive configuration: RAID 1+0 or RAID 5? Controller settings and cache sizes. [Glossary 7]
  - Verify CPU configuration: Is hyperthreading enabled? [Glossary 8]
  - Verify memory configuration: Do you have pinned data? Do you have SQL Server memory capped at a specific level? Can the OS address all the memory defined, or do you require Address Windowing Extensions (AWE) to address more than 4 GB of memory? [Glossary 9]
  - Other physical hardware verification checks as necessary.
- Develop, review and compare network diagrams for test and production. Incorporate network monitoring tools as part of the testing process in the lab.
- Conditions to verify include:
  - Gigabit Ethernet or 10/100
  - Internet/Intranet connectivity
  - Are all servers on the same subnets, similar to production?
  - Where is the firewall?
  - Communication to Active Directory and DNS
- Identify all production servers to be part of the system (web, application, database, FTP, print, mail, etc.)
- Capture data from production to load into the test database and bring the database to the proper state (e.g., End of Day or Beginning of Day state)
- Utilize ghost imaging or software such as PowerQuest or Live State to save the database and system state between test runs [Ref. 3].
- Back up and restore server images after/before each test run to ensure repeatability of tests
- Consider running a "priming script" to get the system to the correct state prior to starting the test, or consider running the test for a longer period of time and using the first 30 or 60 minutes of the test as the "priming" set-up [Glossary 3].
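The comparison of production and lab configurations in the checklist above can be partly automated once the settings have been collected. The following Python sketch assumes the output of a discovery tool has already been reduced to simple dictionaries; the setting names and values are hypothetical examples, not the format of any real tool.

```python
def compare_configurations(production, lab):
    """Return {setting: (production_value, lab_value)} for each mismatch."""
    mismatches = {}
    for setting in sorted(set(production) | set(lab)):
        prod_value = production.get(setting, "<missing>")
        lab_value = lab.get(setting, "<missing>")
        if prod_value != lab_value:
            mismatches[setting] = (prod_value, lab_value)
    return mismatches

# Hypothetical inventory data for one database server in each environment.
production_db = {
    "os": "Windows 2003 Enterprise Edition",
    "raid_level": "RAID 10",
    "hyperthreading": True,
    "sql_max_server_memory_mb": 6144,
    "network": "Gigabit Ethernet",
}
lab_db = {
    "os": "Windows 2003 Enterprise Edition",
    "raid_level": "RAID 5",
    "hyperthreading": True,
    "sql_max_server_memory_mb": 6144,
    "network": "10/100",
}

differences = compare_configurations(production_db, lab_db)
for setting, (prod_value, lab_value) in differences.items():
    print(f"{setting}: production={prod_value}, lab={lab_value}")
```

Any mismatch reported this way should either be corrected before the test or recorded in the test documentation, since it limits how far lab results can be extrapolated to production.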
Running the Performance tests
If you are careful, you can stress test the application against a more workable, but scaled-down, version of the expected production system. (Later, for either a full application stress test or a Quality Assurance certification run, it may be necessary to run with all system components.) For example, if you suspect that the web server environment is scalable, and production will require 4 load-balanced web servers, then you should be able to initially test with 2 web servers running half of the expected load. Similarly, if you have an application farm, you should be able to initially execute with a subset of the full production application server farm.
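The arithmetic behind the web server example is simple enough to sketch in Python. The function name and the request volume are hypothetical, and the calculation assumes that load is spread evenly across servers and that the tier scales linearly, which is exactly the assumption the full-scale tests are later meant to confirm.

```python
def scaled_test_load(production_servers, production_peak_load, lab_servers):
    """Load to drive in the lab so per-server load matches production."""
    # Assumption: load balances evenly and the tier scales linearly.
    per_server_load = production_peak_load / production_servers
    return per_server_load * lab_servers

# Hypothetical: production expects 2,000 requests/minute across 4 servers.
lab_load = scaled_test_load(production_servers=4,
                            production_peak_load=2000,
                            lab_servers=2)
print(lab_load)  # 1000.0, i.e. half the expected production load
```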
Running the performance test scripts and evaluating the results is accomplished as follows:
- Define the server, client, and user scenarios to be tested and configure them in your test harness
- Execute the tests in a controlled, repeatable fashion
- Establish the baseline for analysis and measure the test variability. Note: there should be minimal variability between baseline execution runs.
- Analyze the performance data and repeat if necessary.
- Review the performance measurements and related throughput and response time metrics to determine if there are any obvious system performance problems or application issues requiring changes.
- Document the test results
- Extrapolate any results as necessary
- Document everything and record the who, what and why of the test.
- Identify any obvious bottlenecks, test flaws etc.
- Report and review results with test team
- Escalate problems to the proper people/management levels for resolution.
- Make necessary changes in the scripts to represent "what-if" scenarios. Re-execute and evaluate results. If required, iterate steps 2-7
- Report and review results with management
- Identify any necessary changes to the application and server environment to be tested in the lab later
- Go back to the lab and adjust the test plan and timelines as necessary
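When a "what-if" run is compared against the baseline, the question is whether the measured change is larger than the run-to-run variability of the baseline itself. The Python sketch below applies that idea using the rules of thumb from the Glossary of Terms (baseline variability under 5-10%, and a change of at least 10-20%). The response times are invented, and using the standard deviation as a percentage of the mean as the variability figure is an assumption of this example.

```python
import statistics

def evaluate_change(baseline_runs, changed_runs, min_change_percent=10.0):
    """Compare a what-if scenario against repeated baseline runs."""
    baseline_mean = statistics.mean(baseline_runs)
    changed_mean = statistics.mean(changed_runs)
    # Assumption: variability = standard deviation as a percent of the mean.
    variability = 100.0 * statistics.stdev(baseline_runs) / baseline_mean
    change = 100.0 * abs(changed_mean - baseline_mean) / baseline_mean
    # The change must exceed both the noise and the minimum threshold.
    significant = change > variability and change >= min_change_percent
    return significant, change, variability

# Hypothetical mean response times in seconds.
baseline = [2.10, 2.05, 2.20, 2.15, 2.08]   # repeated baseline runs
after_tuning = [1.70, 1.75, 1.68]           # runs with one setting changed

significant, change, variability = evaluate_change(baseline, after_tuning)
print(f"change={change:.1f}%, baseline variability={variability:.1f}%, "
      f"significant={significant}")
```

The function reports only the size of the change, so the analyst still has to check its direction: a large change can be a regression as easily as an improvement.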
After a series of tests is completed, a report detailing the findings should be prepared. A template for reporting should include an executive summary and a detailed analysis section. The executive summary highlights in non-technical language the test objectives, scope, and results; recommends the next steps to be taken; and details any action items. The results section should clearly identify whether service levels were met and provide detailed analysis of where delays in the application were found. A decomposition of end-to-end response time, highlighting the time spent in each server, should be presented. The analysis section should describe the methodology employed, the tools utilized, and the results and their impact.
The results should clearly identify whether application, architectural design, hardware, or database changes will be necessary to meet performance objectives. The report should point out the next steps required and whether additional testing will be necessary. Sufficiently detailed performance data should be provided to support your recommendations and action items.
Readers of your report will want to understand what steps were taken to ensure the reliability and validity of the results. This means documenting the lab configuration and testing methodology. The hardware, software, network, and OS that were used in the lab tests must be clearly itemized. Other items that need to be incorporated into your detailed report include the following:
- A list of administrative tools (standard Windows Server tools, third party, and custom-built).
- A list of the upgrades, such as service packs, drivers, and basic input/output system (BIOS), which must be installed on the OS.
Some specific items to document and track can be found in the following tables. Tables 1-4 are templates that can be modified to record test objectives, other special test requirements, necessary follow-up items, test priorities, and outcomes, as well as to summarize test results.
The first table provides an overall summary of the test scope and objectives, including the test script executed (or test case), special requirements, test purpose, and expected results.
Table 1. Identify test, scope and objectives
Table 2 below is a reporting template that is used to monitor the progress of your tests and to ensure that all follow-up issues are resolved. To track the results of your testing, use the table to record whether the test passed, failed, is in progress, or is unknown. You'll also want to include the name of the person who is responsible for testing the application and the date that testing was completed or is due to be completed.
Table 2. Track the test results
The template shown in Table 3 is used to document test prioritization and any special test considerations.
Table 3. Priority of Test Cases
Table 4 shows an example of items to track and record regarding the test outcome and any necessary follow-up items.
Table 4. Test Outcomes and Follow-up Items
Reporting should also include the issues you uncover/track and what initial tests to conduct as part of your testing strategy. You should extend this list with additional issues that are appropriate for your organization.
At this point, we have discussed the methodology for developing an application stress test plan and have identified the issues you might encounter in building different types of lab environments to support testing. We emphasized the need for testing throughout the application life cycle, as well as the key components to monitor and track during testing. We have reviewed the specific requirements for performance testing, especially the need to establish a valid baseline prior to test execution.
In Part II of this article, to be presented at the CMG 2005 conference, we focus on implementing the methodology and applying it to a specific case study. The case study will present various testing scenarios and a comparison of their results; it also illustrates the differences between lab findings and actual production measurements after initial deployment. The case study will include specific details regarding test accuracy and validity, and performance data collection and analysis for Windows applications. We will also review pitfalls to avoid when testing and the clear need for documenting results and validating the baseline test cases.
Glossary of Terms
Measures of Dispersion: measures of variability or spread of the data. For test validity, we expect to see small differences between test runs as quantified by measures of variability (dispersion). We expect variability to be less than 5-10%. If test variability is greater, it is difficult to measure efficacy of any change.
Measuring Efficacy: This is where we quantify whether or not the change had a significant impact. We expect the change to be greater than the variability observed when repeating the baseline. Typically we would expect to see at least a 10-20% change in order to prove that the change was effective or had a positive impact.
Test Stability: Tests need to be run and evaluated after steady state is achieved. There is normally a lot of overhead when starting a script. As an example, initially all datasets must be opened, and buffer pools are not at their optimal state until transactions are executed. We don't want to compromise evaluation of test effectiveness by including the beginning of the run. Often, we also need to discount the end of the run, as the load is winding down. Typically, one measures at a stable or level loading.
Software Performance Engineering: As defined by Connie Smith, it "provides a systematic, quantitative approach to constructing software systems that meet performance objectives" (Ref: Smith 1990, 1992 CMG Proceedings and www.perfeng.com).
Discovery Tools: Tools that find and track hardware and software over LAN or WAN networks. These tools are typically used to identify configurations and to track and inventory IT assets.
SQL Server Capping: A method of reducing or fixing the amount of memory that SQL Server is allowed to utilize. One can fix the minimum and maximum allowable to try to keep the OS from stealing any memory from SQL Server. The danger is that if SQL Server is given too much memory, it can "starve the OS". You need to monitor the value chosen to balance what is available to SQL Server against what is available for other system and application functions. To limit SQL Server's memory footprint, set the value of max server memory.
RAID 10, RAID 5: RAID (redundant array of independent disks) levels 0, 1, and 5 are typically implemented with databases.
RAID 10 or RAID 1+0: This level is also known as mirroring with striping. This level uses a striped array of disks, which are then mirrored to another identical set of striped disks. For example, a striped array can be created using five disks. The striped array of disks is then mirrored using another set of five striped disks. RAID 10 provides the performance benefits of disk striping with the disk redundancy of mirroring. RAID 10 provides the highest read/write performance of any of the RAID levels at the expense of using twice as many disks. Because it offers the highest read/write performance, it is most often utilized for databases.
RAID 5: Also known as striping with parity. It stripes the data in large blocks across the disks in an array and also writes the parity across all the disks. Data redundancy is provided by the parity information. The data and parity information are arranged on the disk array so that the two are always on different disks. Striping with parity offers better performance than disk mirroring (RAID 1). However, when a stripe member is missing, read performance degrades (for example, when a disk fails).
Hyperthreading: With Hyper-Threading technology, one physical Xeon processor can be viewed as two logical processors each with its own state. The performance improvements due to this design arise from two factors: an application can schedule threads to execute simultaneously on the logical processors in a physical processor, and on-chip execution resources are utilized at a higher level than when only a single thread is consuming the execution resources. Some applications cannot make use of hyperthreading.
Address Windowing Extensions (AWE): In Microsoft® SQL Server 2000, you can use the Microsoft Windows® 2000 Address Windowing Extensions (AWE) API to support up to a maximum of 64 gigabytes (GB) of physical memory. The specific amount of memory you can use depends on hardware configuration and operating system support.
Pinned Data: A concept typically used for Oracle databases, where specific tables are essentially defined as fixed in memory and non-pageable.
References
1. Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software. Dr. Connie Smith, Dr. Lloyd Williams, Addison-Wesley Press
2. "Performance Testing in the Lab," Ellen Friedman, CMG Proceedings, 2005.
3. Symantec, Live State: http://www.symantec.com/index.htm
4. Windows 2000 Performance Guide. Mark Friedman, Odysseas Pentakalos, O'Reilly & Associates, 2002.
5. Windows 2000 Resource Kit. Microsoft Press
6. SQL Server 2000 Performance Tuning. Microsoft Press