Workload Characterization And Modeling For High Performance Computing
November 1, 2003
by Michael A. Salsburg
Introduction
In the domain of High Performance Computing (HPC), benchmarks are available to compare and contrast algorithms and hardware architectures. Some of the currently popular benchmarks that stress test various aspects of the architecture are:
- STREAM - CPU/memory bandwidth
- Lmbench - memory/CPU latency
- SPECfp - floating point performance
- LINPACK - combination of the above 3
- HPL - LINPACK + MPI (Interconnect) + I/O
Many application-focused benchmarks are being evaluated today. The following are some application-level benchmarks proposed by the Standard Performance Evaluation Corporation (SPEC) organization:
- SPECchem2002 - quantum chemistry
- SPECenv2002 - weather forecasting
- SPECseis2002 - seismic, petroleum location
- SPECOMP2000 - OpenMP for SMP systems
These benchmarks provide empirical data regarding the performance of a specific computer configuration, but they are not helpful in gaining insights into the underlying performance issues that surround specific computer architectures. There is a current need for an abstraction of the workload to a level where one can easily evaluate and propose the proper hardware configuration. The workload characterization could also be applied to the research and development of new computer architectures that can be designed to address the specific characteristics of the workload.
Most workload characterization for High Performance Computing (HPC) is highly granular: memory traces reveal cache hit rates at the various levels of CPU caching. Such characterizations also identify "hot spots" in the code where significant CPU time is being consumed and which must therefore be optimized.
There is a very good reason for the focus on memory access. Miniaturization continues at a fairly predictable rate, and the end result is ever faster CPU speeds and ever larger memory chips. Memory latency, however, is not falling at the same rate that CPU speeds are increasing. Figure #1 shows the differential in the trends.

Figure #1 - Comparison of CPU and Memory Frequencies
(Sources: ITRS, Intel, Samsung)
The Performance Evaluation Research Center (http://perc.nersc.gov/main.htm) is devoted to the scientific study of HPC performance evaluation in general. It is an excellent source for information and knowledge collaboration. At the present time, most of the efforts are focused on workload characterization and modeling at a granular scale where the codes are adjusted to optimize memory access times, either by increasing the cpu cache hit rates or by adjusting the message passing protocols.
Although memory access optimization is critical for HPC applications, this level of characterization does not address higher-level issues, such as:
- What are the I/O characteristics of HPC applications and how can I/O configurations affect performance?
- What are the interconnection characteristics and how do changes in the topology / latency / throughput affect the application?
- What is the tradeoff between SMP systems and Clustered Systems or Clusters of Multiprocessors (CLUMPS[6])?
- What speed-up can we expect with the next generation of commodity CPUs?
These issues will become more important as HPC performance evaluation moves away from memory-intensive benchmarks and starts focusing on more and more diverse applications. Another factor that comes into play is that today’s academic applications, which are primarily researched and examined within a well-controlled environment, will be tomorrow’s business applications, which will need quick deployment and minimal administration and tuning. As such, they will require performance analysis that can only focus on first order effects. For example, detailed instrumentation may not be possible.
This paper is focused on a "macro view" of workload characterization and performance modeling. It reviews some of the industry standard, fundamental principles of workload characterization for computer performance evaluation, providing insights into recent enhancements. It then proposes a new enhancement to address the basic architectural issues that are emerging regarding HPC.
The primary requirement for workload characterization, in this paper’s context, is to establish the essential performance statistics to be used to drive a high-level performance model. This high-level model is intended to investigate the interplay of HPC applications and the underlying system (hardware and software) on which they are executing. There have been many papers and books written on the subject of workload characterization, so this paper is not intended to cover the literature (an excellent overview is provided in [1]). Instead, this paper is intended to present the basic motivation for workload characterization and to help in the process of establishing what statistics need to be gathered. The actual technicians who work with HPC benchmarks, applications and hardware specifications can then provide their own expertise in how the data can be assumed or gathered.
Fundamentals of Workload Characterization and Modeling
Formal workload characterization has its roots in the original modeling effort to understand the performance characteristics of the Time Sharing Option (TSO) on an IBM 360 system in the late 1960s. The basic statistics used to drive that model are still considered important to understand the first order impact of workload and hardware configurations on computer performance.
The Central Server Model is a basic building block in computer performance evaluation. It provides an abstract model that can be quite useful in understanding the performance characteristics of the system.

Central Server Model
In this basic model, transactions are shown as arriving in the "System", using cpu and disk and eventually leaving the system. The memory portion is usually not modeled in the high level approach due to the detailed nature of determining cache-hit probabilities. It should be understood that, for finer and finer detail, memory modeling might come into play.
At the high level of the model, a transaction arrives, uses the cpu, and then branches off with a certain probability to use a disk. Once the disk request completes, the transaction is either completed or it goes back to another cpu request, again branches off to use a disk and so on until the transaction exits. Note that multiple transactions may be active in the system at the same time, thus there is a probability that a request for service either from the cpu or disk will result in waiting time in a queue. The busier the cpu or disks become, the higher the queueing probabilities.
The individual request times for either the cpu or disk are considered to be random variables, with a known average service time and distribution. For example, if the disk service time is specified as 5 milliseconds in the model, it is understood that this is a uniform random variable with an average value of 5 milliseconds. Similarly, the requests for service from the cpu are usually considered to be exponentially distributed. These probability distributions have been determined from years of industry experience and measurements.
A simple central server model can be decomposed into two disjoint parts: the workload characterization and the configuration. With decomposition comes the ability to independently change workload or configuration characteristics and investigate "what-if" scenarios. The primary resulting statistic that is used to establish "good" performance from "bad" performance is the transaction delay time, from when it first enters the system to when it completes. For example, the model could be used to see that, if the transaction arrival rate doubles, the delay time will increase by a factor of 5. The next "what-if" scenario could then explore the effects of using a CPU that is twice as fast and examining its ability to reduce the transaction delay time.
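The "what-if" reasoning above can be sketched numerically. The following is a minimal illustration, not the model used in [3] or [5]: it assumes an open system with Poisson arrivals and exponentially distributed service at every station (each cpu or disk treated as a single M/M/1 queue), so that the average residence time at a station is demand / (1 - utilization). The function names and the workload numbers are hypothetical.

```python
def residence_time(arrival_rate, demand_s):
    """Average time in seconds a transaction spends at one station,
    waiting included, under the assumed M/M/1 behaviour.

    demand_s is the total service time per transaction at the station.
    """
    utilization = arrival_rate * demand_s
    if utilization >= 1.0:
        raise ValueError("station saturated: utilization >= 100%")
    return demand_s / (1.0 - utilization)


def transaction_delay(arrival_rate, cpu_demand_s, disk_demands_s):
    """Sum of the residence times at the cpu and at every disk."""
    delay = residence_time(arrival_rate, cpu_demand_s)
    for demand_s in disk_demands_s:
        delay += residence_time(arrival_rate, demand_s)
    return delay


# Hypothetical workload: 40 ms of cpu and 20 ms of disk per transaction.
base = transaction_delay(10.0, 0.040, [0.020])        # 10 arrivals/second
doubled = transaction_delay(20.0, 0.040, [0.020])     # what-if: rate doubles
faster_cpu = transaction_delay(20.0, 0.020, [0.020])  # what-if: cpu twice as fast
print(base, doubled, faster_cpu)
```

Because the workload (arrival rate and demands) and the configuration (how fast each device serves a request) enter as separate arguments, either side can be changed independently, which is the point of the decomposition.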
The configuration can be abstracted into a number of individual components. These components have specific performance attributes. The workload characterization can be expressed as average values for measured or assumed statistics. The following are the component attributes and workload statistics, as described in detail in [3] and [5].
The overall system can be decomposed into a relatively small number of workload types. For example, a complex weather forecasting application could have activity that decomposes into LU Factorization, Graphics rendering and Mesh resolution.
For each workload type, the following statistics are necessary:
- Average Arrival Rate (arrivals/second)
- Average CPU Time per arrival (in ms, measured on a specific CPU style)
- Priority for CPU processing (integer)
- Average number of disk requests (I/O / s)
For each disk, disk_i where i = 1..n, the following statistics are necessary:
- Probability of requesting disk_i
- Average disk transfer size (MB)
- Probability of a disk seek
- Probability of disk cache hit
In a simple Central Server model, the configuration consists of two component types, disk and cpu.
The attributes for a cpu type are:
- Number of processors
- CPU rating, using a standard methodology such as SPECint and SPECfp
The attributes for a disk type are:
- Average Latency Time (ms)
- Transfer Rate (MB / s)
- Average Seek Time (ms)
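One way to see how these statistics and attributes fit together is to hold them in small records and derive a per-request disk service time and a rescaled CPU time from them. The sketch below is only an illustration: the class and field names are hypothetical, the rule that a cache hit avoids the seek and latency but still pays the transfer is an assumption, and so is the linear scaling of CPU time with the CPU rating.

```python
from dataclasses import dataclass


@dataclass
class DiskType:
    """Configuration attributes of a disk."""
    latency_ms: float
    transfer_mb_per_s: float
    seek_ms: float


@dataclass
class DiskUsage:
    """Workload statistics for one disk, per workload type."""
    request_probability: float
    transfer_mb: float
    seek_probability: float
    cache_hit_probability: float


def disk_service_ms(disk, usage):
    """Average service time of one request, in milliseconds."""
    transfer_ms = usage.transfer_mb / disk.transfer_mb_per_s * 1000.0
    mechanical_ms = usage.seek_probability * disk.seek_ms + disk.latency_ms
    # Assumption: a cache hit skips the mechanical delay entirely.
    miss_probability = 1.0 - usage.cache_hit_probability
    return miss_probability * mechanical_ms + transfer_ms


def scaled_cpu_ms(cpu_ms, measured_rating, target_rating):
    """Rescale a CPU time measured on one CPU style to another.

    Assumption: CPU time is inversely proportional to the rating.
    """
    return cpu_ms * measured_rating / target_rating


disk = DiskType(latency_ms=4.0, transfer_mb_per_s=50.0, seek_ms=8.0)
usage = DiskUsage(request_probability=1.0, transfer_mb=0.05,
                  seek_probability=0.5, cache_hit_probability=0.25)
print(disk_service_ms(disk, usage))
```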
Given these necessary workload characterization statistics and configuration attributes, one could estimate the performance implications for various changes, such as:
- increased transaction rate
- changes in the distribution of I/Os
- changes in the average CPU time per transaction
One could estimate the effects of reconfiguration:
- Replace the disk with a faster device
- Replace the CPU with a faster CPU
The central server model has been effectively used over the last three decades to address architectural issues and to provide a high level methodology to predict bottlenecks and plan for capacity. It is not a panacea for all levels of performance evaluation. For example, one cannot, from this simple model, estimate the effects of better affinitization or a change in the caching algorithms or cache size for the CPU. On the other hand, it has been used to effectively manage performance for the Fortune 500 companies since they computerized four decades ago.
Extensions to the Central Server Model
Over the years, there have been many extensions to the central server model. Again, a number of books and papers have been published on this subject. Some extensions that have a direct bearing on HPC performance evaluation are now discussed.
The I/O subsystem has become far more complex over the years. Additional detail, including disk controllers, cache hit rates and solid-state disk, has been added. Distributed processing has continued to gain importance, from grid computing to HPC clusters to general internet commerce. The details within [3], [4], [5] and [6] provide insights into a method for extending the central server model so that workload characterization and configuration attributes can be used to explore performance within a distributed processing environment. These extensions are mainly focused on what has now become a standard template for internet commerce, where the functionality is broken into three logical tiers: web service, application logic and database processing.
In the interoperable model, as presented in these papers, a disk sub-model is replaced by a remote sub-model that can then provide a recursive definition for a distributed system. The graphic below shows this extension.

Enhanced Central Server Model for the Interoperable Enterprise
In this extension, the branching probabilities for disk access are modified to include requests that go through a network and then execute on a remote central server. Note that this can then recursively define many such branches until the transaction is satisfied. Additional delay is handled in the model by the "network" sub-model. Although the network sojourn is shown graphically as a single object, this could constitute a number of "hops" from network appliances through links until the destination server is reached. The attributes for the various network components are described in detail in [3] and [4].
Workload characterization is augmented by replacing some of the specific disk statistics with transmission statistics:
- Average size for send message (MB)
- Average size for return message (MB)
- Replace the statistics for disk_i with the central server statistics for the next level
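The recursive definition can be sketched in a few lines. This is a deliberately simplified illustration that ignores queueing at every component: the delay of a request is its local cpu and disk time plus, for each remote branch, the branching probability times the network time and the delay of the remote central server. The class names, fields and the single-link network formula are assumptions made for the example, not the attributes defined in [3] and [4].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    cpu_s: float    # cpu seconds per request at this server
    disk_s: float   # local disk seconds per request
    remote_calls: List["RemoteCall"] = field(default_factory=list)


@dataclass
class RemoteCall:
    probability: float     # branching probability of this remote request
    send_mb: float         # average size of the send message
    return_mb: float       # average size of the return message
    link_mb_per_s: float   # assumed throughput of the network path
    link_latency_s: float  # assumed one-way latency of the network path
    target: Server         # central server model at the next level


def response_time(server):
    """No-contention delay of one request, following remote branches."""
    total = server.cpu_s + server.disk_s
    for call in server.remote_calls:
        network_s = (2.0 * call.link_latency_s
                     + (call.send_mb + call.return_mb) / call.link_mb_per_s)
        total += call.probability * (network_s + response_time(call.target))
    return total


# Hypothetical two-level system: application logic calling a database.
database = Server(cpu_s=0.010, disk_s=0.020)
application = Server(cpu_s=0.005, disk_s=0.0, remote_calls=[
    RemoteCall(probability=1.0, send_mb=0.001, return_mb=0.010,
               link_mb_per_s=10.0, link_latency_s=0.0005, target=database)])
print(response_time(application))
```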
In summary, this section provides motivation and insight into the collection of a minimal set of statistics and attributes so that the behavior of workloads on configurations can be explored and new insights can be gleaned from an abstract performance model. These statistics are, for the most part, available by examining time-based performance statistics that are normally gathered by the operating system. For example, CPU utilizations and disk I/O rates are available from the Windows Performance Registry or UNIX sar commands.
Additional instrumentation is required to determine transaction rates and to derive how the disk statistics are partitioned into the individual workload types. In a benchmarked environment, these statistics can be gathered by understanding the benchmark or by instrumenting the codes to collect this information. In addition, there are hardware-based tracing facilities and tools that can be used to gather real-time characterization data.
Workload Characterization for Parallel Systems
The Central Server model, as discussed above, with its extensions for distributed processing, can be used to explore HPC applications, but the parallelism inherent in HPC workloads is worth additional attention. The workload characterization of parallel systems, as discussed in [1], offers some key insights and leads to an additional extension to explore some of the high level issues encountered with HPC workloads.
For most commercial distributed workloads, a "transaction" proceeds, for the most part, in a sequential manner, requesting service from a CPU, a disk, another disk, a remote CPU, remote disk and so on until the transaction completes. HPC processing, on the other hand, achieves its performance by taking a problem and partitioning the problem into smaller tasks that then execute independently and in parallel. A typical workload will have one or more controlling tasks that then send a number of child tasks to many systems to work in parallel. When their tasks complete, there is a synchronization point where the results are resolved at the central task. New child tasks are then issued and so on until the processing is completed.
The synchronization of these independent tasks can have a large impact on the overall system performance. For this reason, it is suggested that synchronicity be added to the central server model. The following graphically shows this extension.

Extension for Parallel Processing
In this extension, the remote sub-model is replaced by a parallel system sub-model (ps). The central server model at the top portion is the location of the controlling task. It issues a parallel request to a number of central server sub-models (two are shown in the figure, but more are usually deployed). In general, the processing fans out to spawn a number of parallel tasks to be completed. It is implicitly understood that the overall request does not complete until all of the individual tasks fan into the central task.
In [1], a very general method for workload characterization of parallel systems is presented, in which task graphs are used to show all of the possible fan-in / fan-out steps within the workload. The model presented here assumes that a workload type will be in two general states - processing on the central task (fan-in) or parallel processing of a number of subordinate tasks (fan-out). The term out-degree refers to the number of tasks that will execute in parallel. As the out-degree increases, most applications will reach a point where the dominant contributor to delay will be in the form of communications, versus CPU and I/O processing. For example, a scenario is described in [2] where, ideally, 30 Sequent Symmetry systems could outperform a Cray Y-MP, but communication delays prevented this from occurring. One of the key areas to explore when modeling at this high level is where the efficiency of adding more systems in parallel ceases to increase.
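The point at which adding systems stops paying off can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the work divides evenly over the parallel tasks and that communication delay grows linearly with the out-degree; neither assumption comes from [1] or [2], and real applications will need a measured communication term.

```python
def elapsed_time(work_s, out_degree, comm_s_per_task):
    """Elapsed time of one fan-out/fan-in cycle under the toy model."""
    compute_s = work_s / out_degree          # perfectly balanced work
    comm_s = comm_s_per_task * out_degree    # assumed linear growth
    return compute_s + comm_s


def efficiency(work_s, out_degree, comm_s_per_task):
    """Speed-up relative to the serial work, divided by the out-degree."""
    speedup = work_s / elapsed_time(work_s, out_degree, comm_s_per_task)
    return speedup / out_degree


def best_out_degree(work_s, comm_s_per_task, max_degree):
    """Out-degree in 1..max_degree with the smallest elapsed time."""
    return min(range(1, max_degree + 1),
               key=lambda n: elapsed_time(work_s, n, comm_s_per_task))


# Hypothetical workload: 100 s of work, 1 s of communication per task.
for n in (1, 5, 10, 20, 30):
    print(n, elapsed_time(100.0, n, 1.0), efficiency(100.0, n, 1.0))
print("best out-degree:", best_out_degree(100.0, 1.0, 30))
```

Under these assumptions the elapsed time falls until the communication term matches the compute term and rises afterwards, while the efficiency per system declines steadily as the out-degree grows.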
Another important issue that should be explored for parallel processing applications is the inherent delay in synchronization (fan-in). In essence, the overall wait time will be the maximum of the wait times for the individual components. If these wait times follow a simple distribution, such as a uniform distribution, the average of the maximum wait time can be calculated using Order Statistics. For example, assume you are selecting from two streams of random numbers that are uniformly distributed between 0 and 1, with an average of 1/2. With each trial, you record the maximum value. It can be shown algebraically that the average of these maximums will approach 2/3. For three streams, the average will approach 3/4. As the number of "streams" increases, the average of the maximums will approach the maximum value of the distribution. Note also that the synchronization delay can be affected by the variance in the distribution of delays. As the variance increases, the maximums will increase.
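The 2/3 and 3/4 values follow from a standard order-statistics result: for n independent draws that are uniform between 0 and 1, the probability that the maximum is at most x is x^n, so the average of the maximum is n/(n+1). The short sketch below states that result exactly and checks it with a simulation; the function names, trial count and seed are arbitrary choices for the illustration.

```python
import random
from fractions import Fraction


def expected_max_uniform(n):
    """Exact average of the maximum of n independent Uniform(0, 1) draws."""
    return Fraction(n, n + 1)


def simulated_max_uniform(n, trials=100_000, seed=1):
    """Monte Carlo estimate of the same average."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.random() for _ in range(n))
    return total / trials


for n in (2, 3, 16):
    print(n, expected_max_uniform(n), simulated_max_uniform(n))
```

With 16 parallel tasks the exact average is 16/17, already close to the maximum value of the distribution, which is why the fan-in delay is driven by the slowest task rather than by the average one.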
The parallel distribution of tasks can essentially reduce the efficiency of the cpus used in the configuration due to either the synchronization delays or an unequal distribution of work for the tasks that are executing in parallel. A crucial element here is the ability to carefully balance the workload over all of the parallel tasks.
The following graph is reproduced from [1] and shows the effects of solving a matrix block decomposition problem with a 24x24 matrix. Note that all of the processors are busy for only 15% of the execution time. For more than 25% of the time either one or no processors are computing.

Distribution of CPU Utilization for 16 Processors
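A distribution of this kind can be derived from a simple trace. The sketch below assumes that the number of busy processors has been sampled at equally spaced instants; how the data in [1] was actually collected is not stated here, and the trace shown is made up for the example.

```python
from collections import Counter


def busy_distribution(busy_counts, processors):
    """Fraction of samples in which exactly k processors were busy,
    returned as a list indexed by k = 0..processors."""
    if not busy_counts:
        raise ValueError("empty trace")
    counts = Counter(busy_counts)
    total = len(busy_counts)
    return [counts.get(k, 0) / total for k in range(processors + 1)]


# Hypothetical trace of a 4-processor run, one sample per interval.
trace = [0, 1, 1, 2, 4, 4, 4, 4]
distribution = busy_distribution(trace, processors=4)
for k, fraction in enumerate(distribution):
    print(k, fraction)
```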
Gathering Empirical Data
The last sections were mainly descriptive in nature. This section focuses on a prescriptive approach. As suggested above, system activity and utilizations can be gathered from the operating system fairly easily. In addition to this information, workload-specific metrics need to be gathered. If the application is repetitively generating a unit of work, the frequency for the completions should be measured. For example, assume that a number of simultaneous equations are being solved. A log file should be created in which the completion is logged, along with the size of the equation.
If the workload is predominantly one of issuing central commands to run parallel tasks, the following should be collected:
- Number of fan-outs per time interval
- Average and standard deviation for the wait until completion of each task that fans in.
Again, this could be stored in a log for each iteration, or the statistics can be accumulated and reported in terms of activity per unit of time, with average and standard deviation reported.
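The accumulation described above could look like the following sketch. It assumes that the controlling task records, for each fan-out, the wait in seconds for each child task that fans in; the record layout and function name are hypothetical. In addition to the average and standard deviation, it reports the average of the per-fan-out maximum, since that is the wait the controlling task actually experiences at the synchronization point.

```python
import statistics


def fan_in_summary(records, interval_s):
    """Summarize fan-out activity observed during interval_s seconds.

    records is a list of (fan_out_id, [task_wait_s, ...]) tuples.
    """
    all_waits = [wait for _, waits in records for wait in waits]
    return {
        "fan_outs_per_s": len(records) / interval_s,
        "wait_mean_s": statistics.mean(all_waits),
        # Sample standard deviation; needs at least two observations.
        "wait_stdev_s": statistics.stdev(all_waits),
        "sync_wait_mean_s": statistics.mean(max(waits) for _, waits in records),
    }


# Hypothetical log: two fan-outs of three tasks each in a 10 s interval.
log = [(1, [1.0, 2.0, 3.0]), (2, [2.0, 2.0, 2.0])]
summary = fan_in_summary(log, interval_s=10.0)
print(summary)
```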
The tasks themselves will need to be instrumented to separately report the amount of time (average and standard deviation) that they are active on a specific system.
If a large number of I/Os are issued, either by the central controlling process or the individual tasks, the I/O count should be reported.
This application activity, along with the system level activity and utilization statistics, should be sufficient to provide the necessary statistics for a high-level performance model.
Finally, the general profile of the workload may evolve over time. For example, the first part of the application may be focused on gathering and filtering data. The second part may consist of intensive numerical activity. The final part may be devoted to rendering a graphical image. In this example, the workload is not homogeneous. Therefore, one should partition the workload into smaller segments, where each segment is a homogeneous workload. They are in effect very different computing systems. Each workload will then have to be analyzed separately.
Conclusion
The study of workload characterization and performance modeling started at the dawn of computing. When the first computers were devised to work on the "Manhattan Project", benchmarks were immediately run to determine if there was enough elapsed time to solve the necessary equations. Workload characterization and simple modeling methods were employed to make intelligent decisions regarding equipment purchases and configuration options. Today, we are simulating the detonation of nuclear weapons and are still faced with the same issues. A great deal of the recent literature has focused on the highly detailed issues of cpu/memory interactions and cpu cache behavior. In general, HPC applications are much broader and need to be understood at a higher level when faced with macro issues, such as distribution of computing activity over a cluster of systems or the most efficient methods to manage data storage and retrieval. Higher-level workload characterization will be beneficial in understanding some of the alternatives and in gaining more knowledge about the applications themselves.
Finally, today’s fundamental High Performance Computing research will, in all likelihood, reach a maturity level where a number of techniques and applications will enter the mainstream of commercial computing. Once it is deployed at that level, there will be a need to quickly evaluate ongoing performance and plan for the capacity of the system. A higher-level approach to workload characterization and performance modeling will play an essential role at this phase of the application’s life cycle.
Bibliography
[1] Calzarossa, M, Serazzi, G, "Workload Characterization: A Survey", Proceedings of IEEE, 1993.
[2] Foster, I and Michalakes, J, "MPMM: A Massively Parallel Mesoscale Model", Parallel Supercomputing in Atmospheric Science, pages 354-363. World Scientific, River Edge, NJ 07661, 1993.
[3] Salsburg, Michael - "Capacity Planning in the Interoperable Enterprise", CMG Proceedings, 1994
[4] Menasce, Daniel A. & Dregits, Duane, "On the Design and Implementation of a Capacity Management Tool for LAN Environments", CMG Proceedings, 1995.
[5] Salsburg, Michael - "Network Performance Modeling", CMG Proceedings, 1996.
[6] Cappello, Franck and Etiemble, Daniel, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks", www.sc2000.org/techpapr/, 2000.