The Cockcroft Headroom Plot - Introducing R

May, 2008
by Adrian Cockcroft

About the Author
Adrian Cockcroft

Adrian is best known as the author of four books including Sun Performance and Tuning (2 editions); Resource Management; and Capacity Planning for Internet Services. In his 16 years at Sun he worked in technical sales and marketing, led creation of the BluePrints best practice publishing program, tested very complex integrated systems, was a leader of Sun's Six Sigma program and was the Chief Architect and Product Boss for Sun's High Performance Technical Computing business unit. In this time he gave many training classes and consulted with a wide range of customers, most notably as the on-site capacity planning consultant for the Salt Lake 2002 and Athens 2004 Olympic Games.

Joining eBay in 2004, he initially worked for Operations Architecture, investigating new platforms and providing guidance to the capacity planning groups at eBay and PayPal. As a founding member of eBay Research Labs in 2005, Adrian helped define the initial strategy for the Labs and an Innovation Forum. He researched operations-related platforms and processes, led research into advanced Skype plugin applications, contributed to development of the Skype4Java API and prototyped advanced wireless/mobile applications. During 2006 he published an IEEE paper on simulating large-scale peer-to-peer networks, and a CMG paper on utilization measurement problems.

Adrian has consulted on architecture, scalability and performance for the Bebo.com social network, and is an advisory board member for Infovell and Holocosmos.

In 2007 Adrian joined Netflix as a Director of Web Engineering, directing a team responsible for research and development of scalable personalized web architectures.

Adrian filed two patents on capacity planning techniques while at Sun, and four patents related to peer to peer marketplaces while at eBay.

Adrian has a blog at http://perfcap.blogspot.com where he discusses capacity planning techniques, new computer technology, and how markets and innovation interact. He is also a member of the Homebrew Mobile Phone Club, and several local classic car clubs.


I wrote a CMG 2006 paper titled "Utilization is Virtually Useless as a Metric!". The follow-on question is: what should we use instead? My answer is to plot response time vs. throughput, and I've been thinking about a very specific way to display this kind of plot. Since I'm feeling quite opinionated about this, I'm going to call it a "Cockcroft Headroom Plot", and I'm going to try to construct it using various tools. I will work my way through the development in this and three subsequent MeasureIT articles.

The starting point is a dataset to work with. I found an old iostat log file that recorded a fairly busy disk at 15-minute intervals over a few days. This gives me 250 data points, which I fed into the R stats package for further examination. I'll also have a go at making a spreadsheet version.

The iostat data file starts like this:

                 extended device statistics
r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t %w %b device
14.8 78.4 183.0 2446.3 1.7 0.6 18.6 6.6 1 21 c1t5d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0 0 c0t6d0
...

I want the second line as a header, so I save it to a file (my command line is actually on OS X, but it could be Solaris, Linux, or Cygwin on Windows):

% head -2 iostat.txt | tail -1 > header

I want the c1t5d0 disk, but I don't want its first line, since that is the average since boot. I also want to add back the header:

% grep c1t5d0 iostat.txt | tail +2 > tailer
% cat header tailer > c1t5.txt
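The same extraction could also be done inside R instead of with shell tools. The following is only a minimal sketch, assuming the log has the layout shown above. The first four sample lines are quoted from the excerpt; the last three (a second interval) are made-up placeholder values so that there is something left after the since-boot line is dropped.

```r
# Sample lines in the iostat layout shown above.
# Lines 1-4 are quoted from the article; lines 5-7 are MADE-UP placeholders.
raw <- c(
  "                 extended device statistics",
  "r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device",
  "14.8 78.4 183.0 2446.3 1.7 0.6 18.6 6.6 1 21 c1t5d0",
  "0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0 0 c0t6d0",
  "                 extended device statistics",
  "r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device",
  "10.0 50.0 100.0 1500.0 0.5 0.4 5.0 4.0 0 15 c1t5d0"
)
# With a real log file this would be: raw <- readLines("iostat.txt")

header <- raw[2]                                          # second line holds the column names
disk   <- grep("c1t5d0", raw, fixed = TRUE, value = TRUE) # keep only lines for this disk
disk   <- disk[-1]                                        # drop the since-boot average

# Parse header + remaining rows as a whitespace-delimited table
c1t5 <- read.table(text = c(header, disk), header = TRUE)
```

This avoids the intermediate header and tailer files, at the cost of reading the whole log into memory first.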

Now I can import the file into R as a space-delimited file with a header line. R doesn't allow "/" or "%" in names, so it rewrites the header to use dots instead (and prefixes an "X" where a name would otherwise start with an invalid character, which is why %w becomes X.w). R is a script-based tool with a command line and a very powerful vector/object-based syntax. A "data frame" is a table-like data object, similar to a sheet in a spreadsheet; it has names for the rows and columns, and can be indexed.

> c1t5 <- read.delim("c1t5.txt",header=T,sep="")
> names(c1t5)
 [1] "r.s" "w.s" "kr.s" "kw.s" "wait" "actv" "wsvc_t" "asvc_t" "X.w" "X.b" "device"
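The renaming seen in the names() output can be reproduced directly with make.names(), which is the base R function that read.table() and read.delim() apply to column names by default (check.names=TRUE). A small sketch, using the column names from the iostat header:

```r
# Column names exactly as they appear in the iostat header line
iostat_names <- c("r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
                  "wsvc_t", "asvc_t", "%w", "%b", "device")

# make.names() turns each string into a syntactically valid R name:
# invalid characters become "." and an "X" is prepended where needed
clean <- make.names(iostat_names)
print(clean)
```

Passing check.names=FALSE to read.delim() would keep the original names, but columns such as r/s would then need backquotes whenever they are referenced.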

I only want to work with the first 250 data points. So I subset the data frame by indexing the rows with an array (1:250) that selects the rows I want and leaving the column selector blank.

> io250 <- c1t5[1:250,]
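For readers new to R, here is a small sketch of the same row and column indexing on a toy data frame. The data frame df and its values are made up purely for illustration.

```r
# Toy data frame (made-up values, for illustration only)
df <- data.frame(a = 1:5, b = c(10, 20, 30, 40, 50))

first3 <- df[1:3, ]     # rows 1 to 3, column selector left blank = all columns
col_b  <- df[, "b"]     # row selector left blank = all rows; one column comes back as a vector
middle <- df[2:4, "a"]  # rows 2 to 4 of column a only
```

The general form is df[rows, columns], and leaving either side of the comma empty selects everything along that dimension.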

The first thing to do is summarize the data. The output is too wide, so I'll do it in chunks by selecting columns.

> summary(io250[,1:4])
      r.s              w.s             kr.s             kw.s
 Min.   :  1.80   Min.   :  1.8   Min.   :  13.5   Min.   :   38.5
 1st Qu.: 10.30   1st Qu.: 87.1   1st Qu.: 107.4   1st Qu.: 2191.7
 Median : 18.90   Median :172.4   Median : 182.8   Median : 4279.4
 Mean   : 22.85   Mean   :187.5   Mean   : 290.1   Mean   : 4448.5
 3rd Qu.: 28.88   3rd Qu.:274.6   3rd Qu.: 287.4   3rd Qu.: 6746.6
 Max.   :130.90   Max.   :508.8   Max.   :4232.3   Max.   :13713.1
> summary(io250[,5:8])
      wait             actv            wsvc_t           asvc_t
 Min.   : 0.000   Min.   :0.0000   Min.   : 0.000   Min.   : 1.000
 1st Qu.: 0.000   1st Qu.:0.3250   1st Qu.: 0.400   1st Qu.: 3.125
 Median : 0.600   Median :0.8000   Median : 2.550   Median : 4.700
 Mean   : 1.048   Mean   :0.9604   Mean   : 5.152   Mean   : 4.634
 3rd Qu.: 1.300   3rd Qu.:1.5000   3rd Qu.: 6.350   3rd Qu.: 5.700
 Max.   :10.600   Max.   :3.5000   Max.   :88.900   Max.   :15.100
> summary(io250[,9:10])
      X.w             X.b
 Min.   :0.000   Min.   : 2.00
 1st Qu.:0.000   1st Qu.:20.00
 Median :1.000   Median :39.50
 Mean   :1.428   Mean   :37.89
 3rd Qu.:2.000   3rd Qu.:55.00
 Max.   :9.000   Max.   :92.00

Looks like a nice busy disk, so let's plot everything against everything (pch=20 sets a solid dot plotting character):

> plot(io250[,1:10],pch=20)

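The throughput is either reads + writes per second (r.s + w.s) or KB read + KB written per second (kr.s + kw.s). The response time is wsvc_t + asvc_t, since iostat records the time a request spends waiting in the queue before being sent to the disk (wsvc_t) as well as the time it is actively being serviced by the disk (asvc_t).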

To save typing, I attach to the data frame so that the names are recognized directly.

> attach(io250)
> plot(r.s+w.s, wsvc_t+asvc_t)
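An alternative to attach() is with(), which evaluates a single expression against the data frame's columns without changing the search path. The sketch below is not what the article does; it simply computes the two throughput measures and the response time using the two rows quoted in the iostat excerpt above. The names io, iops, kbps and response are chosen here for illustration.

```r
# Two rows copied from the iostat excerpt earlier in the article
io <- data.frame(r.s    = c(14.8, 0.0),  w.s    = c(78.4, 0.0),
                 kr.s   = c(183.0, 0.0), kw.s   = c(2446.3, 0.0),
                 wsvc_t = c(18.6, 0.0),  asvc_t = c(6.6, 5.0))

# with() looks names up in the data frame for this one expression only
iops     <- with(io, r.s + w.s)         # I/O operations per second
kbps     <- with(io, kr.s + kw.s)       # KB transferred per second
response <- with(io, wsvc_t + asvc_t)   # queue wait + active service time (ms)
```

The plot call would then be with(io250, plot(kr.s + kw.s, wsvc_t + asvc_t)), which gives the same picture as the attached version.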

This looks a bit scattered, because there is a mixture of average I/O sizes that vary during the time period. Let’s look at throughput in KB/s instead.

> plot(kr.s+kw.s,wsvc_t+asvc_t)

That looks promising, but it’s not clear what the distribution of throughput is over the range. We can look at this using a histogram.

> hist(kr.s+kw.s)

We can also look at the distribution of response times.

> hist(wsvc_t+asvc_t)
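hist() can also return its binning without drawing anything, which is useful when the bin counts are needed for a custom plot. A minimal sketch, using a made-up vector x rather than the iostat data:

```r
# Made-up values, for illustration only
x <- c(1, 2, 2, 3, 3, 3, 7, 8)

# plot = FALSE suppresses drawing and just returns the histogram object
h <- hist(x, plot = FALSE)

h$breaks   # bin edges chosen by hist()
h$counts   # number of observations falling in each bin
```

There is always one more break than there are counts, and the counts add up to the number of observations.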

The starting point for the thing that I want to call a "Cockcroft Headroom Plot" is all three of these plots combined in one figure. This means rotating the response time histogram 90 degrees so that its axis lines up with the main plot. After looking around in the manual pages, I eventually found an example that I could use as the basis for my plot. It still needs some more cosmetic work, but I defined a new function chp(throughput, response), shown below.

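> chp <- function(x,y,xl="Throughput",yl="Response",ml="Cockcroft Headroom Plot") {
    # bin both variables without drawing; the counts feed the margin bar plots
    xhist <- hist(x,plot=FALSE)
    yhist <- hist(y, plot=FALSE)
    xrange <- c(0,max(x))
    yrange <- c(0,max(y))
    # 2x2 layout: scatter plot bottom left, histograms above and to the right
    nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
    layout.show(nf)
    # panel 1: the main scatter plot
    par(mar=c(3,3,1.5,1.5))
    plot(x, y, xlim=xrange, ylim=yrange, main=xl)
    # panel 2: throughput histogram along the top
    par(mar=c(0,3,3,1))
    barplot(xhist$counts, axes=FALSE, ylim=c(0, max(xhist$counts)), space=0, main=ml)
    # panel 3: response time histogram, rotated, down the right side
    par(mar=c(3,0,1,1))
    barplot(yhist$counts, axes=FALSE, xlim=c(0, max(yhist$counts)), space=0, main=yl, horiz=TRUE)
  }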
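The layout() call is the least obvious part of the function, so it may help to look at the matrix it is given. Each cell holds the number of the plot that will be drawn there, in drawing order, and 0 leaves a cell empty. The variable name m is used here only for illustration.

```r
# The same matrix that chp() passes to layout()
m <- matrix(c(2, 0, 1, 3), 2, 2, byrow = TRUE)
print(m)
#      [,1] [,2]
# [1,]    2    0
# [2,]    1    3
```

So the first plot drawn (the scatter plot) lands in the bottom-left cell, the second (the throughput histogram) goes above it, the third (the rotated response time histogram) goes to its right, and the top-right corner stays empty. The widths c(3,1) and heights c(1,3) make the scatter plot cell three times as wide and as tall as the histogram strips.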

The result of running chp(kr.s+kw.s,wsvc_t+asvc_t) is close...

That's enough to get started.

Look for my next article named "Cockcroft Headroom Plot - R Version" which will appear in the June issue of MeasureIT.