June, 2009
by Jack Woolley, Kaiser Permanente
Kaiser has been using application network traffic analysis as part of our application performance problem determination. Unencrypted TCP/IP network activity often identifies the portion(s) of the application that result in poor overall performance.
People trained in this technique can identify poorly performing application modules/objects, CPU-constrained components, implementation configuration issues, and specific, poorly performing application SQL.
This paper presents the basic techniques of this type of problem determination, along with several real-life examples.
"Listening to the wire" has a long, fine tradition of effectiveness. Way back in ancient times, when a warrior leader wanted to know how far away the opposing army was, he'd put his ear to the ground and hear the horses' hooves signalling the oncoming army. Thus, the leader would be able to judge the location of the enemy. In the days of steam locomotives, train bandits would listen to the train track to determine when a train would arrive.

During the United States Civil War, the telegraph sent electrical pulses through a wire to its destination. These electrical pulses interacted with the earth's weak magnetic field to create microscopic wire movement that could be heard with the human ear. A person well versed in Morse code could (using only his ear) listen to the "dots-dashes" and intercept the telegraph messages by "listening to the wire".
This paper demonstrates that there is great potential in continuing this long tradition of "listening to the wire" for computer performance problem determination. Listening to the data transmission to and from a workstation can assist with application performance problem discovery.
There are many types of network monitoring tools. The network monitoring tools covered in this paper are the "bare essential" tools that a person performing problem determination should have.
These three tools are the Network "Sniffer", the Browser Log, and the Ping-Trace utility.
The Network Sniffer is the tool that is connected to the network and eavesdrops on the data bits going to and from the workstation ("listening to the wire"). There are several different types of this tool; each one has its own advantages and disadvantages.

In general, you want to select a tool that collects all the network traffic (not only TCP/IP traffic). It needs to be able to translate the packets of network data into meaningful groupings, and it needs to be able to point out any non-standard or non-compliant data structures (network "noise").
The tool should be able to store a copy of the monitored traffic in a standard format, so that it can be analyzed later.
There are several things that MUST be included in the translation of the data traffic:

The network address of the machine that is to receive the packet.
The network address of the machine (or network component) that sent the data. Depending on the network protocol being used, the tool should be able to show the network interface "Physical Address" or "MAC" address (a string of 12 hexadecimal characters, unique to the machine), or the IP address (four groups of numbers separated by ".", for IPv4).
The network protocol being used. TCP/IP is a common protocol, but many other protocols are still in use. (And some applications give an option of several protocols, not just one.)
A translation of the actual application data that is being transported between machines.
A count of the number of bits and bytes that make up the packet.
The network time when the packet was captured on the network. This time needs to be precise to the millisecond.
The amount of time (in milliseconds) between starting and ending the communication between these two addresses. (Later, this will be used to measure the workstation or server service times.)
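As a rough illustration of how the capture time and elapsed-time fields are used, the sketch below pairs each workstation request with the next packet returned to it and reports the elapsed milliseconds. The `Packet` record, its field names, and the `service_times` function are hypothetical and do not reflect the output format of any particular sniffer; the pairing rule is deliberately simple.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    time_ms: float  # capture time, in milliseconds
    src: str        # sending address
    dst: str        # receiving address
    length: int     # packet size in bytes


def service_times(packets, workstation):
    """Return the elapsed ms between each workstation request and the
    first packet that comes back to the workstation afterwards."""
    deltas = []
    pending = None  # time of the oldest unanswered request
    for p in sorted(packets, key=lambda p: p.time_ms):
        if p.src == workstation:
            if pending is None:
                pending = p.time_ms
        elif p.dst == workstation and pending is not None:
            deltas.append(p.time_ms - pending)
            pending = None
    return deltas
```

A long gap in the returned list points at the server (or the path to it) rather than the workstation, which is how the elapsed time is used in the case studies later in this paper.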
Many browsers have a Browser Log facility; however, the capability may be disabled because of the overhead (disk space) it uses.
In essence, every web page is made up of "objects" on the screen. (An object may be a drop-down list, a small picture, a background pattern, a data entry field, a link to a different page, etc.) There may be as many as 20 or 30 objects on a single web page.
Many of the requests for these objects are made in "parallel". (This is where multiple TCP/IP connections are created, and several server requests from the workstation browser are performed at the same time.) Normally, these objects are serviced and returned by the server(s) in random order. (This is why, when you browse the web using a slow connection, you can see different parts of the screen being presented in a random sequence.)
This Browser Log needs to be able to:
It is well known that a network "ping" command can indicate whether a machine is "alive" and connected to the network. Many people also know that a utility named "traceroute" (or "tracert" in Windows) can show the network route from one IP address to another.
The Ping-Trace utility goes a step beyond these two basic processes. In general it:
It can be useful to have the utility perform steps three through five at a specified interval. This way you can have the Ping-Trace utility monitor and log the network latency to each network stop at a constant interval (e.g., "every 2 seconds", "every 5 seconds", ...).
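A minimal sketch of interval-based latency logging follows. The loop simply measures every network stop once per round. `probe` is a hypothetical callable standing in for whatever measurement the utility performs; it is passed in so the example stays self-contained, and all names here are assumptions.

```python
import time


def monitor_hops(hops, probe, interval_s, rounds, sleep=time.sleep):
    """Measure latency to every hop once per round and return a log of
    (round number, hop number, address, latency in ms) tuples."""
    log = []
    for round_no in range(rounds):
        for hop_no, address in enumerate(hops, start=1):
            log.append((round_no, hop_no, address, probe(address)))
        if round_no < rounds - 1:
            sleep(interval_s)  # wait before the next round of measurements
    return log
```

Keeping the full log, rather than only the latest reading, is what makes it possible to see when a latency change began.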
There are many tools that fit these particular requirements. And this paper is not meant to recommend the use of any specific tool. (We advise each person/company to perform their own investigation into the tools that best fit their needs.)
We did this investigation, and these were the tools that best fit our needs:
Network "Sniffer":
Network Associates; Portable LAN Sniffer
Open Source Software: Wireshark
Browser Log:
Microsoft; Internet Explorer Resource Kit
Eric Lawrence: Fiddler
Ping-Trace Utility:
Neoworx; NeoTrace

Kaiser Permanente has been using these tools for many years to perform otherwise very difficult production performance investigations. We would like to share some of our success stories with you, to demonstrate the productivity of this problem determination methodology.
This application is deployed across the country at many locations. It is a very complex application that has the reputation of performing very well.
Overall, monitoring has shown the system to have very good performance; however, at two specific locations the clients had consistently complained about poor performance.
We were requested to perform an on-site investigation of the reported poor response times. We arrived at 8:00 sharp and started the Network Sniffer utility and the Ping-Trace utility.

The Network Sniffer was set to monitor the workstation of a specific front-desk agent who was employed to check in people for doctor appointments.
The Ping-Trace utility was set up to monitor the IP network route between the monitored workstation and the application production server.
Initially, as we monitored the performance, the resulting response times were quite fast (and well within the documented response time agreement). These good response times continued for more than an hour. Then, at about 9:30 in the morning, the response times suddenly increased to unacceptable levels.
The Sniffer tool indicated that the lengthy response times seemed to be coming from the application server; it was taking a long time to service requests. (This was indicated by a lengthy "Delta Time" between the outgoing Workstation requests and the subsequent Server responses.)
The Ping-Trace utility indicated that there were a total of ten network hops between the workstation and the server, and that the "normal" (before 9:30) network latency was just 31 milliseconds. At 9:30, the network latency between the third and fourth network hop was averaging 380 milliseconds, with latency spikes up to 760 milliseconds.
The Ping-Trace utility showed that this was a network latency issue, not an application performance issue. A screen print of the Ping-Trace utility was given to our network support personnel. This provided them with specific IP router/switch addresses to investigate.
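Locating the slow network segment from per-hop measurements can be sketched as below. The function name and the sample latencies are hypothetical; only the shape of the result (a jump of about 380 milliseconds between the third and fourth hop, across ten hops) mirrors this case.

```python
def worst_segment(hop_latency_ms):
    """hop_latency_ms[i] is the measured latency to hop i + 1.
    Return (from_hop, to_hop, increase_ms) for the largest increase
    between consecutive hops, or None if there are fewer than two hops."""
    worst = None
    for i in range(1, len(hop_latency_ms)):
        increase = hop_latency_ms[i] - hop_latency_ms[i - 1]
        if worst is None or increase > worst[2]:
            worst = (i, i + 1, increase)
    return worst


# Illustrative values only: ten hops, with a large jump at the fourth.
sample = [1, 2, 3, 383, 385, 386, 388, 390, 392, 395]
print(worst_segment(sample))
```

Because latency is cumulative along the route, every hop after the slow segment also looks slow; it is the first large increase that identifies the equipment to investigate.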
A few days later, the network support people discovered a network configuration error. The network route had mistakenly been configured to use only 50% of its total network bandwidth.
They corrected the configuration error, and there were no more poor response times reported from the location, all from "listening to the wire".
The clients at this medical facility were reporting mostly good response times, but they also complained about sporadic "partial screen" updates. (This is when portions of the screen update quite quickly, but other parts of the screen take much longer to be properly displayed.)
Again, at 8:00 sharp we set up the Network Sniffer utility to monitor the client workstation, and we set up the Ping-Trace utility to monitor the network latency of each network stop between the workstation and the application server.
We observed the reported "partial screen" update issue almost immediately. Areas of the workstation screen were taking significantly longer to be displayed than other areas. And it seemed that the "slow screen updates" were random portions of the screen. Sometimes it would be the upper right corner slow/missing; sometimes it would be the lower middle of the screen slow/missing.
The Ping-Trace utility showed that there were nine network hops between the workstation and the server, and that average network latency was just 15 milliseconds. However, NeoTrace also showed sporadic spikes in network latency of over 232 milliseconds.
Almost immediately, the Network Sniffer monitor showed incomplete packet transmissions, along with lengthy packet retransmissions. (This was a clue that the building network may not be within "Cat-5" network specifications.)
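One simplified way to count retransmissions in a capture is to look for data segments whose sender, receiver, and sequence number have already been seen. This is only a sketch under that assumption: the tuple layout and function name are hypothetical, and real protocol analyzers apply more elaborate rules.

```python
def count_retransmissions(segments):
    """segments: iterable of (src, dst, seq, payload_len) tuples in
    capture order. A data segment that repeats an earlier
    (src, dst, seq) is counted as a retransmission."""
    seen = set()
    retransmitted = 0
    for src, dst, seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure acknowledgements carry no data to resend
        key = (src, dst, seq)
        if key in seen:
            retransmitted += 1
        else:
            seen.add(key)
    return retransmitted
```

A count that is more than a small fraction of the data segments suggests a physical-layer problem such as the cabling issue found here.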
Looking under the agent's desk, we discovered a networking nightmare. The well-meaning agents had "cleaned up" the under-desk cable mess by wrapping "extra cable lengths" into orderly small loops and zip-tying them to existing power cables and power transformers.

We found network cables tightly wrapped around the transformers for LCD displays, the same cables wrapped tightly around small wall socket transformers, and several sections of network cables tightly tied (in parallel) with high amperage power cables.
These actions inadvertently introduced static noise into the data communications network (interfering with data transmission). Most of these actions are specifically forbidden by network "Cat-5" specifications.
These "Cat-5" compliance issues were reported, and the network and building facilities groups corrected them. The "partial screen" update complaints immediately stopped.
This web-based application provides the clients with the ability to review images of "scanned in" medical documents. The application had been deployed in one location without client complaints and delivered good response times.
However, in the initial rollout in another part of the country, the "cataloging" portion of the application was reported to be unacceptably slow. (The "cataloging" portion of the application is where the scanned image is reviewed and linked with a medical record, for long term storage.)

We were sent to perform an on-site investigation.
Just as we did at the other locations, we placed the Network Sniffer monitoring a local workstation and the Ping-Trace utility monitoring the network stops to the application server. Immediately, we observed the slow response times with the "cataloging" portion of the application.
The Ping-Trace utility indicated nine network hops between the workstation and the application server. It also indicated an 80-90 millisecond network latency for the location. (Since this was a remote location, 80-90 milliseconds was a reasonable network latency.)
The Network Sniffer tool showed that, as a result of requesting a single scanned document, over 1.5 megabytes of information traversed the network. Additional testing showed that regardless of the number of scanned pages in the document, the amount of data transferred remained roughly the same.
We knew that each page of the scanned images was about 75 KB, and most of the documents consisted of only two to three pages. (On average, there should be about 250 KB transmitted.) Where was all this data transmission coming from? And why?
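The gap between expected and observed traffic can be checked with a few lines of arithmetic. The sketch below uses the figures quoted above and assumes 1 megabyte = 1000 KB; the names are illustrative.

```python
PAGE_KB = 75        # approximate size of one scanned page
OBSERVED_KB = 1500  # "over 1.5 megabytes", taking 1 MB = 1000 KB


def expected_kb(pages, page_kb=PAGE_KB):
    """Image data expected for a document with the given page count."""
    return pages * page_kb


for pages in (2, 3):
    expected = expected_kb(pages)
    unexplained = OBSERVED_KB - expected
    print(f"{pages} pages: expected {expected} KB, unexplained {unexplained} KB")
```

Even for a three-page document, well over a megabyte of the observed traffic is not image data, and the fact that the total barely changed with page count suggests a fixed payload unrelated to the images.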
We used the Network Sniffer packet decoding ability to take a look at the application contents of the network packets. What we saw was a full list of all the medical departments in all the locations of that Division. Then, the packets contained the names of all the possible medical procedures in the Division.
Taking a closer look at the application client interface, we discovered that the department drop-down selection list had over 800 departments to choose from, and that the medical procedure drop-down selection list had over 10,000 entries. Now we knew where all this added traffic was coming from.
Checking with the implementation at the other site, we discovered that they had included fewer than one-tenth of the drop-down selections. (They were sending less data to the workstations, so their response times were good.)
Here we have an example of a good application being made to have performance problems as a result of a large amount of customized data placed in drop-down lists. (This, with a cursory look at the client interface, would not have been noticeable.) But by using network monitoring tools, the root cause of the performance problem was discovered.
This web-based application provides contracted medical providers outside Kaiser Permanente with information about medical procedure referrals and outside medical authorizations. The first production deployment site was reported to be providing unacceptable response times. These response times were initially measured in the 45-60 second range. We had tested this application within our testing environment, and the testing response times were very good. Clearly something was wrong.
Again, we were sent to the east coast for an on-site investigation. This was a site outside Kaiser Permanente, which required that we minimize the interference caused by our activities. The Network Sniffer was attached to monitor the client workstation, but no other tools were allowed. The Network Sniffer monitored and captured the client workstation network activity, which was saved for later (off-site) analysis.
The later analysis discovered that the web application displayed an average of 14-16 objects on each screen. The Network Sniffer showed that the browser object requests were being performed in parallel (as expected). However, it also showed that the application web server then returned the responses one at a time (single-threaded).
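One way to spot this pattern in a saved capture is to check whether any two server responses overlap in time. The sketch below assumes each response has been reduced to a hypothetical (start, end) interval in milliseconds; the function name is illustrative.

```python
def responses_overlap(intervals):
    """intervals: list of (start_ms, end_ms), one per server response.
    Return True if any two responses were in flight at the same time."""
    ordered = sorted(intervals)
    # After sorting by start time, any overlap shows up between neighbours.
    for (_, end_prev), (start_next, _) in zip(ordered, ordered[1:]):
        if start_next < end_prev:
            return True
    return False
```

If the requests were issued in parallel but this check returns False for the responses, the server is likely handing them back one at a time.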
Back at home, this was confirmed when we used our internal workstations to access the same web application server. We reported the issue to the web server configuration team, and the next day the production web server was returning the objects in parallel. This decreased the response time to the 15-20 second range. But response times of 15-20 seconds were still considered to be excessive for this application.
We discussed several other possibilities for decreasing these response times. A detailed application flow (and browser object-flow) was performed using the data and timings provided by our Browser Log utility. It listed the names of the objects, the time they were requested, the time it took for the server to return the object, and the order in which the objects were returned.
In reviewing this information, we noticed that many "static" objects were repeatedly sent from the web server, across the network to the workstation, even though they hadn't changed since the last display.
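Finding such objects in a browser log can be sketched as follows. The log layout is an assumption: each entry is a hypothetical (object name, content fingerprint) pair, where the fingerprint is anything that changes when the content changes (a size, a checksum, and so on).

```python
from collections import Counter


def repeated_static_objects(log):
    """log: iterable of (name, fingerprint) pairs, one per fetch.
    Return {name: fetch_count} for objects fetched more than once
    whose content never changed between fetches."""
    fetches = Counter()
    fingerprints = {}
    for name, fingerprint in log:
        fetches[name] += 1
        fingerprints.setdefault(name, set()).add(fingerprint)
    return {
        name: count
        for name, count in fetches.items()
        if count > 1 and len(fingerprints[name]) == 1
    }
```

Objects reported by this check are candidates for caching on the workstation; objects whose fingerprint changes between fetches are genuinely dynamic and are left out.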
The decision was made to cache these static (non-changing) objects on the local workstation, which eliminated the network and server delays for these objects. When we did this, many of the application pages saw their response times decrease to a consistent two to three seconds.
However, there were still two or three dynamic screens that took 14-17 seconds to display. The goal of the team was to get all the screens to five seconds or less.
The team then started to closely look at the reasons why the dynamic screens were slower than we wanted. We also took a close look at the physical placement of the eight to ten servers that served these dynamic pages.
What we noticed right away was that some of the servers were located in California, and some of them were located in Maryland. Most of the servers resided in Maryland. But the reverse-proxy server and the security authorization servers were in California.
To respond to each object request on these dynamic screens, the data had to traverse the national network several times, and each one-way traversal added 100 milliseconds to the object response time. (We used the Ping-Trace utility to measure the cross-nation response times.)
We estimated that processing each dynamic object (cross-country traversals plus server processing) took about 3 seconds. With 14-16 objects on each dynamic page, it was easy to see the source of the remaining delay.
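The estimate can be expressed as a rough model. The 100 millisecond one-way delay and the 3 seconds per object come from the text; the number of traversals per object and the number of objects processed at the same time were not reported, so the values used with these hypothetical functions are assumptions for illustration only.

```python
import math

ONE_WAY_MS = 100  # measured cross-country one-way delay


def cross_country_delay_ms(traversals, one_way_ms=ONE_WAY_MS):
    """Network delay added to one object by cross-country traversals."""
    return traversals * one_way_ms


def page_time_s(objects, per_object_s, parallel):
    """Rough page time if objects are processed `parallel` at a time."""
    return math.ceil(objects / parallel) * per_object_s
```

For example, assuming 15 objects at 3 seconds each, processed three at a time, `page_time_s(15, 3, 3)` gives 15 seconds, which is in the same range as the observed 14-17 seconds. Every traversal removed by co-locating the servers takes 100 milliseconds off each object.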
As a result, we moved the reverse-proxy and authorization servers to Maryland, after which all the response times for the application were well under five seconds.
"Listening" to the application network activity was critical to this problem determination and its eventual solution.
Kaiser Permanente has had a purchased surgery scheduling application system for more than a decade. Recently, to comply with the national medical information security mandate, we had to move the servers that were located (locally) in the hospital, to a more secure (remote) Datacenter environment.
As they moved these servers to the Datacenter, the response time of this surgery scheduling application rose to unacceptable levels. As a result, we were requested to perform an on-site investigation.
As we have in so many other on-site investigations, we set up the Network Sniffer to monitor the client workstation and used the Ping-Trace utility to monitor the network latency between the workstation and the application server.
We immediately saw (and measured) the application's slow response times. We monitored the network for about two hours. There were ten network hops to the server and the maximum network latency was only 23 milliseconds. The increased response times were not caused by high network latency.
With the Network Sniffer, we captured the network activity for the application. What it captured was Oracle database calls: hundreds of them for each screen change. More curious than the sheer volume of database requests was that more than 50% of the database calls responded with "Data not found". Could it be that more than 50% of the database requests returned no information? But "listening to the wire" doesn't lie.
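Measuring the share of empty replies in decoded traffic can be sketched as below. The reply strings and the function name are hypothetical; the marker text is the one quoted above, and a real decode may word it differently.

```python
def empty_reply_ratio(replies, marker="Data not found"):
    """replies: list of decoded database reply texts.
    Return the fraction of replies that contain the marker."""
    if not replies:
        return 0.0
    empty = sum(1 for reply in replies if marker in reply)
    return empty / len(replies)
```

A ratio above 0.5 across many screen changes, as seen here, indicates that most round trips to the database return nothing the user needs.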
An investigation found that this purchased application was initially coded for much more functionality than we were using, and that these ineffective database requests were the way that the application determined which of its functions were in use and which were not. There was no way to get around this "network chatty" application code.
Using this methodology, we were able to determine the root cause of these poor response times. Final resolution of this issue is still under investigation.
These four real-life application examples have shown that "listening to the wire" can be as effective today as it was in the past, with very productive results.
All that is required to make use of this problem determination strategy are three simple (relatively inexpensive) tools.
And finally, you need a consistent methodology of applying (using) these tools to obtain maximum effect.
The author hopes that this performance problem determination strategy will be adopted by other organizations to investigate (and hopefully resolve) their own application performance issues.