MeasureIT

A Case Where Disabling Intel® Hyper-Threading Solved a Deadlock Problem
June, 2004
by Ron Kaminski

About the Author
Ron Kaminski, Safeway Inc.

Ron has been a capacity planner and performance analyst since the mid 1980s, on probably every platform you can name besides a mainframe. A dedicated workload characterization junkie, Ron enjoys using multiple vendor and "home-grown" tools to collect, reduce, analyze, display and manage large-scale performance and capacity planning, as well as sharing ideas with fellow capacity planners and performance analysts.


At CMG 2003, I attended several great sessions that dealt with Hyper-Threading (HT), Intel’s method of letting a single physical processor execute instructions from two threads at once by presenting itself to the operating system as two logical processors. A single thread often leaves execution resources idle, for example while it waits on memory, so the additional virtual processors let you use some of that otherwise wasted power. It sounds like a great idea, and in many cases it is, but you do hear of a few cases now and then where problems are solved by disabling HT.

From a measurement perspective, HT is a black box. At the very least, we, as performance analysts, need to understand what clues might lead us to suspect HT-related performance issues. Here is a real-world example of what recently happened to us, which I hope will help others.

The problem

The application used a SQL Server 2000 database and there were dozens of simultaneous users. No issues were found on the old development (non-HT) test machine, and a modern four-chip HT-enabled machine was purchased to handle the expected load. Since HT was enabled, this machine appeared to all my performance tools as an eight-processor machine. As loads increased, performance took a serious nose-dive and SQL Server deadlock errors started to appear. Suddenly, there were many people investigating ways to cope with this serious problem. Here are the clues we noticed:

Clue 1: Your application has many simultaneous users potentially updating a common data store. As you might expect, HT seldom causes problems if the work to be done is diverse and unrelated. If you imagine your home PC, it might be checking mail, scanning for viruses, checking spelling in a Word document and painting a web page all at the same time. These unrelated tasks would really benefit from the extra parallelism that HT’s additional virtual processors provide. In our case, we had many users running a homogeneous workload where they were all updating the same database tables.

Clue 2: Even during periods of the worst performance, you never use more CPU than the physical (non-HT) processors alone could supply. When performance slows, people often ring up the capacity planners to see if they need more hardware. When we examined their machine, we never saw any hour in which CPU consumption exceeded what the four physical chips could deliver without HT. IOs were minimal, and even during the worst hours hardware resources were plentiful and we saw no application threads in wait states. We asked if they were seeing locking issues, as we have often encountered lock-based performance issues on machines with both performance problems and lots of free resources. There were lock problems, specifically deadlocks.
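The check behind Clue 2 can be sketched in a few lines of Python. This is only an illustration: the hourly figures are invented, the function names are hypothetical, and it assumes your tool reports a busy percentage per logical processor. The idea is to add up the busy percentages, express the total as "CPUs' worth" of work, and compare that with the physical chip count rather than with the logical count that HT presents.

```python
from typing import List, Sequence


def busy_cpu_equivalents(per_cpu_busy_pct: Sequence[float]) -> float:
    """Convert per-logical-CPU busy percentages (0-100) into CPUs' worth of work."""
    return sum(per_cpu_busy_pct) / 100.0


def hours_exceeding_physical(
    hourly_samples: Sequence[Sequence[float]], physical_cpus: int
) -> List[int]:
    """Return the indexes of hours whose busy CPU exceeded the physical chip count."""
    return [
        hour
        for hour, sample in enumerate(hourly_samples)
        if busy_cpu_equivalents(sample) > physical_cpus
    ]


# Hypothetical hourly averages for 8 logical processors (4 physical chips, HT on).
hourly_samples = [
    [40, 35, 30, 45, 38, 42, 36, 34],  # 3.0 CPUs busy
    [50, 48, 47, 49, 46, 50, 45, 45],  # 3.8 CPUs busy
    [20, 25, 15, 30, 22, 18, 27, 23],  # 1.8 CPUs busy
]

# An empty list means no hour ever needed more than the physical chips.
print(hours_exceeding_physical(hourly_samples, physical_cpus=4))  # prints []
```

An empty result alongside poor response times is the pattern described above: the machine is not short of CPU, so something else, such as locking, is holding the work back.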

Clue 3: You have deadlocks. The specific error we saw was:

    SQL State:           40001
    SQL Error Message:
    [Microsoft][ODBC SQL Server Driver][SQL Server]Transaction (Process ID 149) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

The error message told us that transactions executing in parallel were each holding locks on portions of the database that another one needed to update in order to continue processing. None of them could proceed, so SQL Server ultimately terminated one of the waiting transactions as the deadlock victim.
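For readers who have not met deadlocks before, the situation can be pictured as a "wait-for" graph in which each blocked process points at the process holding the lock it needs. The Python sketch below is a simplified illustration of that idea, not SQL Server's actual detection algorithm. Process ID 149 comes from the error message above; process ID 152 and the function name are made up for the example, and the model assumes each process waits on at most one other.

```python
from typing import Dict, List, Optional


def find_deadlock_cycle(waits_for: Dict[int, int]) -> Optional[List[int]]:
    """Return the process IDs that form a wait cycle, or None if there is none.

    waits_for maps a blocked process ID to the process ID that holds the
    lock it needs (simplified: one outstanding wait per process).
    """
    for start in waits_for:
        path: List[int] = []
        seen = set()
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            # We arrived back at a process already on the path: a cycle.
            return path[path.index(current):]
    return None


# 149 waits for a lock held by 152, and 152 waits for a lock held by 149.
print(find_deadlock_cycle({149: 152, 152: 149}))  # prints [149, 152]

# A plain queue of waiters with no cycle is only lock contention, not a deadlock.
print(find_deadlock_cycle({1: 2, 2: 3}))  # prints None
```

Once a cycle like this exists, waiting longer cannot help. One member has to be rolled back, which is why the message asks you to rerun the transaction.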

Clue 4: You are the first HT user of the application. When we contacted the vendor, they said that other users with similar user volumes had not encountered this problem on non-HT machines. Since HT processors are just starting to appear in large numbers, you may be the pioneer. Remember, we didn’t see the problem on our non-HT development machine, either.

Clue 5: Reducing parallelism didn’t help. The first thing we tried to fight the deadlock problem was to reduce SQL Server’s parallelism, i.e. the number of parallel threads it allows. This had no effect: performance during peak periods was still awful, and deadlock errors continued to occur at the same rate. Meanwhile, the users of the application were unhappy and we still did not have a solution for them.

The solution!

This application was set up to fail over quickly to another duplicate machine in the event of machine failure, with the disks migrating to the live machine via SAN/NAS trickery. User tempers were flaring, and many reasonable suggestions had been tried and fallen short. We had the clues mentioned above, and we’d heard vague stories at CMG 2003 of people who had to disable HT to fix an application problem. So, we suggested that they disable HT on the dormant box, and then switch processing over to it. In the worst case, it would not solve the locks, but we would be able to detect CPU wait with the tools that we have in place on each machine, and we could quickly fail back to the present machine if performance was worse. In the best case, the deadlocks would disappear and we would have better performance.

Desperation is a great motivator, so this change request was processed in record time. Our technical staff disabled HT and switched processing to the new box. No application changes were made.

Happily, it worked! The deadlocks virtually disappeared on the new machine. User-perceived performance improved dramatically. Of course, no one bothered to tell the capacity planners that things had worked out, so we sweated it out for hours, wondering if our idea helped or just delayed the resolution even longer. One thing that we did notice on the new machine was that the number of disk IOs tripled, which is usually a great sign when you are fighting locks. Subsequently, we were able to confirm that with HT disabled, the number of deadlocks dropped dramatically. Even though now there were only four (physical) CPUs handling the same workload, this I/O-bound application’s response times were dramatically improved.

At this point, wise heads might wonder if we really had a problem with HT causing threads to queue up waiting for the database locks, or just too many parallel streams for the way that database was designed. As luck would have it, very soon after performance improved, the application folks added four more identical processors (with HT disabled) to the machines for capacity reasons, and the deadlocking problem did not recur. Since we went from four identical actual chips appearing as eight (with problems) to eight identical actual chips appearing as eight (with no problems), it is probably fair to assume that our performance problem was directly related to HT having some kind of unfortunate impact on the database transaction locking. Although we survived this crisis, we never were able to determine why HT alone seemed to have caused such serious scalability issues in this particular instance. This is a scary thought. It means we really don’t know how to avoid running into similar scalability problems in the future.
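The reasoning in the paragraph above can be laid out as a small Python sketch. It records the three configurations described in this article and asks which candidate explanation matches the outcomes. The class and function names are hypothetical, and three observations are far too few to prove causation, so treat this as a way of organizing the clues rather than as a conclusion.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Config:
    physical_chips: int
    ht_enabled: bool
    deadlocks: bool  # outcome reported in the article

    @property
    def logical_cpus(self) -> int:
        # With HT on, each physical chip appears as two processors.
        return self.physical_chips * (2 if self.ht_enabled else 1)


observations = [
    Config(physical_chips=4, ht_enabled=True, deadlocks=True),    # original setup
    Config(physical_chips=4, ht_enabled=False, deadlocks=False),  # HT disabled
    Config(physical_chips=8, ht_enabled=False, deadlocks=False),  # chips added
]


def explains(predicate: Callable[[Config], bool]) -> bool:
    """True if the predicate holds exactly for the configurations that deadlocked."""
    return all(predicate(c) == c.deadlocks for c in observations)


# "Eight visible processors are too many parallel streams" does not fit,
# because eight real chips ran without the problem.
print(explains(lambda c: c.logical_cpus >= 8))  # prints False

# "HT enabled" is consistent with all three observations.
print(explains(lambda c: c.ht_enabled))  # prints True
```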

Hopefully, this will be the last time we have to tangle with this type of problem. But should it recur, we would certainly try to look deeper at the database design, specifically with the locks and waiting thread information that you can see in real-time in the SQL Server Enterprise Manager displays (and at the same information on locking delays in the Performance Monitor counters for SQL Server). We also hope to discover some technical insight into what HT does that could cause this. For instance, does anyone know coding techniques that can either induce or counteract these effects? If possible, we might even try a few more variations of Hyper-Threading on and off, if the users were amenable, and we observed a similar problem on a development machine during application stress testing.

If you are like me, when experts that you respect mention downloading and trying complex tools that you’ve never heard of and wading through miles of trace data to resolve a pressing problem, you smile pleasantly, nod and think of what you are really going to do. Often you will get recommendations that assume that you have physical access to the machine, or that you are the administrator, or that you have access to and understand the source code, or that you have weeks of time available. In most cases at large firms, you have none of the above. Still, you can make use of the tools that you do have to infer what is really going on. Remember how we used simple tools and noticed that CPU usage never was greater than the physical chip count, and how IOs increased dramatically when the lock problem was solved? Use the tools that you are comfortable with in indirect ways, keep up with the current research via CMG, and be suspicious.

In Summary

We should stress that in many cases HT is virtually (pun intended) a free lunch, and it works great. Also, not all application designs are prone to this specific HT-sensitive pathology. However, if you have some of the clues mentioned above, performance is lousy, and you never use more than the basic non-HT processing power that you have, consider disabling HT. You may be glad that you did, and until applications (or future versions of SQL Server) are rewritten to lock in ways less prone to these problems, your firm can save a lot of money by buying less expensive non-HT processors.

Also, don’t be frightened by the complexity you may encounter in figuring these things out. There is often great documentation just a click away, or simple indirect ways to be successful. I found this link through a Google™ search, and it and the other links mentioned within it helped me understand the SQL Server deadlock messages that I was seeing.

Call to Action

If you find another scenario where disabling HT helped performance, or better yet, code examples that can demonstrate the effect, help your fellow analysts and tell us about your clues in an article for MeasureIT or a CMG paper!


Related article:
CMG 2003 Trip Report by Ron Kaminski

 

Last Updated 06/05/09



Computer Measurement Group