Underutilization Epidemic Caused by Distributed Systems -- Be the Cure!
By HuyAnh Ngo, IBM Corporation, and
Kaleb Walton, Security Inspection, Inc.
When a traditional relational database design doesn't scale, datacenters install an expensive 50-node Hadoop cluster, only to watch it sit idle on nights and weekends. A distributed kernel, the public cloud and improved design practices can come as close as possible to solving this underutilization.
Underutilization Epidemic Caused by Distributed Systems
Last year you finally realized your traditional relational database design didn't scale. Your sharding¹ approach eliminated your ability to mine your entire data set without complicated federation. This year you give yourself a solid pat on the back for springing for a new 50-node Hadoop cluster to replace your database.
However, as you take a closer look at the utilization across the cluster you notice the same trends as your traditional database: spikes of activity during weekdays and idling at night and on weekends. You get a sinking feeling as you notice similar patterns on your other application servers. You've spent hundreds of thousands of dollars on a cluster that spends half of its time doing nothing, and your legacy infrastructure has been suffering from this problem for years!
Underutilization of computing resources has been a nagging problem since the dawn of computing. However, with the new age of internet scale computing bringing about machine clusters that are reaching into the hundreds of thousands of nodes, the problem of underutilization has become an epidemic.
Although perfect resource utilization is unattainable, let's explore some tools and techniques that will get us as close as possible.
Traditional Systems Design Falls Short
To fully appreciate where we find ourselves at the end of this journey, let's start at the beginning and walk through an evolution of systems design patterns.
Centralized System
In the Centralized System pattern, one application performs all service functions on one node (often with multiple nodes running in a highly available, load-balanced configuration). If overutilization occurs, a bigger machine must be provisioned (referred to as scaling up). System resources sit idle and underutilized during off-peak usage times.
Private Distributed System
In a Private Distributed System the application has been separated into a controller, which is responsible for scheduling the work to be done, and a worker which is responsible for performing tasks that accomplish the work. The controller is usually deployed as highly available and load balanced on multiple nodes, with workers running on multiple nodes as well.
If overutilization occurs you would usually increase the number of worker nodes (referred to as scaling out). System resources still sit idle and underutilized during off-peak usage times, and the situation is even worse than in a centralized system because more machines are being underutilized.
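The controller/worker split can be sketched in a few lines. This is a minimal, illustrative model (the names and the thread-based worker pool are assumptions, not any particular framework's API): the controller only schedules work onto a queue, and workers only perform it.

```python
import queue
import threading

def run_jobs(jobs, num_workers=4):
    """Controller: schedule jobs onto a queue; workers: perform them."""
    work_queue = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = work_queue.get()
            if job is None:              # poison pill: shut this worker down
                work_queue.task_done()
                return
            result = job()               # perform the task
            with lock:
                results.append(result)
            work_queue.task_done()

    # "Scaling out" here is simply raising num_workers.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:                     # the controller schedules the work
        work_queue.put(job)
    for _ in threads:                    # one pill per worker
        work_queue.put(None)
    work_queue.join()
    for t in threads:
        t.join()
    return results

print(sorted(run_jobs([lambda n=n: n * n for n in range(5)])))
```

Note that the worker threads exist whether or not jobs are queued, which is exactly the idle-capacity problem described above.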
Shared Distributed System
The Shared Distributed System pattern comes close to an optimal solution. In this pattern, controllers and workers from multiple distributed applications share the same resources. As in the Private Distributed System, if overutilization occurs you just scale out. Although idle time is still experienced, it is reduced because the multiple distributed applications bring a wider variety of usage patterns.
Resource contention is introduced as controller and worker processes fight for the same limited amount of shared resource space on each node, and managing that contention becomes a complicated exercise. Because of the resource contention problem, the Shared Distributed System pattern is impractical at scale.
No Longer Just a Dream
In the system design patterns above, each application runs on a node comprised of a centralized operating system kernel responsible for managing each application's requests for the resources on a single machine. Each application is written with the notion that it will execute under a centralized kernel with a limited number of CPUs, RAM and disk available for processing.
Imagine if you could write your application against a kernel that had nearly unlimited CPUs, RAM and disk available for processing. Sounds like a dream, right? In recent years a set of tools has been open sourced and brought into mainstream development that makes this dream a reality, enabling a Distributed Kernel pattern of system design.
In a Distributed Kernel environment each application is written against a kernel that manages resource requests with a pool of resources limited only by the size of the machine cluster. Every application is then executed from within the distributed kernel runtime environment rather than being spread out to many centralized kernels on multiple machines. Resource contention is addressed because there is a nearly unlimited pool of resources for applications to request. Overutilization is addressed by scaling out with generic resource nodes, which is simpler than in other distributed systems where you must add nodes specialized for the work being performed. Underutilization is much easier to manage since it is visible through a single distributed kernel; however, long-running processes still take up resources and are often idling.
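The key ideas above, one cluster-wide resource pool, generic nodes that simply grow it, and allocation requests made against the whole cluster, can be modeled in a toy class. This is a sketch for intuition only; the class and method names are illustrative and do not reflect any real scheduler's API.

```python
class ClusterKernel:
    """Toy model of a distributed kernel's single, cluster-wide resource pool."""

    def __init__(self):
        self.cpus = 0
        self.ram_gb = 0

    def add_node(self, cpus, ram_gb):
        # Scaling out: a generic node simply grows the shared pool.
        self.cpus += cpus
        self.ram_gb += ram_gb

    def try_allocate(self, cpus, ram_gb):
        # Applications request resources from one allocator, not
        # from per-machine kernels.
        if cpus <= self.cpus and ram_gb <= self.ram_gb:
            self.cpus -= cpus
            self.ram_gb -= ram_gb
            return True
        return False        # not enough capacity right now

    def release(self, cpus, ram_gb):
        # Short-lived tasks return resources when they finish.
        self.cpus += cpus
        self.ram_gb += ram_gb

kernel = ClusterKernel()
kernel.add_node(cpus=8, ram_gb=32)
kernel.add_node(cpus=8, ram_gb=32)
print(kernel.try_allocate(cpus=12, ram_gb=40))   # fits in the pooled capacity
```

Note that the 12-CPU request succeeds even though no single 12-CPU node exists; the pool is cluster-wide, which is precisely what the per-machine kernels in the earlier patterns cannot offer.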
Although it is much more manageable, idle time is still experienced in a Distributed Kernel - but there is more we can do!
Optimize for Distribution
To really maximize the utilization of the distributed kernel you need to rethink how you design your software. Whereas long-running processes, or 'daemons', reserve resources for future use, short-running processes, or 'tasks', ask for resources when they need them and release them when they are finished. Tasks execute very efficiently within a distributed kernel.
Take for example a web application container. Typically it is a daemon that receives client requests and executes request handling functions that perform any number of operations against databases, web services and so on to render a web page. When requests aren't being served the daemon is just taking up system resources that could be used by other applications.
Knowing that we are working within a distributed kernel environment, we can take a different approach to architecting a web application container. Although a daemon is still required to listen for client connections, the request handling can be delegated to tasks executed within the distributed kernel. The daemon then concerns itself only with managing client connections and returning results from request-handling tasks, which uses significantly fewer resources than before and leaves the rest available for other tasks.
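The thin-daemon shape looks roughly like this. In this sketch a thread-pool executor stands in for the distributed kernel, and the handler name is made up for illustration; the point is only the division of labor, in which the daemon accepts requests and returns results, while the actual handling runs as short-lived tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def render_page(path):
    # A request-handling "task": in a real system this would query
    # databases and web services to build the response.
    return f"<html>{path}</html>"

class ThinDaemon:
    """Daemon that delegates request handling instead of doing it inline."""

    def __init__(self, kernel):
        self.kernel = kernel             # stand-in for the distributed kernel

    def handle(self, path):
        future = self.kernel.submit(render_page, path)  # delegate as a task
        return future.result()           # return the task's result to the client

with ThreadPoolExecutor(max_workers=8) as kernel:
    daemon = ThinDaemon(kernel)
    print(daemon.handle("/home"))        # → <html>/home</html>
```

The daemon itself holds almost no state and does almost no work; all the resource-hungry processing happens in tasks whose resources are released as soon as each request completes.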
A web application container is a trivial example of how tasks can be pulled out of a daemon; the same technique can be applied to any application. Once you have pulled most of your application's processing functions out into tasks, you will be operating nearly as efficiently as possible. However, even in this situation you will experience underutilization during off-peak usage times!
Cloud to the Rescue
Before the existence of public clouds, there really wasn't any way to deal with the problem of peaks and valleys in utilization. Technically, as cumbersome as it might be, you could power down physical hardware during off-peak usage times. However, that only saves on bandwidth and energy, since you have already paid the fixed cost for the hardware itself.
In today's public cloud environments you can provision and de-provision virtual machines on demand. To tackle the problem of underutilization you can add machines during the day when utilization is high, and remove them on nights and weekends when utilization is low. Couple this with a distributed kernel, which simplifies your infrastructure down to a single cluster, and you have a full handle on your utilization!
Because your machines are virtual, provisioning and de-provisioning them is practical. The cost model is also viable: when you de-provision virtual machines you stop paying not only for the bandwidth and energy, but for the hardware itself.
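The add-by-day, remove-by-night policy reduces to a simple scaling rule. The thresholds below are assumptions chosen for illustration, not values from any cloud provider's API; a real deployment would tune them and call the provider's provisioning interface.

```python
def target_node_count(current_nodes, utilization, low=0.3, high=0.8, min_nodes=2):
    """Illustrative autoscaling rule: grow when busy, shrink when idle.

    utilization is the cluster-wide fraction of resources in use (0.0-1.0);
    the low/high thresholds and minimum cluster size are assumed values.
    """
    if utilization > high:
        return current_nodes + 1       # peak hours: provision another VM
    if utilization < low and current_nodes > min_nodes:
        return current_nodes - 1       # nights/weekends: de-provision, stop paying
    return current_nodes               # within the comfortable band: hold steady

print(target_node_count(10, 0.9))      # busy weekday
print(target_node_count(10, 0.1))      # idle weekend
```

Run periodically against the distributed kernel's single utilization figure, this loop keeps paid-for capacity tracking actual demand rather than the worst-case peak.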
Be the Cure
As an industry we have applied centralized system designs to tackle our internet scale problems, resulting in an epidemic of underutilization. “Be the cure” by investing in a distributed kernel and following design practices that take full advantage of the new computing environment. Your developers and system administrators will thank you and so will your pocketbook.
¹Horizontal partitioning is a database design principle where rows of a database table are held separately, rather than being split into columns. Each partition forms part of a shard, which may be located on a separate database server or in a separate physical location.