Author by Claire Cates, SAS
Much of my career has been centered on memory management. I’ve learned that many developers don’t understand some of the basic underlying memory concepts which can affect application performance. Therefore, I will try to share some of my knowledge via a series of short articles. I hope you enjoy and feel free to ask any questions. My work is strictly in the “C” area, but almost all of what I will discuss can be applied across all varieties of machine and languages. My article will be on memory caches.
Over time, CPUs have become much faster; yet, the time to access main memory has not kept up. Instead, hardware designers have been including more and more caches closer to the CPU in order to boost memory access performance. Unless your application is designed with memory caches in mind, your application will suffer performance problems. This first post will define common memory cache terms.
Memory Cache Terms
The memory cache is a piece of fast access memory that is located closer to the CPU than main memory. The caches were created to improve performance when frequently accessed pieces of memory are used. Since the caches are closer to the CPU, the latency to obtain the memory is reduced.
Latency – the amount of time that the CPU has to wait “stall” for the memory to be loaded from either the cache or main memory.
There may be multiple levels of memory caches in the system. The Level 1 or L1 cache is the closest to the CPU and is typically integrated into the CPU. The L2 cache is further from the CPU and therefore slower to access but it still has access times that are superior to main memory; furthermore, the L2 cache is larger than the L1 cache. Many systems contain another level of cache, the L3 cache.
Below is a diagram showing a dual processor with a shared L2 cache. Notice that the memory in the L1 cache is copied thru the L2 cache and is in main memory.
To give you an example of what some cache sizes are, I logged into lax94t02, one of our Linux machines, and found that it has an L1 cache size of 32K, an L2 cache size of 256K and an L3 cache size of 12288K. Do note that the number of cache levels and the size of caches will change across different hardware systems.
As your program executes and the CPU needs to access memory, it will first look into the L1 cache to see if the memory is available. If the memory is found in the cache it is a cache hit. If the memory is not found, it is a cache miss and the CPU will look in the L2 cache and any other cache levels until the memory is located. After exhausting all caches, the CPU will go to main memory to access the piece of memory. As it accesses the memory, the memory will be copied into all the caches between where it was found and the CPU. Cache misses are costly in with respect to performance.
When memory is accessed it is accessed in a cache line. A cache line is typically 64 or 128 bytes. If the CPU only needs 4 bytes of memory and the cache line is 64 bytes long, the full 64 bytes, aligned on a cache boundary and containing the needed 4 bytes, will be placed into the cache. If the 4 bytes spans a cache line, then 2 cache lines will be placed in the cache.
As the program continues to run, the caches will become full. When the cache is full and the CPU needs to place another cache line in the cache, a cache line will need to be evicted from the cache. How the CPU determines which cache line will be evicted is called the replacement policy. One of the most common replacement policies is Least Recently Used (LRU). The oldest cache line in the cache will be evicted to make space for the new cache line. When a cache line is evicted from the L1 cache it will probably remain in the L2 cache since the L2 cache is much larger.
When data is written to memory, the cache line containing the memory will be updated in the L1 cache and the cache line will be marked dirty. A dirty cache line will need to eventually be written back to main memory. How and when that memory is written to main memory is the write-back policy. If no other CPU is using the same cache line, main memory typically will not be updated with the new data until the cache line is evicted. If another CPU has a private copy of the same cache line, the data must be written back to main memory and updated in all private copies of the cache line that exist. The process of making sure all private copies of a cache line for all threads have updated “correct” content is called Cache Coherence.
Another term you may have heard is memory bandwidth. Each access from the CPU to main memory basically flows thru a pipe and multiple CPUs will share this pipe. If too much memory is flowing “at the same time” thru the pipe to the cores, then the pipe becomes saturated and this will further increase the latency (time) to access memory. The possibility of bandwidth saturation increases as more and more cores access main memory.
Besides the memory cache, there are two other memory caches in a system.
Translation Lookaside Buffer cache (TLB) – The translation look-aside buffer is used to speed up the virtual to physical address translation. Applications do not access physical memory directly; instead, the memory is accessed virtually. The virtual memory is allocated in pages which are fixed size blocks (pages) and are mapped to areas of physical memory. The page table is the translation from the virtual address to the physical address of the memory. The page size is the fixed block size. The TLB cache contains the recent mapping from the page table. If a TLB miss occurs, the page table must be searched and this can be a slow process.
The Instruction Cache - As instructions are executed they need to be loaded into the CPU. The instructions will be loaded into the instruction cache. Since code contains many loops, function calls in loops, and other constructs that cause the same instructions to be executed multiple times, storing the instructions in an instruction cache can speed up the execution as the instructions are often reused.
One final cache concept I would like to introduce is prefetching. The system the application is running on may have hardware prefetching turned on. When prefetching is turned on, the CPU will guess at what memory is going to be accessed next and attempt to load that memory into the cache so that it will be available when the CPU needs to access the memory. Prefetching often works well and can save latency since the data will already be in the cache. When the prefetcher guesses incorrectly, it can be detrimental. Not only will cache lines be evicted that might have been reused, but the unused prefetched cache lines will consume valuable bandwidth for no gain.