The Complete Memory Hierarchy: From Semiconductor RAM to Virtual Memory
The performance gap between processors and main memory has been widening for decades. Modern processors can execute billions of instructions per second, but accessing data from main memory takes hundreds of processor cycles. This disparity creates a fundamental bottleneck that would cripple system performance if left unaddressed. Caches emerge as the elegant solution to this problem, acting as high-speed memory buffers that store frequently accessed data close to the processor. Understanding how caches work is essential for anyone seeking to comprehend modern computer architecture or optimize software performance.
The Memory Hierarchy and Where Caches Fit
Computer systems organize memory into a hierarchy based on two competing factors: speed and cost. At the top of this hierarchy sit the processor registers, which are the fastest storage locations but extremely limited in number. At the bottom lies secondary storage like hard drives or SSDs, offering vast capacity but painfully slow access times. Between these extremes, we find main memory (RAM), which provides reasonable capacity at moderate speed but still cannot keep pace with modern processors.
Caches occupy the critical middle ground between registers and main memory. They are built using the same fast SRAM (Static Random Access Memory) technology as registers but in larger quantities. Main memory, in contrast, uses DRAM (Dynamic Random Access Memory), which is slower but cheaper and denser. The memory hierarchy typically includes multiple levels of cache, each progressively larger and slower as we move away from the processor. This hierarchical organization exploits the principle of locality: programs tend to access a relatively small portion of their address space at any given time.
The principle of temporal locality states that recently accessed memory locations are likely to be accessed again soon. Spatial locality suggests that memory locations near recently accessed addresses are also likely to be accessed. Caches leverage both forms of locality by storing not just individual bytes but entire blocks of contiguous memory. When a processor needs data, it first checks the fastest cache level. If the data resides there, we have a cache hit and the processor continues without delay. If not, we suffer a cache miss and must search the next level of the hierarchy, ultimately reaching main memory if necessary.
Memory Hierarchy (fastest to slowest):
Processor Core
|
v
+---------+ <-- Registers (few bytes, < 1 cycle)
| CPU |
| Core |
+---------+
|
v
+---------+
| L1 | <-- Level 1 Cache (32-64 KB, 1-3 cycles)
| Cache | Split: I-Cache (instructions) + D-Cache (data)
+---------+
|
v
+---------+
| L2 | <-- Level 2 Cache (256 KB - 1 MB, 10-20 cycles)
| Cache | Usually unified (instructions + data)
+---------+
|
v
+---------+
| L3 | <-- Level 3 Cache (8-32 MB, 30-70 cycles)
| Cache | Shared across cores
+---------+
|
v
+---------+
| Main | <-- DRAM (4-64 GB, 100-300 cycles)
| Memory |
+---------+
|
v
+---------+
|Secondary| <-- SSD/HDD (TB, millions of cycles)
| Storage |
+---------+
The Cache Hierarchy: L1, L2, and L3
Modern processors typically implement three levels of cache, each serving a distinct purpose in the performance optimization strategy. The Level 1 (L1) cache sits closest to the processor core and is the smallest and fastest. Most architectures split L1 into two separate caches: the instruction cache (I-Cache) holds program code, while the data cache (D-Cache) stores program data. This Harvard architecture split allows the processor to fetch instructions and data simultaneously without conflict. A typical L1 cache might be 32 or 64 kilobytes per core, accessible in just one to three processor cycles.
The Level 2 (L2) cache is larger but slower than L1, typically ranging from 256 kilobytes to 1 megabyte per core. Unlike L1, the L2 cache is usually unified, meaning it stores both instructions and data without distinction. When the processor misses in L1, it checks L2 before resorting to the even slower L3 cache. The access time for L2 might be ten to twenty cycles, still far better than accessing main memory but noticeably slower than L1. This intermediate level provides a critical buffer that catches many accesses that miss in the tiny L1 cache.
The Level 3 (L3) cache is the largest and slowest of the on-chip caches, often ranging from 8 to 32 megabytes or more. In multi-core processors, L3 is typically shared among all cores, serving as a common pool of fast memory that facilitates inter-core communication and reduces redundant storage. Accessing L3 might take thirty to seventy cycles, but this is still dramatically faster than the hundred to three hundred cycles required to fetch data from main memory. The large capacity of L3 means it can hold substantial portions of a program's working set, significantly reducing the frequency of expensive main memory accesses.
The Cost of Building Caches
The decision to implement caches and their specific configurations involves careful economic and engineering tradeoffs. SRAM, the technology underlying caches, requires six transistors per bit of storage, compared to just one transistor and one capacitor per bit for DRAM. This means SRAM occupies roughly six times the silicon area of DRAM for the same capacity. Silicon area directly translates to manufacturing cost, making large SRAM caches prohibitively expensive. Additionally, SRAM consumes more power than DRAM, and since caches must be fast, they often run at or near the processor's clock frequency, further increasing power consumption and heat generation.
The physical distance between the processor and the cache also matters tremendously. Even signals traveling at nearly the speed of light take measurable time to traverse the millimeters of silicon on a modern chip. This is why L1 caches are tiny and placed immediately adjacent to the processor core, while L3 caches can be larger because they are allowed to be farther away and slower. Every increase in cache size represents a tradeoff: more storage capacity means more silicon area, more power consumption, potentially longer access times due to increased complexity, and higher manufacturing costs.
These economic realities explain why we cannot simply make enormous caches to eliminate the memory hierarchy problem. A processor might have just 128 kilobytes of L1 cache total (split between instructions and data), perhaps 1 megabyte of L2, and 16 megabytes of L3. Meanwhile, main memory provides 16 gigabytes or more. The ratio between cache and main memory capacity is enormous, yet that tiny cache handles the vast majority of memory accesses because of locality. The art of cache design lies in finding the sweet spot where size, speed, cost, and power consumption balance to deliver optimal performance for typical workloads.
Cache Organization: Lines, Words, and Bytes
To understand how caches work, we must first understand their basic organizational units. The fundamental unit of cache storage is the cache line, also called a cache block. A cache line is not a single byte or word but rather a contiguous chunk of memory, typically 64 bytes in modern systems. When the processor needs a single byte from memory, the entire cache line containing that byte is fetched and stored in the cache. This design exploits spatial locality: if the program needs one byte, it probably needs the adjacent bytes soon.
Each cache line contains multiple words, and each word contains multiple bytes. The definition of a word is architecture-specific, but in modern systems, a word is typically 4 bytes (32 bits) or 8 bytes (64 bits). Consider a cache line of 64 bytes on a 64-bit architecture where a word is 8 bytes. This line contains eight words, and each word contains eight bytes. The internal structure looks like this:
Cache Line (64 bytes total):
+-------+-------+-------+-------+-------+-------+-------+-------+
| Word0 | Word1 | Word2 | Word3 | Word4 | Word5 | Word6 | Word7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 8B | 8B | 8B | 8B | 8B | 8B | 8B | 8B |
Each Word (8 bytes):
Word = Byte0 + Byte1 + Byte2 + Byte3 + Byte4 + Byte5 + Byte6 + Byte7
The processor generates memory addresses to access specific bytes in main memory. These addresses must be decoded to locate data within the cache structure. The address breakdown depends on the cache's organizational parameters: the total cache size, the line size, and the associativity (which we will explore shortly). Understanding how addresses map to cache locations is fundamental to grasping cache behavior and performance.
Word-Addressable vs Byte-Addressable Memory
Memory addressing schemes come in two primary flavors: byte-addressable and word-addressable. This distinction affects how we interpret memory addresses and how we break them down to access cache contents. In a byte-addressable architecture, each individual byte has a unique address. Most modern architectures, including x86, ARM, and RISC-V, are byte-addressable. In a word-addressable architecture, each word has a unique address, and individual bytes within a word are accessed by specifying both the word address and a byte offset.
Consider a byte-addressable system with 8-byte words. Address 0 points to byte 0, address 1 points to byte 1, address 8 points to byte 8 (the first byte of the second word), and so on. Every byte has its own address from 0 to the maximum addressable memory. Now consider a word-addressable system with 8-byte words. Address 0 points to word 0 (bytes 0-7), address 1 points to word 1 (bytes 8-15), and so on. To access byte 5 in the word-addressable system, you would specify word address 0 and byte offset 5.
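The word-address arithmetic above is simple enough to express directly. The helper below is an illustrative sketch (the function name is ours, not part of any standard API), assuming 8-byte words as in the example:

```python
def to_word_address(byte_addr, word_size=8):
    """Convert a byte address into the (word address, byte offset) pair
    a word-addressable system would use. Assumes word_size is a power of two."""
    return byte_addr // word_size, byte_addr % word_size

# Byte 5 lives in word 0 at byte offset 5, as in the example above
print(to_word_address(5))  # (0, 5)
```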
Byte-addressable systems are more flexible and align better with modern software paradigms where programs manipulate individual bytes, characters, and variable-length data structures. However, they require slightly more complex address decoding. When we break down a byte address to access a cache, we must identify which line contains the byte, which word within that line, and which byte within that word. In a word-addressable system, the address directly identifies the word, and we only need to determine the cache line and word position within that line.
Let us examine a concrete example for a byte-addressable system with 64-byte cache lines and 8-byte words. Consider address 156 (in decimal). First, we determine which cache line contains this address by dividing by the line size: 156 ÷ 64 = 2 remainder 28. This address resides in cache line 2 (the third line, counting from 0), at byte offset 28 within that line. To find the word within the line, we divide the byte offset by the word size: 28 ÷ 8 = 3 remainder 4. This means the address points to word 3 within the line, specifically byte 4 within that word.
Address 156 breakdown (byte-addressable, 64-byte lines, 8-byte words):
Address: 156 (decimal) = 0x9C (hex) = 10011100 (binary)
156 ÷ 64 = 2 remainder 28
└─> Cache Line 2, Offset 28
Offset 28 ÷ 8 = 3 remainder 4
└─> Word 3, Byte 4
Visual representation:
Cache Line 2: [0-63] contains addresses 128-191
Byte Offset: 0 8 16 24 32 40 48 56
| | | | | | | |
v v v v v v v v
Line 2: [W0][W1][W2][W3][W4][W5][W6][W7]
^
|
Offset 28 = Word 3, Byte 4
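The two-step division used to decode address 156 generalizes to any byte address. The sketch below captures it; the function name and defaults (64-byte lines, 8-byte words, matching the example) are our own choices for illustration:

```python
def decode_byte_address(addr, line_size=64, word_size=8):
    """Split a byte address into (cache line number, word within line,
    byte within word). Assumes a byte-addressable system with power-of-two
    line and word sizes."""
    line = addr // line_size       # which cache line holds this byte
    offset = addr % line_size      # byte offset within that line
    word = offset // word_size     # which word within the line
    byte = offset % word_size      # which byte within that word
    return line, word, byte

# Address 156 from the example: cache line 2, word 3, byte 4
print(decode_byte_address(156))  # (2, 3, 4)
```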
Sets, Ways, and Generic Cache Structure
Caches are organized into sets and ways, a structure that determines how memory addresses map to cache locations. A set is a group of cache lines, and a way is one of multiple possible locations within a set where a particular memory block can reside. The number of ways in each set defines the cache's associativity. If each set contains only one way, we have a direct-mapped cache. If each set contains multiple ways, we have a set-associative cache. If the entire cache is a single set with many ways, we have a fully-associative cache.
Consider a generic cache with S sets and W ways per set. The total number of cache lines is S × W. Each cache line stores one block of memory along with metadata. The metadata includes a tag that identifies which memory block is currently stored in that line, a valid bit indicating whether the line contains valid data, and potentially other bits for cache coherence and replacement policy tracking. When the processor generates a memory address, we must determine which set to check and then compare the address's tag against all ways in that set to see if the data is present.
The address breakdown for a set-associative cache divides the address into three fields: the tag, the set index, and the block offset. The block offset identifies the specific byte within the cache line. The set index identifies which set to search. The tag distinguishes between different memory blocks that could map to the same set. Let us examine the structure of a generic 2-way set-associative cache:
Generic 2-Way Set-Associative Cache Structure:
Way 0 Way 1
+---+----+-------+ +---+----+-------+
Set 0: | V | Tag| Data | | V | Tag| Data |
+---+----+-------+ +---+----+-------+
Set 1: | V | Tag| Data | | V | Tag| Data |
+---+----+-------+ +---+----+-------+
Set 2: | V | Tag| Data | | V | Tag| Data |
+---+----+-------+ +---+----+-------+
Set 3: | V | Tag| Data | | V | Tag| Data |
+---+----+-------+ +---+----+-------+
...
V = Valid bit (1 bit)
Tag = Tag field (variable bits)
Data = Cache line data (typically 64 bytes)
Memory Address Breakdown:
+---------+-----------+--------------+
| Tag | Set Index | Block Offset |
+---------+-----------+--------------+
| | |
| | └─> Which byte in the line?
| └─> Which set to check?
└─> Which specific memory block?
When accessing the cache, the processor uses the set index to select one set, then simultaneously compares the tag field of the address against the tags in all ways of that set. If any way's tag matches and its valid bit is set, we have a cache hit and retrieve data from that way's data field using the block offset. If no tags match, we have a cache miss and must fetch the data from the next level of memory hierarchy. The choice of how many sets and ways to implement represents a fundamental design decision that affects both performance and hardware complexity.
Direct-Mapped Caches: One Way Per Set
A direct-mapped cache represents the simplest mapping strategy. Each set contains exactly one way, meaning each memory block maps to exactly one location in the cache. For a direct-mapped cache with N lines, we have N sets and 1 way per set. This design offers the advantage of simplicity: determining whether a block is in the cache requires checking only one location, making the hardware fast and inexpensive. However, it suffers from high conflict miss rates when multiple frequently accessed memory blocks map to the same cache line.
The address breakdown for a direct-mapped cache is straightforward. We divide the address into three parts: the tag, the line index (since each set has one line, we can call it a line index rather than a set index), and the block offset. The number of bits in each field depends on the cache parameters. The block offset requires log₂(block_size) bits. The line index requires log₂(number_of_lines) bits. The remaining high-order bits form the tag.
Let us work through a detailed example. Consider a direct-mapped cache with the following specifications: total cache size of 1 KB (1024 bytes), block size of 16 bytes, and a 16-bit address space. First, we calculate the number of cache lines: 1024 bytes ÷ 16 bytes per line equals 64 lines. The block offset requires log₂(16) equals 4 bits to address any byte within the 16-byte block. The line index requires log₂(64) equals 6 bits to select among the 64 lines. The tag gets the remaining bits: 16 - 4 - 6 equals 6 bits.
Direct-Mapped Cache Example:
Cache size: 1 KB (1024 bytes)
Block size: 16 bytes
Address space: 16 bits (64 KB addressable)
Number of lines: 1024 / 16 = 64 lines
Address breakdown (16 bits):
+------+----------+--------------+
| Tag | Line | Block Offset |
| 6b | Index 6b | 4b |
+------+----------+--------------+
15 10 9 4 3 0
Cache Structure (64 lines, 1 way):
Way 0
+---+----+-------+
L 0: | V | Tag| Data | 16 bytes
+---+----+-------+
L 1: | V | Tag| Data |
+---+----+-------+
L 2: | V | Tag| Data |
+---+----+-------+
...
+---+----+-------+
L63: | V | Tag| Data |
+---+----+-------+
Now let us trace several memory accesses through this cache. Consider accessing address 0x0A34 (binary: 0000101000110100). Breaking down this address: bits [15:10] give tag equals 0b000010 (decimal 2), bits [9:4] give line index equals 0b100011 (decimal 35), and bits [3:0] give block offset equals 0b0100 (decimal 4). The processor checks line 35. If the valid bit is set and the stored tag equals 2, we have a hit and return byte 4 from that line's data block. If not, we have a miss.
Address 0x0A34 access:
Binary: 0000 10 | 10 0011 | 0100
Tag: 2 Index: 35 Offset: 4
Check line 35:
+---+----+-------+
L35: | 1 | 02 | Data | <-- V=1, Tag matches (02), HIT!
+---+----+-------+
^
|
Return byte at offset 4
Next, consider accessing address 0x4A34 (binary: 0100101000110100). The tag is 0b010010 (decimal 18), the line index is 0b100011 (decimal 35), and the block offset is 0b0100 (decimal 4). Notice that this address maps to the same line 35 as the previous access but has a different tag. If line 35 still contains the block from address 0x0A34 (tag 2), we have a conflict miss. The cache must evict the old block and load the new block with tag 18.
Address 0x4A34 access (conflict with previous):
Binary: 0100 10 | 10 0011 | 0100
Tag: 18 Index: 35 Offset: 4
Check line 35:
+---+----+-------+
L35: | 1 | 02 | Data | <-- V=1, Tag doesn't match (02≠18), MISS!
+---+----+-------+
After replacing:
+---+----+-------+
L35: | 1 | 18 | Data | <-- New block loaded, tag=18
+---+----+-------+
This example illustrates the primary weakness of direct-mapped caches: conflict misses. If a program alternately accesses addresses 0x0A34 and 0x4A34, the cache will miss on every access because both addresses map to line 35 but have different tags. This thrashing behavior can severely degrade performance. Despite this limitation, direct-mapped caches are popular in practice because of their simplicity, speed, and low hardware cost. Many modern L1 caches use more complex associative designs, but direct mapping remains important in certain contexts and as a conceptual building block.
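The thrashing behavior described above is easy to demonstrate with a minimal model. The sketch below tracks only tags (no data or valid-bit subtleties) and uses the 1 KB, 16-byte-block configuration from the example; the class and method names are ours:

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model (tags only): an illustrative sketch."""

    def __init__(self, num_lines=64, block_size=16):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines  # None means the line is invalid

    def access(self, addr):
        """Return 'hit' or 'miss' for a byte address, loading on a miss."""
        block = addr // self.block_size
        index = block % self.num_lines   # line index field
        tag = block // self.num_lines    # tag field
        if self.tags[index] == tag:
            return "hit"
        self.tags[index] = tag           # evict whatever was there
        return "miss"

cache = DirectMappedCache()
# 0x0A34 and 0x4A34 both map to line 35 with different tags,
# so alternating between them misses on every single access:
print([cache.access(a) for a in [0x0A34, 0x4A34] * 3])
# ['miss', 'miss', 'miss', 'miss', 'miss', 'miss']
```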
Set-Associative Caches: Multiple Ways Per Set
Set-associative caches strike a balance between the simplicity of direct-mapped caches and the flexibility of fully-associative caches. In an N-way set-associative cache, each set contains N ways (N cache lines). A memory block can reside in any of the N ways within its designated set, reducing conflict misses compared to direct mapping. The hardware must compare the address tag against all N ways in parallel, which increases complexity and power consumption but remains practical for moderate associativity levels.
The address breakdown for a set-associative cache is similar to direct mapping, but we now have fewer sets than total cache lines. For a cache with C total lines and N-way associativity, we have C/N sets. The set index field requires log₂(C/N) bits, the block offset remains log₂(block_size) bits, and the tag gets the remaining bits. The increased tag size compared to direct mapping (because we have fewer index bits) provides better disambiguation between different memory blocks.
Let us examine a concrete example: a 4-way set-associative cache with 2 KB total capacity, 32-byte blocks, and a 16-bit address space. The number of cache lines is 2048 bytes ÷ 32 bytes equals 64 lines. With 4-way associativity, we have 64 ÷ 4 equals 16 sets. The block offset requires log₂(32) equals 5 bits. The set index requires log₂(16) equals 4 bits. The tag requires 16 - 5 - 4 equals 7 bits.
4-Way Set-Associative Cache Example:
Cache size: 2 KB (2048 bytes)
Block size: 32 bytes
Associativity: 4-way
Address space: 16 bits
Number of lines: 2048 / 32 = 64 lines
Number of sets: 64 / 4 = 16 sets
Address breakdown (16 bits):
+-------+----------+--------------+
| Tag | Set | Block Offset |
| 7b | Index 4b | 5b |
+-------+----------+--------------+
15 9 8 5 4 0
Cache Structure (16 sets, 4 ways):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 0: | V | T | D | | V | T | D | | V | T | D | | V | T | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 1: | V | T | D | | V | T | D | | V | T | D | | V | T | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
...
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S15: | V | T | D | | V | T | D | | V | T | D | | V | T | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
V = Valid bit, T = Tag (7 bits), D = Data (32 bytes)
Let us trace memory accesses through this cache. Consider accessing address 0x15A8 (binary: 0001010110101000). The tag is bits [15:9] equals 0b0001010 (decimal 10), the set index is bits [8:5] equals 0b1101 (decimal 13), and the block offset is bits [4:0] equals 0b01000 (decimal 8). The processor checks all four ways of set 13 in parallel. If any way has valid equals 1 and tag equals 10, we hit. If not, we miss and must choose which way to replace.
Address 0x15A8 access:
Binary: 0001010 | 1101 | 01000
Tag: 10 Set:13 Offset: 8
Check set 13 (all ways in parallel):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S13: | 1 | 05| D | | 1 | 10| D | | 1 | 22| D | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
^
|
Tag matches! HIT in Way 1
Return byte at offset 8
Now consider accessing address 0x8DA8 (binary: 1000110110101000). The tag is 0b1000110 (decimal 70), the set index is 0b1101 (decimal 13, same as before), and the block offset is 0b01000 (decimal 8). This address also maps to set 13 but has a different tag. The processor checks all four ways of set 13. Suppose none of the tags match. We have a miss and must load the block from memory.
Address 0x8DA8 access:
Binary: 1000110 | 1101 | 01000
Tag: 70 Set:13 Offset: 8
Check set 13:
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S13: | 1 | 05| D | | 1 | 10| D | | 1 | 22| D | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
No tag matches (05≠70, 10≠70, 22≠70), MISS!
Way 3 is invalid, use it for replacement:
After loading:
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S13: | 1 | 05| D | | 1 | 10| D | | 1 | 22| D | | 1 | 70| D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
The beauty of set-associative caches becomes apparent when we compare them to direct-mapped caches. In our earlier direct-mapped example, addresses 0x0A34 and 0x4A34 conflicted because they mapped to the same line. In a set-associative cache, multiple addresses mapping to the same set can coexist peacefully as long as we have enough ways available. If a program alternates between four different memory blocks that all map to set 13, our 4-way cache can hold all of them simultaneously, while a direct-mapped cache would thrash miserably.
The choice of associativity level involves tradeoffs. Higher associativity reduces conflict misses but increases hardware complexity, power consumption, and access time. Each additional way requires another tag comparator and another multiplexer to select the correct data. Most modern processors use 4-way or 8-way set-associative L1 caches, while L2 and L3 caches often use 8-way or 16-way associativity. The specific choice depends on the target workload characteristics and the available silicon budget.
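The coexistence claim can be verified with a small model of the 2 KB, 4-way, 32-byte-block cache from the example. This sketch tracks tags only and uses a deliberately naive eviction rule (evict way 0 when the set is full) as a stand-in for a real replacement policy; the three extra addresses below are our own, chosen so that all four addresses map to set 13:

```python
class SetAssociativeCache:
    """Minimal N-way set-associative cache model (tags only): a sketch."""

    def __init__(self, num_sets=16, ways=4, block_size=32):
        self.num_sets = num_sets
        self.block_size = block_size
        self.sets = [[None] * ways for _ in range(num_sets)]

    def access(self, addr):
        """Return 'hit' or 'miss' for a byte address, loading on a miss."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            return "hit"
        if None in ways:
            ways[ways.index(None)] = tag  # fill an invalid way first
        else:
            ways[0] = tag                 # naive eviction (placeholder policy)
        return "miss"

cache = SetAssociativeCache()
addrs = [0x15A8, 0x35A8, 0x55A8, 0x75A8]  # all map to set 13
first = [cache.access(a) for a in addrs]   # compulsory misses fill the set
second = [cache.access(a) for a in addrs]  # all four blocks now coexist
print(first, second)
# ['miss', 'miss', 'miss', 'miss'] ['hit', 'hit', 'hit', 'hit']
```

In a direct-mapped cache of the same capacity, the same four addresses would evict one another on every access.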
Fully-Associative Caches: Maximum Flexibility
A fully-associative cache represents the opposite extreme from direct mapping. The entire cache is a single set containing all cache lines as separate ways. Any memory block can reside in any cache line, providing maximum flexibility and eliminating conflict misses entirely. The only misses in a fully-associative cache are compulsory misses (first access to a block) and capacity misses (cache is full and a block must be evicted). However, this flexibility comes at a steep hardware cost.
In a fully-associative cache, the address breakdown has only two fields: the tag and the block offset. There is no index field because there is only one set. The tag must distinguish between all possible memory blocks that could fit in the cache, so it contains all address bits except the block offset bits. For every memory access, the cache must compare the tag against every single cache line in parallel, requiring many tag comparators and significantly more power than set-associative designs.
Consider a fully-associative cache with 32 KB capacity, 64-byte blocks, and a 32-bit address space. The number of cache lines is 32768 bytes ÷ 64 bytes equals 512 lines. All 512 lines exist in a single set (512-way associativity). The block offset requires log₂(64) equals 6 bits. There is no index field (or equivalently, 0 bits for the index). The tag requires 32 - 6 equals 26 bits.
Fully-Associative Cache Example:
Cache size: 32 KB (32768 bytes)
Block size: 64 bytes
Associativity: Fully associative (512-way)
Address space: 32 bits
Number of lines: 32768 / 64 = 512 lines
Number of sets: 1 set
Address breakdown (32 bits):
+--------------------+--------------+
| Tag | Block Offset |
| 26 bits | 6 bits |
+--------------------+--------------+
31 6 5 0
Cache Structure (1 set, 512 ways):
Single Set with 512 Ways:
Way0 Way1 Way2 Way510 Way511
+---+ +---+ +---+ +---+ +---+
|V|T| |V|T| |V|T| ... |V|T| |V|T|
|D | |D | |D | |D | |D |
+---+ +---+ +---+ +---+ +---+
All tags compared in parallel!
V = Valid, T = Tag (26 bits), D = Data (64 bytes)
Let us trace a memory access through this fully-associative cache. Consider address 0x12345678 (binary: 00010010001101000101011001111000). The tag is bits [31:6] equals 0b00010010001101000101011001 (we keep this in hex for readability: 0x48D159), and the block offset is bits [5:0] equals 0b111000 (decimal 56). The cache must check all 512 ways simultaneously to see if any valid line has a matching tag.
Address 0x12345678 access:
Binary: 00010010001101000101011001 | 111000
Tag: 0x48D159 (26 bits) Offset: 56 (6 bits)
Check all 512 ways in parallel:
Way0 Way1 Way2 ... Way237 ... Way511
+---+ +---+ +---+ +---+ +---+
|1|..| |1|..| |0|..| ... |1|0x48D159| ... |1|..|
|...| |...| |...| |Data block | |...|
+---+ +---+ +---+ +---+ +---+
^
|
Tag matches! HIT in Way 237
Return byte at offset 56
If no tag matches, we have a miss. The cache must decide which of the 512 lines to evict to make room for the new block. This is where replacement policies become critical, which we will explore in detail shortly. Unlike direct-mapped caches where replacement is forced (the single mapped line must be evicted) or low-associativity set-associative caches where the choice is among a few ways, fully-associative caches must choose among hundreds of candidates.
The hardware cost of fully-associative caches limits their practical use to small, specialized applications. Translation Lookaside Buffers (TLBs), which cache virtual-to-physical address translations, are often fully-associative because they are small (dozens to hundreds of entries) and the flexibility is valuable for reducing translation misses. Some victim caches and other specialized structures also use fully-associative organization. However, you will not find fully-associative designs in large L1, L2, or L3 caches due to the prohibitive cost of hundreds or thousands of parallel tag comparators.
The spectrum from direct-mapped to set-associative to fully-associative represents a fundamental tradeoff in computer architecture. Direct-mapped caches are fast, simple, and cheap but suffer from conflicts. Fully-associative caches eliminate conflicts but are expensive and slow. Set-associative caches provide the pragmatic middle ground, offering enough flexibility to reduce most conflicts while keeping hardware costs reasonable. The specific choice for any given cache depends on its position in the hierarchy, the workload characteristics, and the overall system design constraints.
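Because a fully-associative cache has no index field, decoding an address reduces to a shift and a mask. A minimal sketch, using the 64-byte-block, 32-bit configuration from the example (function name ours):

```python
def fully_associative_fields(addr, block_size=64):
    """Split an address into (tag, block offset) for a fully-associative
    cache. There is no index field because there is only one set.
    Assumes block_size is a power of two."""
    offset_bits = block_size.bit_length() - 1  # log2 of a power of two
    tag = addr >> offset_bits
    offset = addr & (block_size - 1)
    return tag, offset

tag, offset = fully_associative_fields(0x12345678)
print(hex(tag), offset)  # 0x48d159 56
```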
Why Replacement Policies Matter
Replacement policies become necessary whenever a cache miss occurs and all available locations for the new block are already occupied. In a direct-mapped cache, replacement is trivial because there is only one possible location for each block. If that location is occupied, its contents must be evicted to make room for the new block. However, in set-associative and fully-associative caches, we have choices. When a miss occurs and all ways in the target set are valid, which way should we evict? This decision can significantly impact performance.
The fundamental challenge is predicting the future. The optimal replacement policy would evict the cache line that will not be needed for the longest time. Unfortunately, this requires knowing future memory access patterns, which is generally impossible. Instead, practical replacement policies use heuristics based on past behavior, making reasonable assumptions about which blocks are least likely to be needed soon. Different policies embody different assumptions and involve different implementation costs.
Consider a simple scenario: a 2-way set-associative cache where both ways of set 5 are occupied. A memory access requires loading a new block into set 5. Should we evict the block in way 0 or way 1? If we make the wrong choice and evict a block that will be accessed again soon, we will suffer another miss immediately. If we make the right choice and evict a block that will not be needed for a while, we avoid unnecessary misses. The quality of our replacement decisions directly affects the miss rate and overall system performance.
Common Replacement Policies
Several replacement policies have been developed and deployed in real systems, each with distinct characteristics. The Least Recently Used (LRU) policy evicts the cache line that has not been accessed for the longest time. LRU embodies the principle of temporal locality: if a line has not been used recently, it probably will not be needed soon. LRU typically provides good performance but requires tracking the access order of all lines in a set, which becomes expensive as associativity increases.
The First-In-First-Out (FIFO) policy evicts the oldest cache line regardless of its access pattern. FIFO is simpler than LRU because it only needs to track when each line was loaded, not every subsequent access. However, FIFO can perform poorly when a line loaded long ago continues to be accessed frequently. The Random replacement policy simply selects a victim at random from the available ways. Random replacement requires minimal hardware (just a random or pseudo-random number generator) and surprisingly performs reasonably well in many workloads, avoiding the pathological cases that can plague deterministic policies.
The Least Frequently Used (LFU) policy tracks how many times each line has been accessed and evicts the line with the lowest access count. LFU can identify truly cold data but struggles with changing access patterns. A block accessed many times in the distant past may have a high count even though it is no longer needed, while a newly loaded block that will be accessed frequently has a low count initially. The Not Recently Used (NRU) policy is a simplified approximation of LRU that requires less hardware. It uses a single reference bit per line, set when the line is accessed. Periodically, all reference bits are cleared. When replacement is needed, NRU evicts a line with reference bit equals zero, giving preference to recently accessed lines without tracking exact order.
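LRU behavior within one set can be modeled compactly with an ordered dictionary, where position encodes recency. This is an illustrative sketch, not how hardware tracks LRU state (real implementations use age counters or pseudo-LRU bit trees); the tag 0x5E in the final example is a hypothetical fifth block of our own:

```python
from collections import OrderedDict

def lru_trace(ways, tags):
    """Simulate LRU replacement within a single cache set.
    `ways` is the associativity; `tags` is the sequence of block tags
    accessed. Returns the list of 'hit'/'miss' outcomes."""
    lines = OrderedDict()  # keys ordered oldest -> most recently used
    outcomes = []
    for tag in tags:
        if tag in lines:
            lines.move_to_end(tag)         # refresh recency on a hit
            outcomes.append("hit")
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[tag] = None
            outcomes.append("miss")
    return outcomes

# Four tags fill a 4-way set; re-touching 0x2B makes 0x1A the LRU victim,
# so loading a fifth tag (0x5E) evicts 0x1A and it misses afterwards:
print(lru_trace(4, [0x1A, 0x2B, 0x3C, 0x4D, 0x2B, 0x5E, 0x1A]))
# ['miss', 'miss', 'miss', 'miss', 'hit', 'miss', 'miss']
```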
LRU Replacement in Detail
Let us trace LRU replacement through a concrete example. Consider a 4-way set-associative cache with 64-byte blocks. We will focus on set 8 and observe how LRU manages the four ways as memory accesses occur. Initially, all four ways are invalid. We will use a simple encoding to track LRU order: each way has an age counter from 0 (most recently used) to 3 (least recently used). On every access to the set, we update the ages.
4-Way Set-Associative Cache - Set 8 LRU Example:
Initial state: All ways invalid
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 0 | --| - | | 0 | --| - | | 0 | --| - | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
V T D V T D V T D V T D
Age: - Age: - Age: - Age: -
Access 1: Load block with tag 0x1A into set 8. All ways are invalid, so we choose way 0. After loading, way 0 has age 0 (most recent).
After Access 1 (Tag 0x1A):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 0 | --| - | | 0 | --| - | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 0 Age: - Age: - Age: -
Access 2: Load block with tag 0x2B into set 8. Way 1 is invalid, use it. Way 0 ages to 1, way 1 becomes 0.
After Access 2 (Tag 0x2B):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 0 | --| - | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 1 Age: 0 Age: - Age: -
LRU order: Way1(0), Way0(1)
Access 3: Load block with tag 0x3C. Way 2 is used. Ages update: way 2 becomes 0, way 1 becomes 1, way 0 becomes 2.
After Access 3 (Tag 0x3C):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |3C | D | | 0 | --| - |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 2 Age: 1 Age: 0 Age: -
LRU order: Way2(0), Way1(1), Way0(2)
Access 4: Load block with tag 0x4D. Way 3 is used, all ways now valid.
After Access 4 (Tag 0x4D):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 3 Age: 2 Age: 1 Age: 0
LRU order: Way3(0), Way2(1), Way1(2), Way0(3) <- LRU victim
Access 5: Hit on tag 0x2B (way 1). No new load, but ages update. Way 1 becomes 0, ways 3 and 2 age.
After Access 5 (Tag 0x2B - HIT on Way 1):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 3 Age: 0 Age: 2 Age: 1
LRU order: Way1(0), Way3(1), Way2(2), Way0(3) <- LRU victim
Access 6: Miss on tag 0x5E. All ways valid, must evict LRU. Way 0 has age 3 (LRU), evict it.
After Access 6 (Tag 0x5E - MISS, evict Way 0):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |5E | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Age: 0 Age: 1 Age: 3 Age: 2
LRU order: Way0(0), Way1(1), Way3(2), Way2(3) <- LRU victim
(Tag 0x1A evicted from Way 0)
This example demonstrates how LRU maintains an ordered list of cache lines by recency. Each access updates the order, ensuring the least recently used line is always identifiable. Implementing true LRU for high associativity is complex. For a 4-way cache, we need 2 bits per way to encode ages 0-3, and we must update multiple age values on every access. For an 8-way cache, we need 3 bits per way and even more complex update logic. This is why many practical implementations use pseudo-LRU approximations that provide similar behavior with less hardware.
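The age-counter update rule used in this trace can be written down compactly. The following sketch is a plain-Python illustration of the bookkeeping, not a hardware description: on a hit, only ways younger than the touched way's old age are incremented; on a fill of an invalid way, every valid way ages.

```python
def touch(ages, way):
    """Age-counter LRU update for one set: 0 = most recently used,
    None = invalid way."""
    old = ages[way]
    for w, a in enumerate(ages):
        if w == way or a is None:
            continue
        if old is None or a < old:   # fill ages everyone; hit ages the younger
            ages[w] = a + 1
    ages[way] = 0                    # touched way becomes MRU

ages = [None] * 4
for way in [0, 1, 2, 3, 1]:          # accesses 1-5 from the example
    touch(ages, way)
print(ages)                          # [3, 0, 2, 1] -- state after access 5
```

Selecting the victim for access 6 means finding the way with the largest age (way 0, holding tag 0x1A), and refilling it via the same update rule yields the ages shown after access 6.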
FIFO and Random Replacement Examples
FIFO replacement maintains a simpler invariant: it tracks the order in which lines were loaded and always evicts the oldest. Let us examine the same sequence of accesses using FIFO on the same 4-way set.
4-Way Set-Associative Cache - Set 8 FIFO Example:
Access 1-4: Load tags 0x1A, 0x2B, 0x3C, 0x4D into ways 0, 1, 2, 3.
After Access 4:
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Load order: 1st 2nd 3rd 4th
FIFO victim: Way0 (oldest)
Access 5: Hit on tag 0x2B. Unlike LRU, FIFO does not update the replacement order on hits. Way 0 remains the oldest.
After Access 5 (Tag 0x2B - HIT on Way 1):
Same as above, FIFO order unchanged.
FIFO victim: Way0 (oldest)
Access 6: Miss on tag 0x5E. Evict way 0 (oldest), load new block into way 0. Way 1 is now oldest.
After Access 6 (Tag 0x5E - MISS, evict Way 0):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |5E | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Load order: 5th 2nd 3rd 4th
FIFO victim: Way1 (oldest)
(Tag 0x1A evicted from Way 0)
In this trace, LRU and FIFO happen to agree: both evict way 0 at access 6, because tag 0x1A was never touched again after being loaded. The difference lies in how hits affect the eviction order. The hit on 0x2B moved way 1 to the most-recently-used position under LRU, while FIFO ignored the hit entirely. Suppose access 5 had instead been a hit on tag 0x1A. LRU would have promoted way 0 to most recently used, and access 6 would have evicted 0x2B instead; FIFO would still have evicted 0x1A, discarding a block that had just been used. This illustrates why LRU can perform better: it adapts to the actual access pattern rather than just the load order.
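The divergence can be made concrete with a small sketch (illustrative Python, not modeled on any real hardware) that runs the alternative sequence, re-touching 0x1A before the final miss, under both policies. A single per-way stamp suffices: LRU refreshes it on hits, FIFO does not.

```python
def run(policy, seq, ways=4):
    """Return the set of tags surviving after `seq` under LRU or FIFO."""
    tags, stamp, clock = [], [], 0
    for t in seq:
        clock += 1
        if t in tags:
            if policy == "LRU":
                stamp[tags.index(t)] = clock   # FIFO ignores hits
            continue
        if len(tags) < ways:                   # invalid way available
            tags.append(t)
            stamp.append(clock)
        else:                                  # evict the oldest stamp
            v = stamp.index(min(stamp))
            tags[v], stamp[v] = t, clock
    return set(tags)

seq = [0x1A, 0x2B, 0x3C, 0x4D, 0x1A, 0x5E]    # re-touch 0x1A before the miss
print(0x1A in run("LRU", seq))    # True: LRU protects the re-used block
print(0x1A in run("FIFO", seq))   # False: FIFO evicts it anyway
```

Under LRU the miss on 0x5E evicts 0x2B, the true least recently used line, while FIFO blindly evicts the oldest load.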
Random replacement is even simpler to demonstrate. At each replacement decision, we generate a random number to select the victim way. Suppose our random number generator produces the sequence: 2, 1, 3, 0, ...
4-Way Set-Associative Cache - Set 8 Random Replacement Example:
After Access 4 (all ways loaded):
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |3C | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
Access 6: Miss on 0x5E. Random selection: way 2. Evict way 2.
After Access 6:
Way 0 Way 1 Way 2 Way 3
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
S 8: | 1 |1A | D | | 1 |2B | D | | 1 |5E | D | | 1 |4D | D |
+---+---+---+ +---+---+---+ +---+---+---+ +---+---+---+
(Tag 0x3C evicted from Way 2)
Random replacement does not track access patterns or load order, so it makes no promises about optimality. However, it avoids systematic biases that can cause pathological behavior under certain access patterns. For instance, a program that repeatedly cycles through N+1 different blocks that all map to an N-way set will thrash under LRU or FIFO, missing on every access, but has a chance to hit with random replacement. Random policies are popular in large caches where tracking precise LRU information is too expensive.
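This pathological case is easy to demonstrate. The sketch below (illustrative Python; the access count and seed are arbitrary choices of this sketch) cycles through five tags in one 4-way set: FIFO never hits, while random replacement leaves each block in place often enough to score some hits.

```python
import random

def hits(policy, ways=4, blocks=5, accesses=4000, seed=1):
    """Hit count when cyclically scanning `blocks` tags that all map
    to one `ways`-way set (blocks = ways + 1 is the thrashing case)."""
    random.seed(seed)
    tags, count = [], 0
    for i in range(accesses):
        t = i % blocks                         # 0,1,2,3,4,0,1,... forever
        if t in tags:
            count += 1
        elif len(tags) < ways:
            tags.append(t)                     # fill an invalid way
        elif policy == "FIFO":
            tags.pop(0)                        # evict the oldest load...
            tags.append(t)                     # ...which is exactly the
        else:                                  # block needed next: thrash
            tags[random.randrange(ways)] = t   # random victim
    return count

print(hits("FIFO"))     # 0: every access after warm-up misses
```

With FIFO (or LRU) the evicted block is always the one the cyclic pattern will request next, so the hit count stays at zero; random eviction breaks that lockstep.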
Comparing Replacement Policies
Different replacement policies exhibit different strengths and weaknesses. LRU generally provides the best hit rate because it most accurately captures temporal locality. Programs tend to reuse recently accessed data, so evicting the least recently used line is usually a good heuristic. However, LRU struggles with certain pathological patterns, such as repeatedly scanning an array slightly larger than the cache, where LRU evicts each block just before it is needed again and every access misses. The hardware cost of true LRU also grows quickly with associativity, making it impractical for highly associative caches.
FIFO is simpler than LRU and works reasonably well when access patterns align with load order. If a program loads several blocks and then proceeds to use them in roughly the same order they were loaded, FIFO approximates LRU. However, FIFO fails when frequently accessed blocks happen to be old. A block loaded early that continues to be accessed will eventually be evicted simply because it is old, even though it is still hot. This makes FIFO less robust than LRU across diverse workloads.
Random replacement is the simplest to implement and surprisingly competitive. While it does not capture locality like LRU, it avoids the worst-case behaviors that can plague deterministic policies. Random replacement performs poorly on workloads with strong, consistent access patterns where LRU shines, but it handles pathological cases and mixed workloads more gracefully. Many large L3 caches use random or pseudo-random replacement because the hardware cost of tracking LRU for 16-way or higher associativity is prohibitive.
In practice, many caches use pseudo-LRU algorithms that approximate true LRU with less hardware. One common approach uses a binary tree of bits to track relative ages without storing exact ordering. For a 4-way set, a 3-bit tree can identify an approximate LRU victim with much simpler update logic than true LRU. These approximations sacrifice some hit rate improvement for significant hardware savings, representing yet another point in the vast space of cache design tradeoffs.
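One possible encoding of the 3-bit tree for a 4-way set is sketched below. The bit convention used here (each bit points toward the colder half, 0 meaning left) is an assumption of this sketch; real designs vary in their encodings.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set: three bits instead of
    true LRU's per-way age counters."""
    def __init__(self):
        self.b0 = 0   # root: which half (0 = left) holds the victim
        self.b1 = 0   # victim within the left pair (ways 0/1)
        self.b2 = 0   # victim within the right pair (ways 2/3)

    def touch(self, way):
        """On any access, point the bits on the path away from `way`."""
        if way in (0, 1):
            self.b0, self.b1 = 1, 1 - way   # victim is now in the right half
        else:
            self.b0, self.b2 = 0, 3 - way   # victim is now in the left half

    def victim(self):
        if self.b0 == 0:
            return self.b1       # way 0 or 1
        return 2 + self.b2       # way 2 or 3

p = TreePLRU4()
for w in [0, 1, 2, 3]:
    p.touch(w)
print(p.victim())   # 0: agrees with true LRU after this sequence
```

Updating on an access touches only the bits along one root-to-leaf path, rather than every age counter in the set. The price is approximation: after a subsequent hit on way 1, this tree names way 2 as the victim even though way 0 is the true LRU line.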
Memory Access Time Through the Cache Hierarchy
Understanding cache performance requires quantifying the time cost of memory accesses. The effective memory access time depends on whether each access hits or misses at each level of the cache hierarchy. When we have a hit at L1, the access completes quickly. When we miss at L1, we must check L2, incurring additional delay. If we miss at L2, we check L3, and so on, until we either hit or reach main memory. The cumulative effect of these hit and miss times determines overall system performance.
Let us establish some terminology and basic formulas. The hit time for a cache level is the time required to access that level and determine whether we have a hit or miss. The miss penalty is the additional time required when we miss at that level, which includes accessing the next level of the hierarchy. The hit rate is the fraction of accesses that hit at a given level. The miss rate is simply one minus the hit rate. Using these definitions, we can calculate the average memory access time.
For a simple single-level cache system with just L1 and main memory, the average memory access time follows a straightforward formula. We either hit in L1 (probability equals hit_rate_L1) and pay only the L1 hit time, or we miss in L1 (probability equals miss_rate_L1) and pay both the L1 hit time (we still had to check L1) and the main memory access time. The formula is:
Single-Level Cache Average Memory Access Time:
AMAT = hit_time_L1 + (miss_rate_L1 × memory_access_time)
Where:
hit_time_L1 = time to access L1 cache
miss_rate_L1 = fraction of accesses that miss in L1
memory_access_time = time to access main memory on L1 miss
Example with realistic values:
hit_time_L1 = 1 cycle
miss_rate_L1 = 0.05 (5% miss rate, 95% hit rate)
memory_access_time = 100 cycles
AMAT = 1 + (0.05 × 100)
= 1 + 5
= 6 cycles average per memory access
This example shows why even a small cache with a reasonable hit rate dramatically improves performance. With a 95% hit rate, the average access time is only 6 cycles instead of 100 cycles. The cache effectively hides the memory latency for the vast majority of accesses. However, real systems have multiple cache levels, which complicates the calculation but provides even better performance.
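The single-level formula translates directly into a one-line function. This is a minimal sketch of the calculation above (the function name amat is invented here).

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: the hit time is always paid,
    the miss penalty only on the miss fraction."""
    return hit_time + miss_rate * miss_penalty

# The worked example: 1-cycle L1, 5% miss rate, 100-cycle memory.
print(amat(1, 0.05, 100))   # 6.0 cycles
```

Lowering the miss rate to 1% with the same parameters would cut the average to 2 cycles, a preview of the sensitivity analysis later in this section.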
Multi-Level Cache Access Time
In a multi-level cache hierarchy with L1, L2, and L3, the access time calculation becomes recursive. When we miss at L1, we access L2. If we hit at L2, we pay the L1 hit time plus the L2 hit time. If we miss at L2, we must access L3, and so on. The key insight is that each level's miss penalty is the access time for the next level, which itself depends on that level's hit and miss rates.
Let us build this calculation step by step. Consider a system with L1, L2, L3, and main memory. Define the following parameters:
Cache Hierarchy Parameters:
L1: hit_time = 1 cycle, miss_rate = 0.05 (5%)
L2: hit_time = 10 cycles, miss_rate = 0.20 (20% of L1 misses)
L3: hit_time = 40 cycles, miss_rate = 0.50 (50% of L2 misses)
Memory: access_time = 200 cycles
Note: L2 miss rate is the LOCAL miss rate (misses in L2 divided by
accesses to L2), not the GLOBAL miss rate (misses in L2 divided by
all memory accesses).
First, we calculate the effective access time for L3. When we access L3, we either hit there or miss and go to main memory:
L3 effective access time:
= L3_hit_time + (L3_miss_rate × memory_access_time)
= 40 + (0.50 × 200)
= 40 + 100
= 140 cycles
Next, we calculate the effective access time for L2. When we access L2, we either hit there or miss and proceed to L3:
L2 effective access time:
= L2_hit_time + (L2_miss_rate × L3_effective_access_time)
= 10 + (0.20 × 140)
= 10 + 28
= 38 cycles
Finally, we calculate the overall average memory access time. When we access memory, we always check L1 first. We either hit in L1 or miss and proceed to L2:
Overall AMAT:
= L1_hit_time + (L1_miss_rate × L2_effective_access_time)
= 1 + (0.05 × 38)
= 1 + 1.9
= 2.9 cycles average per memory access
This is remarkable. Despite main memory requiring 200 cycles to access, the average memory access completes in under 3 cycles thanks to the multi-level cache hierarchy. The three cache levels work together to hide the memory latency for nearly all accesses. Let us trace through some specific access scenarios to see how this plays out in practice.
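The recursive calculation is naturally expressed as a bottom-up fold over the hierarchy. The sketch below is illustrative Python (the function name effective_time is invented for this sketch) and reproduces the 140-, 38-, and 2.9-cycle results derived above.

```python
def effective_time(levels, memory_time):
    """Fold the hierarchy bottom-up: each level's miss penalty is the
    effective access time of everything beneath it.
    `levels` is a list of (hit_time, local_miss_rate) pairs, L1 first."""
    t = memory_time
    for hit_time, miss_rate in reversed(levels):
        t = hit_time + miss_rate * t
    return t

# Parameters from the example: L1, L2, L3, then 200-cycle main memory.
hierarchy = [(1, 0.05), (10, 0.20), (40, 0.50)]
amat = effective_time(hierarchy, 200)
print(amat)   # ~2.9 cycles
```

Because each iteration applies the same hit-time-plus-weighted-penalty formula, adding or removing a cache level is just a change to the list.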
Detailed Access Time Examples
Consider a sequence of memory accesses through our three-level cache hierarchy. We will calculate the actual time cost for each access based on where it hits or misses. This concrete example will illustrate how the average access time emerges from the distribution of hits and misses across the hierarchy.
Access Scenario 1: L1 Hit
Access: Read from address 0x1000
L1: Check cache, tag matches → HIT!
Total time: 1 cycle
The best case: data found immediately in L1.
Access Scenario 2: L1 Miss, L2 Hit
Access: Read from address 0x2000
L1: Check cache, tag doesn't match → MISS (1 cycle)
L2: Check cache, tag matches → HIT! (10 cycles)
Total time: 1 + 10 = 11 cycles
Common case for data recently evicted from L1 but still in L2.
Access Scenario 3: L1 Miss, L2 Miss, L3 Hit
Access: Read from address 0x3000
L1: Check cache, tag doesn't match → MISS (1 cycle)
L2: Check cache, tag doesn't match → MISS (10 cycles)
L3: Check cache, tag matches → HIT! (40 cycles)
Total time: 1 + 10 + 40 = 51 cycles
Less common, but still much faster than going to main memory.
Access Scenario 4: L1 Miss, L2 Miss, L3 Miss, Memory Access
Access: Read from address 0x4000
L1: Check cache, tag doesn't match → MISS (1 cycle)
L2: Check cache, tag doesn't match → MISS (10 cycles)
L3: Check cache, tag doesn't match → MISS (40 cycles)
Memory: Fetch from DRAM (200 cycles)
Total time: 1 + 10 + 40 + 200 = 251 cycles
Worst case: data must be fetched all the way from main memory.
This is why we desperately want high hit rates!
Now let us consider a realistic access sequence and compute the overall average. Suppose we have 1000 memory accesses with the following distribution based on our hit and miss rates:
Distribution of 1000 memory accesses:
L1 hits: 950 accesses (95% of 1000)
Cost: 950 × 1 = 950 cycles
L1 misses, L2 hits: 40 accesses (80% of the 50 L1 misses)
Cost: 40 × 11 = 440 cycles
L1 misses, L2 misses, L3 hits: 5 accesses (50% of the 10 L2 misses)
Cost: 5 × 51 = 255 cycles
L1 misses, L2 misses, L3 misses: 5 accesses (50% of the 10 L2 misses)
Cost: 5 × 251 = 1255 cycles
Total time for 1000 accesses: 950 + 440 + 255 + 1255 = 2900 cycles
Average time per access: 2900 / 1000 = 2.9 cycles
This matches our calculated average memory access time of 2.9 cycles from the formula. The majority of accesses (950 out of 1000) hit in L1 and complete in just 1 cycle each. The 40 accesses that miss in L1 but hit in L2 add some overhead but remain relatively fast. The rare accesses that miss in both L1 and L2 contribute disproportionately to the total time, but because they are so infrequent, they do not dominate the average.
Impact of Cache Parameters on Performance
The effective memory access time is highly sensitive to cache hit rates. Even small changes in miss rates can significantly impact performance. Let us examine how varying the L1 miss rate affects the overall average memory access time, keeping other parameters constant.
Impact of L1 Miss Rate on AMAT:
L1 miss rate = 0.01 (1%):
AMAT = 1 + (0.01 × 38) = 1.38 cycles
L1 miss rate = 0.05 (5%):
AMAT = 1 + (0.05 × 38) = 2.90 cycles
L1 miss rate = 0.10 (10%):
AMAT = 1 + (0.10 × 38) = 4.80 cycles
L1 miss rate = 0.20 (20%):
AMAT = 1 + (0.20 × 38) = 8.60 cycles
The difference between a 1% and 20% L1 miss rate is a factor of 6×
in average memory access time! This demonstrates why cache optimization
is critical for performance.
Similarly, the size and associativity of caches affect miss rates and thus performance. A larger cache can hold more data, reducing capacity misses. Higher associativity reduces conflict misses. However, both come at the cost of increased access time and hardware complexity. Consider the tradeoff between cache size and access time:
Tradeoff: Larger L1 Cache vs. Access Time
Configuration A: 32 KB L1, 1-cycle access, 5% miss rate
AMAT = 1 + (0.05 × 38) = 2.90 cycles
Configuration B: 64 KB L1, 2-cycle access, 3% miss rate
AMAT = 2 + (0.03 × 38) = 3.14 cycles
Configuration C: 128 KB L1, 3-cycle access, 2% miss rate
AMAT = 3 + (0.02 × 38) = 3.76 cycles
In this example, doubling the L1 size reduces miss rate but increases
access time. Configuration A (smaller, faster cache) actually provides
better average performance than the larger caches despite higher miss
rates. This illustrates why most processors use small, fast L1 caches.
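The three configurations can be compared with the same formula used throughout this section. This sketch assumes the 38-cycle L2 effective access time computed earlier; the names l1_amat and the configuration labels are invented here.

```python
L2_EFFECTIVE = 38   # L2 effective access time from the earlier calculation

def l1_amat(hit_time, miss_rate):
    return hit_time + miss_rate * L2_EFFECTIVE

configs = {
    "A (32 KB)":  (1, 0.05),
    "B (64 KB)":  (2, 0.03),
    "C (128 KB)": (3, 0.02),
}
for name, params in configs.items():
    print(name, l1_amat(*params))          # 2.9, 3.14, 3.76 cycles
best = min(configs, key=lambda n: l1_amat(*configs[n]))
print(best)                                # configuration A wins
```

Doubling the hit time costs a full cycle on every access, while halving the miss rate saves only a fraction of the (already amortized) penalty, which is why the small, fast cache comes out ahead.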
The interplay between cache levels also matters. If L2 is very large and fast, a higher L1 miss rate becomes more tolerable. Conversely, if L2 is slow or has a high miss rate, L1 performance becomes critical. System designers must consider the entire hierarchy when optimizing performance, balancing the characteristics of each level to achieve the best overall result for the target workload.
Semiconductor RAM Memories: SRAM and DRAM
Understanding caches requires understanding the underlying memory technologies. Random Access Memory comes in two primary forms: Static RAM (SRAM) and Dynamic RAM (DRAM). These technologies differ fundamentally in how they store information, their speed characteristics, their density, and their cost. The choice between SRAM and DRAM for a particular application involves weighing these competing factors.
Static RAM stores each bit using a bistable circuit typically composed of six transistors arranged as cross-coupled inverters. Once written, an SRAM cell maintains its state indefinitely as long as power is applied. No refresh is needed because the cell actively maintains its state through feedback. The six-transistor configuration provides two stable states corresponding to logic 0 and logic 1. Reading from SRAM is non-destructive; the cell retains its value after being read. Writing involves forcing the cell into the desired state by driving the bit lines.
SRAM Cell Structure (6-Transistor):
                   VDD
                    |
               +----+----+
               |         |
              Q1        Q3        (PMOS pull-ups)
               |         |
 BL ---Q5---  Q           Q'  ---Q6--- BL'
               |         |
              Q2        Q4        (NMOS pull-downs)
               |         |
               +----+----+
                    |
                   GND
Cross-coupled inverters: Q1/Q2 drive node Q, Q3/Q4 drive node Q',
and each inverter's output feeds the other's input.
Word Line (WL) gates the access transistors Q5 and Q6
Bit Lines (BL, BL') carry data in/out
Read Operation:
- WL activated, connects cell to bit lines
- Cell drives bit lines to stored value
- Sense amplifier detects voltage difference
- Cell retains value (non-destructive)
Write Operation:
- Bit lines driven to desired value
- WL activated
- Bit lines overpower cell, forcing new state
- WL deactivated, cell holds new value
Dynamic RAM stores each bit as charge on a tiny capacitor. A DRAM cell consists of just one transistor and one capacitor, making it much denser than SRAM. However, the capacitor leaks charge over time, typically losing its state within milliseconds. This requires periodic refresh operations where the memory controller reads each cell and writes the value back, restoring the full charge. Refresh adds complexity and consumes power but enables much higher density.
DRAM Cell Structure (1-Transistor, 1-Capacitor):
Word Line (WL)
|
Q1 (Access Transistor)
|
=== C (Storage Capacitor)
|
GND
Read Operation:
- Bit line precharged to VDD/2
- WL activated, connects capacitor to bit line
- Charge sharing occurs between C and bit line
- Sense amplifier detects small voltage change
- Cell must be refreshed after read (destructive)
Write Operation:
- Bit line driven high (logic 1) or low (logic 0)
- WL activated
- Capacitor charges or discharges through transistor
- WL deactivated, capacitor holds charge
Refresh Operation (must occur every few milliseconds):
- Each row read and written back sequentially
- Restores full charge on capacitors
- Prevents data loss from leakage
The access time characteristics differ significantly between SRAM and DRAM. SRAM access is straightforward: activate the word line, and the cell immediately drives the bit lines to the stored value. Sense amplifiers detect the voltage, and the read completes in a few nanoseconds. DRAM access is more complex because the charge sharing between the tiny capacitor and the bit line produces only a small voltage swing that must be amplified. Additionally, DRAM reads are destructive, requiring the value to be written back. A DRAM access might take 50 to 100 nanoseconds, roughly ten times slower than SRAM.
The density and cost differences are dramatic. An SRAM cell requires six transistors plus the area for routing word lines and bit lines. A DRAM cell requires one transistor and one capacitor, which can be stacked vertically to minimize area. This means DRAM achieves roughly six times the density of SRAM for the same silicon area. Since silicon area directly determines cost, DRAM costs significantly less per bit than SRAM. A modern DRAM chip might provide 8 gigabits in a package costing a few dollars, while an SRAM chip of similar capacity would be prohibitively expensive.
Technology Comparison:
SRAM DRAM
Cell Size: 6 transistors 1T + 1C (6× smaller)
Access Time: 1-10 ns 50-100 ns
Refresh: Not required Required (every 2-64 ms)
Read: Non-destructive Destructive
Power: Higher (always on) Lower (except refresh)
Cost/bit: High Low
Density: Low High (6× better)
Typical Use: Caches, registers Main memory
Memory Hierarchy Mapping:
+----------+--------+-------------+
| Level | Type | Technology |
+----------+--------+-------------+
| L1 Cache | SRAM | 6T cell |
| L2 Cache | SRAM | 6T cell |
| L3 Cache | SRAM | 6T cell |
| Main Mem | DRAM | 1T1C cell |
+----------+--------+-------------+
Modern DRAM has evolved to improve performance through various techniques. Synchronous DRAM (SDRAM) synchronizes its operations with a clock signal, allowing the memory controller to pipeline requests. Double Data Rate (DDR) SDRAM transfers data on both rising and falling clock edges, doubling throughput. DDR has evolved through multiple generations (DDR2, DDR3, DDR4, DDR5), each improving speed and reducing power consumption. Despite these improvements, DRAM remains fundamentally slower than SRAM due to the physics of capacitive storage and charge sensing.
Read-Only Memories
While RAM provides read-write storage, Read-Only Memory (ROM) and its derivatives serve a different purpose: storing data that must persist without power and, in some cases, cannot or should not be modified during normal operation. ROM technologies range from truly immutable mask-programmed ROM to electrically erasable and reprogrammable Flash memory. Each variant offers different tradeoffs between permanence, programmability, and cost.
Mask-programmed ROM is manufactured with its contents fixed during fabrication. The data pattern determines which connections exist in the memory array, physically encoding the information in the silicon. Once manufactured, the contents cannot be changed. Mask ROM is economical only for very large production volumes because the custom mask costs are amortized over millions of units. Typical applications include storing firmware in consumer electronics where the code never changes and production volumes justify the manufacturing setup costs.
Programmable ROM (PROM) arrives from the factory blank and can be programmed once by the user using a special device called a PROM programmer. Programming involves blowing tiny fuses or creating permanent connections using high voltage or current pulses. Once programmed, a PROM cannot be erased or reprogrammed. PROMs were historically popular for small-volume production or prototyping where mask ROM economics did not apply. Modern usage has largely shifted to erasable alternatives.
ROM Technology Evolution:
Mask ROM:
[Fabricated] → [Fixed Forever]
+ Low cost at high volume
+ Fast access
- Cannot change after fabrication
- High setup cost
PROM (Programmable ROM):
[Blank] → [Program Once] → [Fixed]
+ User programmable
+ No setup cost
- Cannot erase/reprogram
- Requires PROM programmer
EPROM (Erasable Programmable ROM):
[Blank] → [Program] → [Erase (UV)] → [Program] → ...
+ Reprogrammable (erase with UV light)
+ Retain data for years
- Slow erase (20-30 min UV exposure)
- Must remove from circuit to erase
- Requires quartz window (expensive)
EEPROM (Electrically Erasable PROM):
[Blank] → [Program] → [Erase (Electrical)] → [Program] → ...
+ Electrically erasable (in-circuit)
+ Byte-level erase
+ Fast erase (milliseconds)
- Limited erase cycles (~100K-1M)
- Higher cost per bit
Flash Memory:
[Blank] → [Program] → [Erase (Block)] → [Program] → ...
+ Electrically erasable
+ High density
+ Low cost per bit
- Block erase only (not byte-level)
- Limited erase cycles (~10K-100K)
+ Common in SSDs, USB drives, memory cards
Erasable Programmable ROM (EPROM) can be programmed electrically and erased using ultraviolet light. EPROMs feature a quartz window through which UV light floods the chip during erasure. The UV light provides energy to discharge the floating gates where charge storage represents the programmed state. Erasure takes 20 to 30 minutes of UV exposure and erases the entire chip simultaneously. After erasure, the EPROM can be reprogrammed. EPROMs were workhorses of embedded system development, allowing engineers to iterate on firmware without manufacturing new chips.
Electrically Erasable Programmable ROM (EEPROM) improves on EPROM by enabling electrical erasure without removing the chip from the circuit. EEPROM can erase and reprogram individual bytes, making it suitable for storing configuration data that changes occasionally. The electrical erase mechanism uses quantum tunneling to remove charge from floating gates. EEPROMs typically endure 100,000 to 1 million erase cycles before wearing out. They are commonly used for storing calibration data, serial numbers, and small amounts of persistent configuration in embedded systems.
Flash memory represents the dominant modern non-volatile storage technology. Flash combines the density and cost advantages of DRAM-like cell structures with the non-volatility of EEPROM. Unlike EEPROM, Flash erases in blocks rather than individual bytes, sacrificing some flexibility for improved density and cost. There are two main Flash types: NOR Flash, which provides random access like traditional ROM and is used for code storage, and NAND Flash, which is optimized for sequential access and high density, making it ideal for solid-state drives and memory cards.
Flash Memory Architecture (NAND):
Block Structure:
+------------------+
| Block 0 | Each block: 64-256 pages
| Page 0 (4 KB) |
| Page 1 | Erase granularity: Block
| ... | Write granularity: Page
| Page 127 | Read granularity: Page
+------------------+
| Block 1 |
| ... |
+------------------+
| Block N |
+------------------+
Operations:
- Read: Any page, ~25 µs
- Write: Entire page, ~200 µs (must be previously erased)
- Erase: Entire block, ~2 ms (sets all bits to 1)
Limitations:
- Must erase before writing
- Erase cycles limited (~10K-100K)
- Wear leveling required to distribute writes
- Error correction needed (bit errors increase with wear)
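The erase-before-write constraint is the defining behavioral quirk of NAND Flash. The toy model below (illustrative Python; the class name and 128-page block size are assumptions of this sketch) captures the asymmetry: writes are page-granular and only allowed on erased pages, while erase wipes the whole block and accumulates wear.

```python
class NandBlock:
    """Toy model of one NAND block: page-granular reads and writes,
    block-granular erase, erase-before-write."""
    PAGES = 128

    def __init__(self):
        self.pages = [None] * self.PAGES   # None = erased (all bits 1)
        self.erase_count = 0               # wear accumulates per erase

    def write(self, page, data):
        if self.pages[page] is not None:
            # In-place overwrite is impossible: charge can only be
            # removed by erasing the entire block.
            raise ValueError("page not erased: erase the block first")
        self.pages[page] = data

    def erase(self):
        self.pages = [None] * self.PAGES   # block is the erase granularity
        self.erase_count += 1

    def read(self, page):
        return self.pages[page]
```

A real Flash translation layer avoids the expensive erase by writing updated data to a fresh page elsewhere and remapping, which is precisely why wear leveling and garbage collection exist.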
Memory Module Design and Organization
Main memory systems consist of multiple memory chips organized into modules and arrays to achieve the desired capacity and performance. Memory chips provide only a portion of the total system memory, so multiple chips must be combined. The organization of these chips determines the effective word width, total capacity, and access patterns. Understanding memory module design clarifies how the simple DRAM cells discussed earlier scale to gigabytes of system memory.
A memory chip has a specified organization, typically expressed as the number of words times the number of bits per word. For example, a "512K × 8" chip provides 524,288 addressable locations, each 8 bits wide. To build a memory system with a different word width or larger capacity, we combine multiple chips. If the processor requires 32-bit words but we have 8-bit chips, we use four chips in parallel, each providing one byte of the 32-bit word. Address lines connect to all four chips in parallel, data lines connect to different bytes of the data bus, and all chips are selected simultaneously.
Memory Chip Organization Example:
Single Chip (512K × 8):
A0-A18 (19 address lines: 2^19 = 512K)
|||||
+-------------+
| 512K×8 |
| DRAM Chip |
+-------------+
||||||||
D0-D7 (8 data lines)
Building 512K × 32 using four 512K × 8 chips:
A0-A18 ────┬────┬────┬────┐
↓ ↓ ↓ ↓
+---+ +---+ +---+ +---+
|512| |512| |512| |512|
|K×8| |K×8| |K×8| |K×8|
+---+ +---+ +---+ +---+
↓ ↓ ↓ ↓
D0-D7 D8-D15 D16-D23 D24-D31
\_________ _________/
\/
32-bit data word
All chips receive same address
Each chip provides one byte of the 32-bit word
Chip enables all active simultaneously
To increase capacity beyond a single chip's addressing range, we use memory banks with chip select signals. Consider building a 2M × 32 memory system using 512K × 8 chips. We need four chips in parallel for the 32-bit width, and we need four such sets to reach 2M words. The higher-order address bits decode which bank (set of four chips) to activate. Only one bank is active at a time, with the lower address bits selecting the location within that bank.
Building 2M × 32 from 512K × 8 chips (requires 16 chips total):
Address breakdown (21 bits for 2M):
+--------+------------------+
| A19-20 | A0-A18 |
| 2 bits | 19 bits |
+--------+------------------+
| |
| └─→ Chip address (512K locations)
└─→ Bank select (4 banks)
Memory Organization:
A0-A18
↓
Bank 0: [512K×8] [512K×8] [512K×8] [512K×8] ← CS0 (A19-20 = 00)
D0-D7 D8-D15 D16-D23 D24-D31
Bank 1: [512K×8] [512K×8] [512K×8] [512K×8] ← CS1 (A19-20 = 01)
D0-D7 D8-D15 D16-D23 D24-D31
Bank 2: [512K×8] [512K×8] [512K×8] [512K×8] ← CS2 (A19-20 = 10)
D0-D7 D8-D15 D16-D23 D24-D31
Bank 3: [512K×8] [512K×8] [512K×8] [512K×8] ← CS3 (A19-20 = 11)
D0-D7 D8-D15 D16-D23 D24-D31
2-to-4 Decoder:
A19-A20 → Activates one of CS0, CS1, CS2, CS3
Only selected bank responds to addresses A0-A18
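The decoder's job is a simple bit split. This sketch shows the address breakdown for the 2M x 32 system above (the function name decode is invented here): the top two bits select the chip-select line, the low nineteen bits address within the bank.

```python
def decode(addr):
    """Split a 21-bit word address: A19-A20 select the bank
    (chip selects CS0..CS3), A0-A18 the location within it."""
    assert 0 <= addr < 2**21           # 2M addressable words
    bank = addr >> 19                  # top 2 bits
    offset = addr & (2**19 - 1)        # low 19 bits -> 512K locations
    return bank, offset

print(decode(0x000000))   # (0, 0): first location of bank 0
print(decode(0x080000))   # (1, 0): first location of bank 1
```

Note that consecutive addresses stay within one bank until its 512K range is exhausted; the interleaving scheme discussed next deliberately inverts this mapping.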
Modern memory modules like DIMMs (Dual Inline Memory Modules) package multiple chips on a small circuit board. A typical DDR4 DIMM might contain eight or sixteen DRAM chips, each providing a portion of the total data width and capacity. The module includes addressing and control logic, buffering, and standardized connectors. Memory controllers on the motherboard interact with DIMMs through standardized interfaces, abstracting the details of the individual chips.
The internal organization of DRAM chips uses a two-dimensional array structure to minimize the number of address pins required. Instead of providing all address bits simultaneously, DRAM uses row and column multiplexing. The row address is latched first (RAS - Row Address Strobe), selecting one row of the array. Then the column address is latched (CAS - Column Address Strobe), selecting the specific bit or word within that row. This halves the number of address pins needed, reducing package cost and complexity.
DRAM Internal Organization (Example: 4M × 1):
2048 rows × 2048 columns = 4M bits
Step 1: Row Address (A0-A10, 11 bits → 2048 rows)
┌───────────────────────────┐
│ Row │
│ Decoder │
└────┬──────────────────────┘
├─ Row 0 ─→ [2048 columns]
├─ Row 1 ─→ [2048 columns]
├─ Row 2 ─→ [2048 columns]
│ ...
└─ Row 2047 → [2048 columns]
Step 2: Column Address (A0-A10, 11 bits → 2048 columns)
Selected row → Column Mux → Data Out
Timing Diagram:
___ ___
RAS: __| |__________________| |__
___ ___
CAS: _______| |__________| |_____
Address: [Row Addr][Col Addr]
↓ ↓
Latch Latch
Total pins: 11 address (multiplexed) vs. 22 (non-multiplexed)
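The row/column multiplexing for the 4M × 1 example can be sketched as follows (the function name is mine; the bit widths come from the 2048 × 2048 array above). The 22-bit internal address is presented in two halves over the shared A0-A10 pins, row first on RAS, then column on CAS.

```python
def split_dram_address(addr: int) -> tuple[int, int]:
    """Split a 22-bit bit-address for the 4M x 1 DRAM above into the
    11-bit row and 11-bit column driven over the shared A0-A10 pins."""
    row = (addr >> 11) & 0x7FF   # upper 11 bits, latched on RAS
    col = addr & 0x7FF           # lower 11 bits, latched on CAS
    return row, col

# 0x2003FF = row 1024, column 1023 (last column of that row)
assert split_dram_address(0x2003FF) == (1024, 1023)
```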
Memory Interleaving
Memory interleaving is a technique to increase memory bandwidth by distributing consecutive addresses across multiple memory modules or banks that can operate concurrently. While a single memory module may have a significant access latency, multiple modules can be accessed in a pipelined fashion, dramatically increasing throughput. Interleaving exploits the spatial locality of memory accesses: programs often access consecutive or nearby memory locations, so successive requests naturally target different banks and can proceed concurrently.
The fundamental idea behind interleaving is simple. Instead of placing consecutive addresses in the same memory bank, we distribute them across multiple banks. When the processor issues a burst of sequential memory requests, each request goes to a different bank. While one bank is busy accessing its data, other banks can begin processing their requests. By the time the first bank completes its access, we can start another request to it, creating a pipeline of memory operations. The effective bandwidth approaches the number of banks multiplied by the bandwidth of a single bank.
Consider a memory system without interleaving where all addresses reside in a single bank. If each memory access requires 100 nanoseconds, the system can complete at most 10 million accesses per second regardless of how many requests are pending. Now consider a system with four interleaved banks and a bus that can issue one request every 25 ns. The first access goes to bank 0 and requires 100 ns, but the second access (to bank 1) starts just 25 ns later, the third (to bank 2) 25 ns after that, and the fourth (to bank 3) after another 25 ns. After 100 ns, the first access completes and bank 0 can accept its next request. The effective access time becomes 25 ns per access instead of 100 ns, quadrupling the bandwidth.
Memory Access Timing Without Interleaving (Single Bank):
Time: 0 100 200 300 400 (nanoseconds)
| | | | |
Bank 0: [Acc 0][Acc 1][Acc 2][Acc 3]
Each access: 100 ns
Throughput: 1 access / 100 ns = 10M accesses/sec
Memory Access Timing With 4-Way Interleaving:
Time: 0 25 50 75 100 125 150 175 200 (nanoseconds)
| | | | | | | | |
Bank 0: [Acc 0 ][Acc 4 ]...
Bank 1: [Acc 1 ][Acc 5 ]...
Bank 2: [Acc 2 ][Acc 6 ]...
Bank 3: [Acc 3 ][Acc 7 ]...
Each access: 100 ns (same latency)
But one completes every 25 ns (pipelined)
Throughput: 1 access / 25 ns = 40M accesses/sec (4× improvement)
There are two primary interleaving schemes: low-order interleaving and high-order interleaving. Low-order interleaving uses the lowest address bits to select the bank, distributing consecutive addresses across banks. High-order interleaving uses the highest address bits to select the bank, keeping consecutive addresses in the same bank. The choice between these schemes depends on the access patterns of the workload.
Low-order interleaving maximizes bandwidth for sequential access patterns. When a program reads an array or executes consecutive instructions, each access goes to a different bank automatically. This is the most common interleaving scheme because sequential access is prevalent in typical programs. The bank select bits are the lowest bits of the address that are not used for byte selection within a word. For a system with four banks and 4-byte words, bits [3:2] of the address select the bank, ensuring that consecutive words go to different banks.
Low-Order Interleaving (4 banks, 4-byte words):
Address breakdown:
+------------------+-------+-------+
|   Higher bits    | Bank  |  BO   |
|  (word in bank)  | [3:2] | [1:0] |
+------------------+-------+-------+
↓
Selects bank
Address Distribution:
Address Bank Word within Bank
0x0000 → Bank 0 Word 0
0x0004 → Bank 1 Word 0
0x0008 → Bank 2 Word 0
0x000C → Bank 3 Word 0
0x0010 → Bank 0 Word 1
0x0014 → Bank 1 Word 1
0x0018 → Bank 2 Word 1
0x001C → Bank 3 Word 1
Sequential access pattern:
0x0000, 0x0004, 0x0008, 0x000C, 0x0010, ...
Bank0 Bank1 Bank2 Bank3 Bank0 ...
Perfect distribution → Maximum bandwidth for sequential access
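The mapping in the table above reduces to simple modular arithmetic; as a sketch (function name is mine), with 4-byte words the word number modulo the bank count selects the bank:

```python
def low_order_bank(addr: int, banks: int = 4, word_bytes: int = 4):
    """Low-order interleaving: the bits just above the byte offset
    select the bank, so consecutive words rotate through the banks."""
    word = addr // word_bytes
    return word % banks, word // banks   # (bank, word within bank)

# Matches the address-distribution table above.
assert [low_order_bank(a) for a in (0x0, 0x4, 0x8, 0xC, 0x10)] == \
       [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1)]
```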
High-order interleaving uses the highest address bits to select the bank, grouping consecutive addresses together. This scheme benefits workloads that access large contiguous blocks from one area of memory before moving to another area. However, for sequential access across the entire address space, high-order interleaving provides no concurrency benefit because all consecutive accesses go to the same bank until that bank's address range is exhausted.
High-Order Interleaving (4 banks, 1 MB per bank, 4 MB total):
Address breakdown (assuming 22-bit addresses for 4 MB):
+-------+------------------+-------+
| Bank  | Address in Bank  |  BO   |
|[21:20]|      [19:2]      | [1:0] |
+-------+------------------+-------+
↓
Selects bank
Address Distribution:
0x000000 - 0x0FFFFC → Bank 0 (1 MB)
0x100000 - 0x1FFFFC → Bank 1 (1 MB)
0x200000 - 0x2FFFFC → Bank 2 (1 MB)
0x300000 - 0x3FFFFC → Bank 3 (1 MB)
Sequential access within Bank 0:
0x000000, 0x000004, 0x000008, ...
Bank0 Bank0 Bank0 ...
No interleaving benefit → Sequential bottleneck
But accessing four separate regions concurrently works well:
Thread 1: Bank 0 (0x000000-0x0FFFFC)
Thread 2: Bank 1 (0x100000-0x1FFFFC)
Thread 3: Bank 2 (0x200000-0x2FFFFC)
Thread 4: Bank 3 (0x300000-0x3FFFFC)
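For contrast, the high-order scheme is a simple division by the bank size (again a sketch; the function name is mine, the 1 MB bank size comes from the example above):

```python
def high_order_bank(addr: int, bank_bytes: int = 1 << 20):
    """High-order interleaving: the top address bits choose the bank,
    so each bank owns one contiguous 1 MB region."""
    return addr // bank_bytes, addr % bank_bytes  # (bank, offset)

assert high_order_bank(0x000000) == (0, 0)
assert high_order_bank(0x100000) == (1, 0)          # first byte of Bank 1
assert high_order_bank(0x3FFFFC) == (3, 0xFFFFC)    # last word of Bank 3
```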
Interleaving introduces dependencies between banks that can cause conflicts. If two consecutive accesses target the same bank, the second access must wait for the first to complete before beginning. This occurs naturally with low-order interleaving when the stride of memory accesses equals a multiple of the number of banks. For example, in a 4-bank system with low-order interleaving, accessing every fourth word (stride 4) causes all accesses to hit the same bank, eliminating the interleaving benefit.
Bank Conflicts in Low-Order Interleaving:
4-bank system, 4-byte words, stride-4 access pattern:
Access addresses: 0x0000, 0x0010, 0x0020, 0x0030, ...
↓ ↓ ↓ ↓
Bank bits [3:2]: 00 00 00 00
↓ ↓ ↓ ↓
All hit Bank 0: [Acc 0][Acc 1][Acc 2][Acc 3]
No concurrency! Same performance as non-interleaved memory.
This happens when: stride (in words) = k × number_of_banks
For this example: stride = 4 words, banks = 4, k = 1
Solution: Use prime number of banks or carefully choose stride
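The stride condition can be checked directly. This sketch (function name mine) enumerates which banks a strided word-access pattern touches under low-order interleaving:

```python
def banks_touched(stride_words: int, banks: int = 4, n: int = 16):
    """Set of banks hit by a strided access pattern under low-order
    interleaving. A stride that is a multiple of the bank count lands
    every access in one bank and serializes the sequence."""
    return {(i * stride_words) % banks for i in range(n)}

assert banks_touched(stride_words=1) == {0, 1, 2, 3}   # sequential: all banks
assert banks_touched(stride_words=4) == {0}            # stride 4: one bank only
assert banks_touched(stride_words=3) == {0, 1, 2, 3}   # odd stride: still spreads
```

This is why a stride that is coprime to the bank count (and, in general, a prime bank count) preserves the interleaving benefit.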
Modern memory systems often use more sophisticated interleaving schemes. DRAM modules themselves internally interleave across multiple chips on the module. Memory controllers may support multiple channels, each channel effectively acting as an independent interleaved bank. Processors with multiple memory controllers can interleave across controllers. These nested levels of interleaving provide massive aggregate bandwidth, which is essential for feeding the data-hungry cores of modern processors.
Virtual Memory: Abstraction and Protection
Virtual memory is one of the most important abstractions in computer architecture, providing each program with the illusion of a large, private address space while efficiently sharing the physical memory among multiple programs. Virtual memory decouples the addresses that programs use (virtual addresses) from the actual locations in physical memory (physical addresses). This abstraction enables powerful operating system features: memory protection between processes, efficient sharing of memory, simplified memory allocation, and the ability to run programs larger than physical memory by storing portions on disk.
The fundamental concept is address translation. When a program accesses memory using a virtual address, the hardware translates that address to a physical address where the data actually resides. This translation is transparent to the program; the program believes it is directly accessing memory at the virtual address. The translation mechanism checks permissions and can trigger page faults if the requested data is not currently in physical memory, allowing the operating system to load the data from disk and retry the access.
Virtual memory divides the address space into fixed-size blocks called pages. Typical page sizes are 4 KB, though larger pages (2 MB, 1 GB) are also common for specific applications. The virtual address space consists of virtual pages, numbered from 0 to the maximum virtual page number. Physical memory consists of page frames, which are physical page-sized blocks numbered from 0 to the maximum physical page number. The page table, maintained by the operating system, maps virtual page numbers to physical page frame numbers.
Virtual Memory Concepts:
Virtual Address Space (per process):
+------------------+ Virtual Page 0
| 4 KB |
+------------------+ Virtual Page 1
| 4 KB |
+------------------+ Virtual Page 2
| 4 KB |
+------------------+
| ... |
+------------------+ Virtual Page N
| 4 KB |
+------------------+
Physical Memory (shared):
+------------------+ Physical Frame 0
| 4 KB |
+------------------+ Physical Frame 1
| 4 KB |
+------------------+ Physical Frame 2
| 4 KB |
+------------------+
| ... |
+------------------+ Physical Frame M
| 4 KB |
+------------------+
Note: Virtual pages >> Physical frames
Many virtual pages may not be in physical memory (stored on disk)
The address translation process begins with splitting the virtual address into two parts: the virtual page number (VPN) and the page offset. For a 4 KB page, the offset requires 12 bits (2^12 = 4096 bytes per page), so the lower 12 bits of the address form the offset. The remaining upper bits form the virtual page number. The page table is indexed by the VPN to retrieve the physical page number (PPN), also called the physical frame number (PFN). The physical address is then constructed by concatenating the PPN with the unchanged page offset.
Virtual to Physical Address Translation:
Virtual Address (32-bit, 4 KB pages):
+--------------------+------------------+
| Virtual Page Number| Page Offset |
| (VPN) | (PO) |
| 20 bits | 12 bits |
+--------------------+------------------+
31 12 11 0
↓
Look up in Page Table
↓
+-------------+
| Page Table |
| VPN → PPN |
+-------------+
↓
Physical Page Number (PPN)
↓
Physical Address:
+--------------------+------------------+
| Physical Page Number| Page Offset |
| (PPN) | (PO) |
| variable bits | 12 bits |
+--------------------+------------------+
Note: Page offset is NOT translated (direct copy from virtual to physical)
Only the page number is translated via the page table
Let us trace a concrete example of address translation. Assume a system with 32-bit virtual addresses, 30-bit physical addresses, and 4 KB pages. The page offset requires 12 bits, leaving 20 bits for the VPN and 18 bits for the PPN. Consider virtual address 0x00403A04. In binary, this is 00000000010000000011101000000100. The VPN is the upper 20 bits: 0x00403 (decimal 1027). The offset is the lower 12 bits: 0xA04 (decimal 2564).
Address Translation Example:
Virtual Address: 0x00403A04
Binary: 0000 0000 0100 0000 0011 1010 0000 0100
Split into VPN and Offset:
+-------------------------+--------------+
| VPN | Offset |
| 0000 0000 0100 0000 0011| 1010 0000 0100|
| 0x00403 | 0xA04 |
| (decimal 1027) | (decimal 2564)|
+-------------------------+--------------+
Look up VPN 1027 in page table:
Page Table[1027] = 0x2C3A8 (PPN = decimal 181,160)
Construct Physical Address:
+-------------------------+--------------+
| PPN | Offset |
| ...10 1100 0011 1010 1000| 1010 0000 0100|
+-------------------------+--------------+
(0x2C3A8) (0xA04)
Physical Address: 0x2C3A8A04
Verification:
Virtual: Page 1027, byte 2564 within page
Physical: Frame 181160, byte 2564 within frame
Offset unchanged ✓
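The translation just traced can be replayed in code. The sketch below uses a plain dictionary as a stand-in for the page table (the function name and dict representation are mine); real hardware would raise a fault on a missing entry rather than a KeyError.

```python
PAGE_SHIFT = 12  # 4 KB pages

def translate(vaddr: int, page_table: dict[int, int]) -> int:
    """Single-level translation: split VPN/offset, map VPN -> PPN via
    the table, and reattach the untouched 12-bit offset."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & 0xFFF
    ppn = page_table[vpn]             # toy stand-in for the PTE lookup
    return (ppn << PAGE_SHIFT) | offset

# The worked example above: VPN 0x00403 maps to PPN 0x2C3A8.
assert translate(0x00403A04, {0x00403: 0x2C3A8}) == 0x2C3A8A04
```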
Each page table entry contains more than just the physical page number. Control bits specify whether the page is valid (present in physical memory), whether it has been modified (dirty bit), whether it has been accessed recently (reference bit), and what permissions apply (read, write, execute). When a program tries to access a page that is not valid, a page fault exception occurs. The operating system's page fault handler then decides what to do: perhaps the page must be loaded from disk, or perhaps the access is illegal and the program should be terminated.
Page Table Entry Structure:
Typical PTE format (assuming 32-bit PTE for clarity):
+---+---+---+---+---+-----+------------------------+
| V | D | R | W | X | OS  |     Physical Page#     |
+---+---+---+---+---+-----+------------------------+
 31  30  29  28  27 26-25  24                     0
V = Valid (1 = page present in memory, 0 = page fault)
D = Dirty (1 = page modified since loaded)
R = Referenced (1 = page accessed recently)
W = Writable (1 = writes allowed, 0 = read-only)
X = Executable (1 = can execute code from this page)
OS = Operating system bits (various uses)
Physical Page# = Physical frame number when V=1
Example PTEs:
Valid, clean, read-write page:
+---+---+---+---+---+---+-----------+
| 1 | 0 | 1 | 1 | 0 | 00| 0x01234 |
+---+---+---+---+---+---+-----------+
Maps to physical frame 0x01234, read-write, not dirty
Invalid page (on disk or never allocated):
+---+---+---+---+---+---+-----------+
| 0 | x | x | x | x | xx| Disk Loc  |
+---+---+---+---+---+---+-----------+
Access → Page Fault → OS loads from disk
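Decoding such a PTE is straightforward bit extraction. The sketch below assumes the layout above with a two-bit OS field at bits 26-25 and the physical page number in the low 25 bits (field positions are illustrative, not any particular architecture's format):

```python
def decode_pte(pte: int) -> dict:
    """Decode the illustrative 32-bit PTE layout above: V=bit 31,
    D=30, R=29, W=28, X=27, OS=bits 26-25, PPN=bits 24-0."""
    return {
        "valid":      (pte >> 31) & 1,
        "dirty":      (pte >> 30) & 1,
        "referenced": (pte >> 29) & 1,
        "writable":   (pte >> 28) & 1,
        "executable": (pte >> 27) & 1,
        "os_bits":    (pte >> 25) & 0b11,
        "ppn":        pte & 0x1FFFFFF,
    }

# Valid, referenced, writable, clean page mapped to frame 0x01234:
pte = (1 << 31) | (1 << 29) | (1 << 28) | 0x01234
assert decode_pte(pte)["valid"] == 1
assert decode_pte(pte)["ppn"] == 0x01234
assert decode_pte(pte)["dirty"] == 0
```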
Paging and Page Tables in Detail
Page tables can become very large. A 32-bit address space with 4 KB pages requires 2^20 page table entries (1,048,576 entries). If each PTE is 4 bytes, the page table consumes 4 MB of memory per process. On a 64-bit system with 4 KB pages, a single-level page table would require 2^52 entries (assuming 64-bit virtual addresses with 12-bit offset), which is completely impractical. Several techniques address this problem: multi-level page tables, inverted page tables, and Translation Lookaside Buffers (TLBs).
Multi-level page tables solve the space problem by organizing the page table hierarchically. Instead of one large flat array, we create a tree of smaller page tables. For a two-level scheme, the virtual page number is split into two parts: a page directory index and a page table index. The page directory is a single page containing pointers to second-level page tables. Each second-level page table contains the actual PTEs mapping to physical frames. Crucially, second-level page tables are allocated only for regions of the virtual address space actually in use.
Two-Level Page Table:
Virtual Address (32-bit, 4 KB pages):
+-----------+-----------+--------------+
| Dir | Table | Offset |
| Index | Index | |
| 10 bits | 10 bits | 12 bits |
+-----------+-----------+--------------+
31 22 21 12 11 0
Page Directory (1 page = 4 KB = 1024 PTEs):
+-----+
|PTE 0| → Page Table 0 (if allocated)
+-----+
|PTE 1| → Page Table 1 (if allocated)
+-----+
| ... |
+-----+
|1023 | → Page Table 1023 (if allocated)
+-----+
Each Page Table (1 page = 4 KB = 1024 PTEs):
+-----+
|PTE 0| → Physical Frame
+-----+
|PTE 1| → Physical Frame
+-----+
| ... |
+-----+
|1023 | → Physical Frame
+-----+
Address Translation Steps:
1. Extract Dir Index from VA
2. Read Page Directory[Dir Index] → get Page Table address
3. If Page Table not present → Page Fault
4. Extract Table Index from VA
5. Read Page Table[Table Index] → get PPN
6. If PTE invalid → Page Fault
7. Combine PPN with Offset → Physical Address
Let us trace an address translation through a two-level page table. Consider virtual address 0x003FF004 with the two-level structure described above. The directory index is bits [31:22]: 0x000 (decimal 0). The table index is bits [21:12]: 0x3FF (decimal 1023). The offset is bits [11:0]: 0x004. We first read the page directory at index 0, which gives us the address of page table 0. Then we read page table 0 at index 1023, which gives us the PPN. Finally, we concatenate the PPN with the offset.
Two-Level Translation Example:
Virtual Address: 0x003FF004
Binary: 00 0000 0000 | 11 1111 1111 | 0000 0000 0100
Parse address:
Dir Index = bits [31:22] = 0x000 (0)
Table Index = bits [21:12] = 0x3FF (1023)
Offset = bits [11:0] = 0x004 (4)
Step 1: Access Page Directory
Page Directory base = 0x10000000 (from CPU register)
PDE address = 0x10000000 + (Dir Index × 4)
= 0x10000000 + 0 = 0x10000000
Read PDE[0] → 0x20000001 (V=1, Page Table at 0x20000000)
Step 2: Access Page Table
Page Table base = 0x20000000 (from PDE)
PTE address = 0x20000000 + (Table Index × 4)
= 0x20000000 + (1023 × 4)
= 0x20000000 + 4092 = 0x20000FFC
Read PTE[1023] → 0xABCDE001 (V=1, PPN=0xABCDE)
Step 3: Construct Physical Address
Physical Address = (PPN << 12) | Offset
= (0xABCDE << 12) | 0x004
= 0xABCDE004
Result: VA 0x003FF004 → PA 0xABCDE004
Memory Accesses Required: 3 total
1. Page Directory Entry
2. Page Table Entry
3. Actual Data
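The same walk can be replayed against a toy memory. The sketch below (function names and the dict-as-memory model are mine) follows the example's convention of storing the valid flag in the entry's low bit, and reproduces the numbers traced above:

```python
def walk_two_level(vaddr: int, mem: dict[int, int], pdbr: int) -> int:
    """Two-level page table walk over a toy memory `mem` mapping
    addresses to 32-bit entries. `pdbr` is the page directory base."""
    d_idx = (vaddr >> 22) & 0x3FF        # bits [31:22]
    t_idx = (vaddr >> 12) & 0x3FF        # bits [21:12]
    offset = vaddr & 0xFFF               # bits [11:0]
    pde = mem[pdbr + d_idx * 4]          # memory access 1: directory
    assert pde & 1, "page table not present -> page fault"
    pte = mem[(pde & ~0xFFF) + t_idx * 4]  # memory access 2: table
    assert pte & 1, "PTE invalid -> page fault"
    return (pte & ~0xFFF) | offset       # memory access 3 fetches data

# The example above: directory at 0x10000000, table at 0x20000000,
# PTE[1023] maps the page to frame 0xABCDE.
mem = {0x10000000: 0x20000001, 0x20000FFC: 0xABCDE001}
assert walk_two_level(0x003FF004, mem, 0x10000000) == 0xABCDE004
```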
Modern 64-bit systems use three, four, or even five levels of page tables to manage the vast virtual address space without consuming excessive memory. Each level adds another memory access to the translation process, but caching the translation in a TLB (discussed next) mitigates this overhead. The key advantage of multi-level tables is that huge portions of the virtual address space do not require any page table memory because the corresponding upper-level entries can simply be marked invalid.
The Translation Lookaside Buffer (TLB) is a small, fast cache dedicated to storing recent address translations. The TLB sits between the processor and the main page table in memory. When a memory access occurs, the TLB is checked first. If the translation is found (TLB hit), the physical address is available immediately without accessing the page table in memory. If the translation is not found (TLB miss), the page table walk occurs, and the resulting translation is inserted into the TLB for future use.
TLB Structure and Operation:
TLB (fully associative or set-associative, typically 64-512 entries):
+--------+--------+-------+-------+
| VPN | PPN | V D |R W X|
+--------+--------+-------+-------+
| 0x00403| 0x2C3A8| 1 0 |1 1 0|
+--------+--------+-------+-------+
| 0x00C00| 0x15678| 1 1 |1 1 1|
+--------+--------+-------+-------+
| ... | ... | ... | ... |
+--------+--------+-------+-------+
Memory Access with TLB:
Virtual Address
↓
+-------+
| TLB |
+-------+
↓
Hit?
/ \
Yes No (TLB Miss)
↓ ↓
| Page Table Walk
| (1, 2, or more memory accesses)
| ↓
| Insert into TLB
| ↓
└─────┘
↓
Physical Address
TLB Hit: 0 extra memory accesses (< 1 cycle)
TLB Miss: 1-4 extra memory accesses (10-40 cycles)
Typical TLB hit rates: 95-99%
Effective address translation overhead ≈ 1-2 cycles
TLB reach refers to the amount of memory that can be referenced without a TLB miss, calculated as the number of TLB entries multiplied by the page size. A TLB with 128 entries and 4 KB pages has a reach of 512 KB. If a program's working set exceeds the TLB reach, frequent TLB misses will occur, degrading performance. Using larger pages increases TLB reach; a 2 MB page size increases the reach to 256 MB with the same 128 entries. Many systems support multiple page sizes simultaneously to optimize TLB reach for different access patterns.
TLB Reach Calculation:
Configuration 1: 128 entries, 4 KB pages
TLB reach = 128 × 4 KB = 512 KB
Configuration 2: 128 entries, 2 MB pages
TLB reach = 128 × 2 MB = 256 MB
Impact on performance:
Working set = 100 KB → Both configs: ~100% hit rate
Working set = 1 MB → Config 1: ~50% hit rate
→ Config 2: ~100% hit rate
Working set = 512 MB → Config 1: ~0.1% hit rate
→ Config 2: ~50% hit rate
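The reach arithmetic itself is a one-liner (function name mine), worth writing down only because it makes the page-size tradeoff concrete:

```python
def tlb_reach(entries: int, page_bytes: int) -> int:
    """TLB reach = entries x page size: memory addressable without a
    TLB miss, assuming every entry maps a distinct page."""
    return entries * page_bytes

assert tlb_reach(128, 4 * 1024) == 512 * 1024                  # 512 KB
assert tlb_reach(128, 2 * 1024 * 1024) == 256 * 1024 * 1024    # 256 MB
```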
Virtual Memory Performance Considerations
Virtual memory performance depends on several factors: the TLB hit rate, the number of page table levels, whether pages are in memory or on disk, and the frequency of page faults. When operating in the common case with high TLB hit rates and all pages in memory, virtual memory overhead is minimal. However, pathological cases with frequent TLB misses or page faults can cripple performance.
The effective memory access time with virtual memory builds on our earlier cache analysis but adds the overhead of address translation. For a TLB hit, translation is nearly free (less than one cycle). For a TLB miss, we must walk the page table, requiring one memory access per level of the page table hierarchy. If the page table entries are themselves cached, these accesses may be fast, but a complete miss to main memory is expensive.
Effective Access Time with Virtual Memory:
Assumptions:
TLB hit rate: 98%
TLB hit time: 0.5 cycles (negligible)
Page table levels: 2
Page table in cache: 90% hit rate
Cache hit time: 3 cycles
Memory access time: 100 cycles
All pages in physical memory (no disk access)
TLB Hit Path (98% of accesses):
Translation: 0.5 cycles
Data access: AMAT from cache hierarchy (assume 3 cycles)
Total: 3.5 cycles
TLB Miss Path (2% of accesses):
Translation: Walk 2-level page table
Level 1: 0.9 × 3 + 0.1 × 100 = 12.7 cycles
Level 2: 0.9 × 3 + 0.1 × 100 = 12.7 cycles
Total translation: 25.4 cycles
Data access: 3 cycles
Total: 28.4 cycles
Effective Access Time:
EAT = 0.98 × 3.5 + 0.02 × 28.4
= 3.43 + 0.568
= 3.998 ≈ 4 cycles
With high TLB hit rate, overhead is modest (~14% increase over cache alone)
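The calculation above is a weighted average of the two paths; as a sketch (defaults are the example's assumptions, not universal constants):

```python
def effective_access_time(tlb_hit_rate: float = 0.98,
                          hit_path_cycles: float = 3.5,
                          miss_path_cycles: float = 28.4) -> float:
    """Weighted average of the TLB-hit path (translation + cached data
    access) and the TLB-miss path (page table walk + data access)."""
    return (tlb_hit_rate * hit_path_cycles
            + (1 - tlb_hit_rate) * miss_path_cycles)

# Reproduces the ~4-cycle result worked out above.
assert abs(effective_access_time() - 3.998) < 1e-9
```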
Page faults introduce far more severe overhead. When a page fault occurs, the operating system must locate the page on disk, select a victim page to evict from physical memory (if memory is full), write the victim page to disk if it is dirty, read the required page from disk, and update page tables. Disk access times are measured in milliseconds, corresponding to millions of processor cycles. A single page fault can cost as much as a million normal memory accesses.
Page Fault Overhead:
Disk access time: 10 milliseconds (typical hard disk; SSDs are far faster, but still costly)
Processor cycle time: 0.5 nanoseconds (2 GHz CPU)
Page fault cost: 10 ms / 0.5 ns = 20,000,000 cycles
Normal memory access: ~200 cycles (cache miss to DRAM)
Page fault cost: ~100,000× normal access
If page fault rate = 0.01% (1 fault per 10,000 accesses):
EAT = 9,999 × 200 + 1 × 20,000,000
= 1,999,800 + 20,000,000
= 21,999,800 cycles / 10,000 accesses
= 2,200 cycles per access
(11× slowdown from 0.01% fault rate!)
This demonstrates why page fault rates must be kept extremely low
Target: < 0.0001% (< 1 fault per million accesses)
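The same weighted-average form shows how brutally disk latency dominates; this sketch uses the example's numbers (200-cycle memory access, 20-million-cycle fault):

```python
def eat_with_faults(fault_rate: float,
                    mem_cycles: int = 200,
                    fault_cycles: int = 20_000_000) -> float:
    """Average cycles per access when a fraction `fault_rate` of
    accesses page-fault to disk."""
    return (1 - fault_rate) * mem_cycles + fault_rate * fault_cycles

# A 0.01% fault rate inflates 200 cycles per access to ~2,200.
assert round(eat_with_faults(0.0001)) == 2200
```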
The working set model provides insight into paging performance. A program's working set is the collection of pages it accesses within a time window. If physical memory can hold the working set, page faults occur only when transitioning between phases of execution. If physical memory is smaller than the working set, thrashing occurs: the system constantly pages data in and out, spending more time on page faults than useful work. Operating systems attempt to allocate enough physical memory per process to contain its working set.
Demand paging, the most common virtual memory strategy, loads pages only when they are accessed, not in advance. When a process starts, none of its pages are in memory. As the program executes and touches pages, page faults bring them in. This lazy approach avoids loading pages that may never be used. Prepaging attempts to predict which pages will be needed and load them before they are accessed, reducing fault rates but risking wasted effort if predictions are wrong.
Bringing It All Together
The memory system of a modern computer is a marvel of engineering, spanning multiple technologies and abstraction layers that work in concert to provide the illusion of fast, large, and cheap memory. From the six-transistor SRAM cells in caches to the single-transistor DRAM cells in main memory, from the persistent storage of Flash to the protection mechanisms of virtual memory, each component serves a specific purpose in the hierarchy. Understanding how these pieces fit together is essential for comprehending computer architecture and writing efficient software.
At the foundation lie the semiconductor memory technologies. SRAM provides the speed necessary for caches, with its six-transistor cells maintaining state through active feedback, enabling access times of a few nanoseconds. DRAM trades speed for density, using single-transistor capacitive storage that requires periodic refresh but enables gigabytes of capacity at reasonable cost. Read-only memories from mask ROM through Flash provide persistent storage with varying degrees of programmability and performance. Each technology occupies its niche in the hierarchy based on its speed, density, cost, and volatility characteristics.
Memory organization transforms individual chips into functional systems. Chips combine in parallel to provide the required word width and in series to provide the required capacity. Bank selection and chip enables route addresses to the appropriate devices. Memory modules like DIMMs package these components into standardized, replaceable units. Inside DRAM chips, row and column multiplexing reduces pin count while organizing storage as two-dimensional arrays. These organizational details determine how the theoretical capabilities of memory technologies translate into practical system performance.
Caching bridges the processor-memory speed gap through careful exploitation of locality. The cache hierarchy with L1, L2, and L3 levels provides progressively larger but slower storage, with each level catching accesses that miss at the level above. Cache organization into lines, sets, and ways determines how addresses map to storage locations. Direct-mapped caches offer simplicity at the cost of conflicts. Set-associative caches provide flexibility with moderate hardware complexity. Fully-associative caches eliminate conflict misses but require expensive parallel tag comparison. Address breakdown into tag, index, and offset fields enables efficient lookup and data retrieval.
Replacement policies become critical when all available cache locations are occupied. LRU provides excellent performance by tracking access recency but requires complex hardware for high associativity. FIFO simplifies implementation by tracking only load order but fails to adapt to access patterns. Random replacement provides robustness with minimal hardware. Pseudo-LRU algorithms approximate true LRU with reduced cost. The choice of replacement policy reflects the tradeoff between hit rate optimization and implementation complexity.
Performance analysis quantifies the value of caching. Average memory access time calculations combine hit times, miss rates, and miss penalties across cache levels. A well-designed cache hierarchy reduces effective access time from hundreds of cycles to just a few cycles. Small changes in miss rates can dramatically impact overall performance, making cache optimization crucial. The interaction between cache parameters like size, associativity, block size, and replacement policy determines the hit rate for a given workload.
Memory interleaving increases bandwidth by distributing consecutive addresses across multiple banks that operate concurrently. Low-order interleaving spreads sequential accesses across banks automatically, maximizing throughput for common access patterns. High-order interleaving groups consecutive addresses together, benefiting workloads that access large contiguous blocks. The choice of interleaving scheme depends on expected access patterns and must consider bank conflicts when stride equals multiples of the bank count.
Virtual memory provides the abstraction that makes modern operating systems possible. Each process receives a private virtual address space that is mapped to physical memory through page tables. Address translation via the MMU converts virtual addresses to physical addresses transparently. Page table entries contain not just physical frame numbers but also protection bits and status flags. Multi-level page tables solve the size problem by allocating intermediate tables only for used regions of the virtual address space.
The Translation Lookaside Buffer caches recent address translations, eliminating the overhead of page table walks for most memory accesses. TLB reach, determined by the number of entries and page size, defines how much memory can be referenced without misses. High TLB hit rates keep virtual memory overhead low, while TLB misses trigger multi-level page table walks. Page faults when accessing non-resident pages involve disk access and cost millions of cycles, making page fault rates critical to system performance.
Every component of the memory hierarchy involves fundamental tradeoffs. SRAM versus DRAM trades speed for density. Cache size versus access time trades capacity for latency. Associativity versus complexity trades flexibility for implementation cost. Page size versus table size trades TLB reach for internal fragmentation. These engineering decisions shape the performance characteristics of every modern computer system.
The journey from understanding what caches are to analyzing virtual memory performance reveals the elegance and complexity of memory system design. Whether you are optimizing a tight inner loop to maximize cache hits, configuring page sizes for large scientific applications, debugging a performance problem caused by TLB thrashing, or designing the next generation of processors, deep knowledge of the entire memory hierarchy is essential. As processors continue to grow faster while memory speeds lag behind, the memory system will remain the critical bottleneck and the site of ongoing innovation in computer architecture.