This article has been machine-translated from Chinese. The translation may contain inaccuracies or awkward phrasing. If in doubt, please refer to the original Chinese version.
Chapter 3: Multi-Level Memory Systems
This chapter covers a broad range of material on memory types and storage methods. Key topics include basic memory concepts, DRAM, SRAM, cache, hit rate and average access time, main memory-to-cache mapping methods, and virtual memory.
3.1 Memory Overview
3.1.1 Classification of Memory
Memory is a device in a computer system for storing programs and data.
Storage media: Currently mainly semiconductor devices and magnetic materials.
Storage bit cell: A bistable semiconductor circuit, a CMOS transistor, or a magnetic material storage element can each store one binary code bit. This binary code bit is the smallest storage unit in memory, called a storage bit cell.
Storage unit: Multiple storage bit cells form a storage unit. Many storage units form a memory.
Different classification methods exist based on storage material properties and usage:
(1) By storage media: magnetic surface / semiconductor memory
(2) By access method: random access / sequential access (e.g., magnetic tape)
(3) By read/write capability: Read-Only Memory (ROM) and Random Access Memory (RAM)
(4) By information volatility: volatile and non-volatile
(5) By role in the storage system: main / auxiliary / cache / control memory
3.1.2 Memory Hierarchy
Current memory characteristics:
Fast memory is expensive and has small capacity;
Cheap memory is slow but has large capacity.
When designing computer memory architecture, we desire large capacity, high speed, and low cost. Therefore, memory system design should balance capacity, speed, and cost, resulting in a multi-level memory architecture, as shown below:
Cache (high-speed buffer memory) is a high-speed, small-capacity semiconductor memory in the computer system.
Main memory (main storage) is the primary memory of the computer system, used to store large amounts of programs and data during operation.
External memory (auxiliary storage) is large-capacity auxiliary storage.
3.1.3 Technical Specifications of Main Memory
Word storage unit: A storage unit that holds one machine word; the corresponding unit address is called a word address.
Byte storage unit: A unit that holds one byte; the corresponding address is called a byte address.
Storage capacity: The total number of storage units a memory can hold. Greater capacity means more information can be stored.
Access time (also called memory access time): The time from when a read command is issued until the operation completes and the data appears on the data bus. Write operation time is usually taken as equal to read operation time, hence the term memory access time.
Memory cycle: The minimum time interval required to initiate two consecutive read operations. Typically, the memory cycle is slightly longer than the access time, measured in ns.
Memory bandwidth: The amount of information stored or retrieved per unit time, usually measured in bits/second or bytes/second.
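To make these specifications concrete, here is a minimal Python sketch; the data-bus width and memory cycle values are assumed for illustration, not taken from the text:

```python
# Relating the specs in 3.1.3: a memory with a 64-bit data bus
# and a 100 ns memory cycle (both values assumed).

data_width_bits = 64    # bits transferred per memory cycle (assumed)
memory_cycle_ns = 100   # minimum interval between two reads (assumed)

# Bandwidth = information stored or retrieved per unit time
bandwidth_bits_per_s = data_width_bits / (memory_cycle_ns * 1e-9)
print(f"Bandwidth: {bandwidth_bits_per_s / 1e6:.0f} Mbit/s "
      f"({bandwidth_bits_per_s / 8 / 1e6:.0f} MB/s)")   # 640 Mbit/s, 80 MB/s
```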
3.2 SRAM (Static Random Access Memory)
Currently widely used main memory (internal storage) is semiconductor memory. Based on different information storage mechanisms, it can be divided into two types:
Static RAM (SRAM): Fast access speed, but storage capacity is not as large as DRAM
Dynamic RAM (DRAM): Slightly slower access speed, but storage capacity is larger than SRAM.
3.2.1 Basic Static Storage Cell Array
Storage bit cell: A latch (flip-flop). As long as DC power supply is continuously applied to this memory circuit, it indefinitely maintains the stored 1 state or 0 state. If power is cut off, the stored data (1 or 0) will be lost.
Three groups of signal lines (Key Topic): Address lines, Data lines (row lines, column lines), Control lines
Address lines: If there are 6 lines, the memory capacity is 2^6 = 64 storage units
Data lines: If there are 4 lines, the word length is 4 bits, so the total number of storage bit cells = 64 × 4 = 256.
Control lines: R/~W control line specifies whether to read from or write to memory
The address decoder has 64 output selection lines, called row lines, whose function is to open the input NAND gates of each storage bit cell.
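A quick Python check of the capacity arithmetic in this subsection (6 address lines, 4 data lines):

```python
# Capacity arithmetic for the 3.2.1 example.
address_lines = 6
data_lines = 4   # word length in bits

storage_units = 2 ** address_lines            # 2^6 = 64 addressable units
total_bit_cells = storage_units * data_lines  # 64 x 4 = 256 storage bit cells
print(storage_units, total_bit_cells)         # 64 256
```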
3.2.2 Basic SRAM Logic Structure
Most SRAM chips use dual decoding to organize larger storage capacities.
Two-level decoding is employed: The address is divided into x-direction and y-direction parts as shown.
Storage array (256 rows x 128 columns x 8 bits)
Address decoder
Uses dual decoding (to reduce the number of selection lines).
A0~A7 are row address decode lines
A8~A14 are column address decode lines
Bidirectional data lines: 8 lines
3.2.3 Read/Write Cycle Waveforms
Example 1: The figure shows the write timing diagram for SRAM. R/W is the read/write command control line. When the R/W line is at low level, the memory writes data from the data lines to memory at the given address. Please identify the errors in the write timing diagram and draw the correct write timing diagram.
Solution: The timing signals for writing to memory must be synchronized. Typically, when the R/W line applies a negative pulse, the address and data lines must have stable levels. When the R/W line reaches low level, data is immediately stored. Therefore, if data lines change while the R/W line is low, the memory will store the new data. Similarly, if address lines change while the R/W line is low, the same data will be stored at the new address.
The correct write timing diagram is shown in figure (b).
3.3 DRAM (Dynamic Random Access Memory)
The storage cell of SRAM is a flip-flop with two stable states.
The storage cell of DRAM consists of one MOS transistor and a capacitor forming a memory circuit.
The MOS transistor serves as a switch, and whether the stored information is 1 or 0 is determined by the charge on the capacitor.
When the capacitor is fully charged, it represents stored 1;
When the capacitor is discharged and has no charge, it represents stored 0
DRAM differs from SRAM in the following ways:
Addition of row address latches and column address latches. Since DRAM has very large capacity, the address line width must increase accordingly, which would increase the number of address pins on the chip. To avoid this, time-multiplexed address transmission is used. If the address bus width is 10 bits, first transmit address A0-A9, latched by the Row Address Strobe (RAS) signal into the row address latch; then transmit address A10-A19, latched by the Column Address Strobe (CAS) signal into the column address latch. Combined internally, the address width reaches 20 bits, giving a storage capacity of 1M x 4 bits.
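The following sketch illustrates the time-multiplexing scheme just described; the bit assignments (A0-A9 as the row half latched by RAS, A10-A19 as the column half latched by CAS) follow the example above, while the sample address value is arbitrary:

```python
# Time-multiplexed DRAM addressing: a 20-bit internal address is
# delivered over 10 shared address pins in two steps.

def split_dram_address(addr20):
    """Split a 20-bit address into the two 10-bit halves presented
    on the shared address bus."""
    row = addr20 & 0x3FF           # A0-A9, latched by RAS into the row latch
    col = (addr20 >> 10) & 0x3FF   # A10-A19, latched by CAS into the column latch
    return row, col

print(split_dram_address(0xABCDE))   # (222, 687)
# Internally the chip recombines them: 2^20 = 1M addressable units
# reached through only 10 address pins.
```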
Addition of a refresh counter and associated control circuits. DRAM must be refreshed after reading, and unaccessed storage cells also need periodic refreshing. Refreshing is done by row, so the refresh counter length equals the row address latch length. Refresh operations alternate with read/write operations, so a 2-to-1 multiplexer provides either the refresh row address or the normal read/write row address.
3.3.3 Read/Write Cycle, Refresh Cycle (Key Topic)
Read/Write Cycle
The read cycle and write cycle are defined as the time from the falling edge of the RAS signal to the falling edge of the next RAS signal, i.e., the time interval between two consecutive read cycles. Typically, read cycle and write cycle times are equal for control convenience.
Refresh Cycle
DRAM storage bit cells are based on capacitor charge storage, and this charge decreases with time and temperature, so periodic refresh is necessary to maintain the correct stored information.
There are two refresh methods: concentrated refresh and distributed refresh.
Concentrated Refresh
All rows of DRAM are refreshed within each refresh period. For example, for memory with an 8ms refresh period, concentrated refresh of all rows must occur every 8ms. The 8ms period is divided into two parts: the first part for normal read/write operations, and the latter part (from the end of normal read/write cycles to 8ms) for concentrated refresh operations.
Distributed Refresh
Each row's refresh is interleaved with normal read/write cycles. For example, the DRAM shown in p70 Figure 3.7 has 1024 rows. If the refresh period is 8 ms, one row must be refreshed every 8 ms / 1024 ≈ 7.8 µs. Distributed refresh has no dead time!
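The refresh arithmetic for both methods can be checked with a short script; the 0.5 µs cycle time used for the concentrated-refresh dead time is an assumed figure, not from the text:

```python
# Refresh timing for a 1024-row DRAM with an 8 ms refresh period.
rows = 1024
refresh_period_ms = 8.0

# Distributed refresh: one row every period/rows, no dead time.
interval_us = refresh_period_ms * 1000 / rows
print(f"Distributed: refresh one row every {interval_us:.1f} us")   # 7.8 us

# Concentrated refresh: with an assumed 0.5 us cycle, the tail of each
# 8 ms period is dead time spent refreshing all rows back to back.
cycle_us = 0.5
print(f"Concentrated: {rows * cycle_us:.0f} us of dead time per period")
```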
3.4 Read-Only Memory (ROM) and Flash Memory
1. Mask ROM (MROM)
ROM with fixed storage content, provided by the manufacturer. Once the ROM chip is made, its storage content cannot be changed. Used for storing widely-used programs or data with standard functions, or user-customized programs or data with special functions (all in binary code).
Advantages: High reliability and integration density, low cost
Disadvantage: Cannot be rewritten
2. Programmable ROM
Users can modify storage content.
Based on different programming operations, programmable ROM can be divided into:
One-Time Programmable (PROM)
Feature: Users can modify certain storage cells in the product; users can program once.
Advantage: Can be programmed according to user needs
Disadvantage: Can only be written once
UV-Erasable Programmable (EPROM)
Storage content can be written as needed; when updates are required, the original content is erased and new content is written.
Electrically Erasable Programmable (EEPROM)
3. Flash Memory
Flash memory is a high-density, non-volatile read/write memory.
High density means it has huge bit-count storage capacity.
Non-volatile means stored data can be preserved for extended periods without power.
It has the advantages of both RAM and ROM, representing an epoch-making advancement in storage technology.
Flash memory basic operations include programming, reading, and erasing.
3.5 Parallel Memory (Key Topic)
The speed mismatch between CPU and main memory has become a major problem limiting high-speed computer design.
To improve the data transfer rate between CPU and main memory, besides using higher-speed technology for main memory to reduce read time, parallel memory technology can also be employed.
Dual-port memory gets its name from having two independent sets of read/write control circuits for the same memory.
Due to parallel independent operations, it is a high-speed memory very useful in research and engineering.
As an example, the logic block diagram of dual-port memory IDT7133 is shown below.
2. Conflict-Free Read/Write Control
When two ports have different addresses, read/write operations on both ports will definitely not conflict.
When either port is selected and driven, access to the entire memory is possible. Each port has its own chip select control (CE) and output drive control (OE).
During read operations, the port's OE (active low) opens the output driver, and data read from the storage matrix appears on the I/O lines.
3. Conflicting Read/Write Control
When two ports simultaneously access the same storage unit, read/write conflicts occur.
To resolve this, a BUSY flag is set. In this situation, the on-chip arbitration logic decides which port gets priority for read/write operations, while the other delayed port has its BUSY flag set (BUSY goes low), i.e., it is temporarily closed.
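As a rough illustration only (not the IDT7133's actual arbitration logic), the decision can be sketched as follows; the grant policy chosen here is arbitrary:

```python
# A toy model of dual-port conflict handling.
def arbitrate(addr_left, addr_right):
    """Return (busy_left, busy_right); True means that port must wait."""
    if addr_left != addr_right:
        return False, False              # different units: no conflict
    # Same unit: grant the left port here; the real chip's on-chip
    # arbitration picks a winner and pulls the loser's BUSY line low.
    return False, True

print(arbitrate(0x10, 0x20))   # (False, False) - no conflict
print(arbitrate(0x10, 0x10))   # (False, True)  - right port is delayed
```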
3.5.2 Multi-Module Interleaved Memory
1. Modular Organization of Memory
A main memory composed of several modules is linearly addressed. These addresses can be arranged in two ways among modules: sequential and interleaved
[Example] M0-M3, a total of four modules, each module has 8 words
Sequential mode: M0: 0-7
M1: 8-15
M2: 16-23
M3: 24-31
5-bit address organization: X X X X X
High bits select the module, low bits select the address within the module
Feature: When one module performs access, other modules do not work.
Advantage: If one module fails, other modules can continue working normally. Expanding memory capacity by adding modules is convenient.
Disadvantage: Modules work serially, limiting memory bandwidth.
[Example] M0-M3, a total of four modules, each module has 8 words
Interleaved mode: M0: 0, 4, 8, ..., 28 (remainder 0 when divided by 4)
M1: 1, 5, 9, ..., 29 (remainder 1 when divided by 4)
M2: 2, 6, 10, ..., 30 (remainder 2 when divided by 4)
M3: 3, 7, 11, ..., 31 (remainder 3 when divided by 4)
5-bit address organization: X X X X X
High bits select the address within the module, low bits select the module
Feature: Consecutive addresses are distributed across adjacent different modules; addresses within the same module are not consecutive.
Advantage: For block transfer of consecutive words, multi-module pipelined parallel access can be achieved, greatly improving memory bandwidth. Used for batch data reading.
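Both arrangements reduce to simple arithmetic on the linear address, as this sketch of the two M0-M3 examples shows:

```python
# Address decoding for 4 modules of 8 words each (the M0-M3 examples).
MODULES, WORDS = 4, 8

def sequential(addr):
    """High bits select the module, low bits the offset within it."""
    return divmod(addr, WORDS)           # (module, offset)

def interleaved(addr):
    """Low bits select the module, high bits the offset within it."""
    offset, module = divmod(addr, MODULES)
    return module, offset

for addr in (0, 1, 7, 8, 31):
    print(addr, sequential(addr), interleaved(addr))
# e.g. address 8 -> module 1 sequentially, but module 0 (offset 2) interleaved
```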
2. Basic Structure of Multi-Module Interleaved Memory
The figure below shows the structural block diagram of a four-module interleaved memory. Main memory is divided into 4 mutually independent, equally-sized modules M0, M1, M2, M3, each with its own read/write control circuit, address register, and data register, each communicating with the CPU in an equal manner. Ideally, if program segments or data blocks are accessed consecutively in main memory, the main memory access speed will be greatly improved.
3.6 Cache Memory (Key Topic)
3.6.1 Basic Principles of Cache
1. Cache Functions
An important technology adopted to solve the speed mismatch between CPU and main memory
A small-capacity, high-speed buffer memory between CPU and main memory
Based on the locality principle of program access
Can provide instructions and data to the CPU at high speed, thereby accelerating program execution.
To pursue high speed, all functions including management are implemented in hardware.
Locality Principle of Program Access
The phenomenon where a program frequently accesses storage addresses within a local range during a relatively short time interval, while rarely accessing addresses outside this local range, is called program locality.
Generally, cache uses high-speed SRAM, which is more expensive than main memory, but because its capacity is much smaller than main memory, it can better resolve the conflict between speed and cost.
2. Basic Cache Principles
Cache design basis: Data accessed by the CPU this time has a high probability of being accessed near the same location next time. (Locality of program access)
Data exchange between CPU and Cache is word-based
Data exchange between main memory and Cache is block-based
When the CPU reads a word from memory, it sends the memory address to both Cache and main memory. The Cache control logic determines based on the address whether the word is currently in Cache. If yes, the word is immediately delivered to the CPU; if no, during the main memory read cycle, the word is read from main memory and sent to the CPU, while simultaneously the entire data block containing this word is transferred from main memory to cache.
In the figure below, the cache is divided into 4 lines, each containing 4 words. Addresses allocated to cache are stored in a Content Addressable Memory (CAM), a memory searched by content rather than by address. When the CPU executes a memory access instruction, it sends the address of the word to be accessed to both the CAM and main memory. The CAM compares the address by content; if the word is not in cache, the word is fetched from main memory and delivered to the CPU, and simultaneously the line of 4 consecutive words containing that word is loaded into cache.
3. Cache Structure
Cache data blocks are called lines, denoted Li, where i=0, 1, ..., m-1
Main memory data blocks are called blocks, denoted Bj, where j=0, 1, ..., n-1
Lines and blocks are of equal length, each line (block) contains k main memory words
Cache consists of data memory and tag memory
Data memory: Stores data from one main memory data block
Tag memory: Stores the address information of where the data resides in main memory
4. Hit and Miss
Hit:
Main memory block loaded into cache
Main memory block and cache block have established a mapping relationship
The main memory block number mapped to a cache block is recorded by a tag
Miss:
Main memory block not loaded into cache
Main memory block and cache block have not established a mapping relationship
Hit Rate:
From the CPU's perspective, the goal of adding a cache is to make the average memory access time approach the cache's read time.
To achieve this, the proportion of memory accesses satisfied by cache should be very high, i.e., the cache hit rate should approach 1.
During program execution, let Nc represent the total number of accesses completed by cache, Nm represent the total number of accesses completed by main memory, and h is defined as the hit rate: h = Nc / (Nc + Nm)
If Tc represents the cache access time on a hit, Tm represents the main memory access time on a miss, and (1 − h) represents the miss rate, then the cache/main memory system's average access time Ta is: Ta = h·Tc + (1 − h)·Tm
Our goal is to make the cache/main memory system's average access time Ta as close to Tc as possible with minimal hardware cost.
Let r = Tm / Tc represent how many times slower main memory is than cache, and let e = Tc / Ta represent access efficiency. Then: e = Tc / Ta = 1 / (h + (1 − h)·r)
From the expression, to improve access efficiency, the hit rate h should be as close to 1 as possible, and r should be 5-10, not too large.
The hit rate h depends on program behavior, cache capacity, organization, and block size.
Example 6: When the CPU executes a program, cache completes 1900 accesses and main memory completes 100 accesses. Given that cache access cycle is 50ns and main memory access cycle is 250ns, find the cache/main memory system efficiency and average access time.
Solution: h = Nc / (Nc + Nm) = 1900 / (1900 + 100) = 0.95
r = Tm / Tc = 250 ns / 50 ns = 5
e = 1 / (h + (1 − h)·r) = 1 / (0.95 + 0.05 × 5) = 1 / 1.2 ≈ 83.3%
Since e = Tc / Ta,
Ta = Tc / e = 50 ns / 0.833 ≈ 60 ns
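The same arithmetic, expressed as a short script that reproduces Example 6's numbers:

```python
# Reproducing Example 6: hit rate, efficiency, and average access time.
Nc, Nm = 1900, 100        # accesses completed by cache and by main memory
Tc, Tm = 50, 250          # access cycles in ns

h = Nc / (Nc + Nm)        # hit rate = 0.95
r = Tm / Tc               # main memory is 5x slower
e = 1 / (h + (1 - h) * r) # access efficiency ~ 0.833
Ta = h * Tc + (1 - h) * Tm
print(h, r, round(e, 3), Ta)   # 0.95 5.0 0.833 60.0 (ns)
```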
3.6.2 Address Mapping Between Main Memory and Cache
Compared to main memory, cache has very small capacity; its contents are only a subset of main memory, and cache-main memory data exchange is block-based.
To place main memory blocks into cache, some method must be used to map main memory addresses to cache, called address mapping.
Regardless of the mapping method chosen, both main memory and cache must be divided into "blocks" of the same size.
When choosing a mapping method, consider:
Is the hardware easy to implement?
Is address translation fast?
Is main memory space utilization high?
What is the probability of conflict when loading a block?
Below we introduce three mapping methods:
Fully associative mapping, direct mapping, set-associative mapping
Fully Associative Mapping
The address (block number) and content (words) of a main memory block are stored together in a cache line, with the block address stored in the tag portion of the cache line. Any main memory block can be mapped to any cache line.
The CPU memory access instruction specifies a memory address (composed of block number and word)
For fast retrieval, the block number in the main memory address is compared simultaneously with the tags of all cache lines. A match indicates a hit, and the word is read from cache at the word address; on a miss, the word is read from main memory at the memory address
Conversion formulas:
Main memory address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of main memory blocks = 2^s
Tag size = s bits
Number of cache lines = not determined by the address format
Advantages: Any main memory block can be copied to any cache line, very flexible, high cache space utilization, low cache block conflict probability
Disadvantages: Comparators are difficult to implement, requiring a fast but expensive associative memory, only suitable for small-capacity cache
Direct Mapping
A many-to-one mapping relationship
A main memory block can only be copied to one specific cache line position
Cache line number i and main memory block number j have the following relationship:
i = j mod m (m is the total number of cache lines)
Content of main memory block j is copied to cache line i
Direct Mapping Retrieval Process
Use the line number to select the corresponding line;
Compare the line tag with the CPU access address. A match indicates a hit, accessing Cache; if not hit, access main memory and write the corresponding block into Cache
Conversion formulas
Main memory address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of main memory blocks = 2^s
Number of cache lines = m = 2^r
Tag size = (s − r) bits
Advantages: Only one tag comparator is needed instead of m, so hardware implementation is simple. The cache address is simply the lower bits of the main memory address, requiring no translation.
Disadvantages: High conflict probability
Application: Suitable for large-capacity cache
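A minimal sketch of direct-mapped address decoding; the field widths (s = 12, w = 7, r = 6) are illustrative choices, not values from the text:

```python
# Direct-mapped address decoding with assumed field widths:
# s = 12 (block number), w = 7 (word), r = 6 (64 cache lines).
s, w, r = 12, 7, 6

def direct_map(addr):
    word = addr & ((1 << w) - 1)      # low w bits: word within the block
    block = addr >> w                 # main memory block number j
    line = block % (1 << r)           # i = j mod m
    tag = block >> r                  # remaining (s - r) bits: the tag
    return tag, line, word

print(direct_map(0b000001_000010_0000011))   # (1, 2, 3)
```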
Set-Associative Mapping
The advantages and disadvantages of fully associative mapping and direct mapping are exactly opposite. In terms of placement flexibility and hit rate, the former is superior; in terms of simpler comparator circuits and hardware investment, the latter is better.
Cache is divided into u sets, each set having v lines
Between main memory blocks and cache sets, direct mapping is used, i.e., which set a main memory block goes to is fixed; within each set, fully associative mapping is used, i.e., a main memory block can map to any line within a fixed cache set
Cache set number q and main memory block number j: q = j mod u (u is the total number of cache sets)
Content of main memory block j is copied to some line in cache set q
Address translation: Given a main memory address x, to check whether it is in cache, first compute the set number y = j mod u from x's block number j, then compare tags only within set y
Easier to implement than fully associative, low conflict rate.
If v=1, it becomes direct mapping.
If u=1, it becomes fully associative mapping.
v is generally small, usually a power of 2 (typical values are 2, 4, 8, 16), referred to as v-way set-associative cache.
Conversion formulas
Main memory address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of main memory blocks = 2^s
Lines per set = v
Number of sets u = 2^d
Number of cache lines = u·v
Tag size = (s − d) bits
Example 6 (p96): A set-associative cache consists of 64 lines, with 4 lines per set. Main memory contains 4K blocks, each block containing 128 words. Show the format of a main memory address.
Solution: Block size = line size = 2^w words = 128 = 2^7, so w = 7
Lines per set = v = 4
Number of cache lines = u·v = 64, so number of sets u = 64 / 4 = 16
u = 2^d = 16 = 2^4, so d = 4
Number of main memory blocks = 2^s = 4K = 2^2 × 2^10 = 2^12, so s = 12
Address format (s + w = 19 bits):
Tag (s − d): 8 bits | Set number (d): 4 bits | Word number (w): 7 bits
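The field widths can be verified programmatically:

```python
# Checking the Example 6 (p96) field widths.
import math

lines, v = 64, 4
blocks, block_words = 4 * 1024, 128

w = int(math.log2(block_words))   # 7: word field
u = lines // v                    # 16 sets
d = int(math.log2(u))             # 4: set field
s = int(math.log2(blocks))        # 12: block number width
tag = s - d                       # 8: tag field
print(tag, d, w)                  # 8 4 7 -> a 19-bit main memory address
```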
3.6.3 Replacement Strategies
Cache operating principles require keeping the most recent data as much as possible.
When a new memory block needs to be copied into cache but all permissible line positions are occupied by other main memory blocks, replacement is needed.
Direct mapping: No choice exists; the new block simply replaces the one line it maps to
Fully associative and set-associative: Select one line from the specific lines that can hold the new main memory block for eviction.
Three commonly used replacement algorithms:
Least Frequently Used (LFU) algorithm
Least Recently Used (LRU) algorithm
Random replacement
Least Frequently Used (LFU) Algorithm
Evict the line that has been accessed the fewest times within a period.
Each line has a counter. After a new line is established, counting starts from 0; each access increments the accessed line's counter by 1. When replacement is needed, the line with the smallest count value is evicted, and all counters are reset to zero.
This algorithm limits the counting period to the interval between two replacements of a specific line, and cannot reflect recent cache access patterns.
Least Recently Used (LRU) Algorithm
Evict the line that has not been accessed for the longest time recently.
Each line also has a counter. Each cache hit resets the hit line's counter to zero and increments all other lines' counters by 1. When replacement is needed, the line with the largest count value is evicted.
This algorithm protects newly copied data lines in cache, achieving a higher hit rate.
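Here is a small, self-contained sketch of the counter-based LRU just described; the miss-handling policy (fill an empty line first, otherwise evict the line with the largest count) is an assumed extension of the text's hit-handling rule:

```python
# Counter-based LRU: on each access the touched line's counter resets
# to 0 and all other lines' counters increment; the victim is the line
# with the largest count (least recently used).

class LRUCounters:
    def __init__(self, num_lines):
        self.tags = [None] * num_lines
        self.count = [0] * num_lines

    def access(self, tag):
        if tag in self.tags:                           # hit
            i = self.tags.index(tag)
        elif None in self.tags:                        # miss: fill empty line
            i = self.tags.index(None)
            self.tags[i] = tag
        else:                                          # miss: evict LRU line
            i = self.count.index(max(self.count))
            self.tags[i] = tag
        for j in range(len(self.count)):               # age every line...
            self.count[j] += 1
        self.count[i] = 0                              # ...except the one hit

cache = LRUCounters(2)
for tag in ["A", "B", "A", "C"]:   # "C" should evict "B", not "A"
    cache.access(tag)
print(cache.tags)                  # ['A', 'C']
```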
Random Replacement
Randomly select one line from the specific line positions for eviction
This strategy is easy to implement in hardware and is faster than the previous two strategies
The disadvantage is that randomly evicted data may be needed again soon, reducing hit rate and cache efficiency. However, this drawback diminishes as cache capacity increases.
3.7 Virtual Memory (Key Topic)
3.7.1 Basic Concepts of Virtual Memory
1. Real Address and Virtual Address
The address used when users write programs is called the virtual address or logical address, and its corresponding storage space is called the virtual memory space or logical address space.
The access address of the computer's physical memory is called the real address or physical address, and its corresponding storage space is called the physical storage space or main memory space.
The process of converting virtual addresses to real addresses during program execution is called program relocation.
2. Virtual Memory Access Process
User programs in virtual address space are compiled using virtual addresses and stored in auxiliary memory. During execution, the address translation mechanism loads part of the program into real memory based on the real address space allocated to the program.
For each memory access, first determine whether the part corresponding to the virtual address is in real memory:
If yes, perform address translation and access main memory using the real address;
Otherwise, use some algorithm to schedule part of the program from auxiliary memory into main memory, then access main memory in the same way.
Thus, each program's virtual address space can be much larger than the real address space, or much smaller.
The former aims to increase storage capacity, while the latter aims at address translation.
The latter typically occurs in multi-user or multi-tasking systems: real memory space is large, but individual tasks don't need large address spaces, so smaller virtual memory spaces can shorten instruction address field lengths.
Each program can have a virtual memory that has auxiliary memory capacity and access speed close to main memory. But this virtual memory is a conceptual model composed of main memory, auxiliary memory, and auxiliary memory management components, not an actual physical storage device. Virtual memory is implemented with additional hardware and software beyond main memory and auxiliary memory.
3. Similarities and Differences Between Cache and Virtual Memory
From the concept of virtual memory, the main memory-auxiliary memory access mechanism is similar to the cache-main memory access mechanism. These are two levels in the three-level storage hierarchy composed of cache, main memory, and auxiliary memory.
Between cache and main memory, and between main memory and auxiliary memory, there are auxiliary hardware and auxiliary software/hardware respectively responsible for address translation and management, enabling each level to form an integrated three-level storage hierarchy.
Cache and main memory constitute the system's internal memory, while main memory and auxiliary memory, supported by auxiliary software/hardware, constitute virtual memory.
In the three-level storage hierarchy, cache-main memory and main memory-auxiliary memory have many similarities:
Same purpose: Both are hierarchical storage systems built to improve the performance-to-cost ratio, both striving to make storage system performance approach high-speed memory while keeping cost and capacity close to low-speed memory.
Same principle: Both leverage the locality principle of program execution to move recently frequently used information blocks from relatively slow, large-capacity memory into relatively fast, small-capacity memory.
But cache-main memory and main memory-auxiliary memory also have many differences:
Different emphases: Cache primarily addresses the speed difference between main memory and CPU; regarding performance-to-cost improvement, virtual memory primarily addresses storage capacity, plus storage management, main memory allocation, and storage protection.
Different data paths: CPU has direct access paths to both cache and main memory; cache misses can directly access main memory. However, auxiliary memory relied upon by virtual memory has no direct data path to the CPU; main memory misses can only be resolved through page swapping, and the CPU ultimately still accesses main memory.
Different transparency: Cache management is entirely done by hardware, transparent to both system programmers and application programmers; virtual memory management is done jointly by software (OS) and hardware. Due to software involvement, virtual memory is not transparent to system programmers implementing storage management, but only transparent to application programmers (segment-based and segment-page management are "semi-transparent" to application programmers).
Different miss penalties: Since main memory access time is 5-10 times cache access time, while main memory access speed is typically thousands of times faster than auxiliary memory, main memory miss penalty is much greater than cache miss penalty.
4. Key Problems Virtual Memory Must Solve
Scheduling: Deciding which programs and data should be loaded into main memory.
Address mapping: Converting virtual addresses to main memory physical addresses during main memory access (called internal address translation), and converting virtual addresses to auxiliary memory physical addresses during auxiliary memory access (called external address translation) for page swapping. Additionally, main memory allocation, storage protection, and program relocation must be addressed.
Replacement: Deciding which programs and data should be evicted from main memory.
Update: Ensuring consistency between main memory and auxiliary memory. Under OS control, hardware and system software solve these problems for users, greatly simplifying application programming.
3.7.2 Page-Based Virtual Memory
1. Page-Based Virtual Memory Address Mapping
In page-based virtual memory systems, the virtual address space is divided into equal-sized pages called logical pages;
Main memory space is also divided into pages of the same size, called physical pages.
Virtual addresses consist of two fields: high field is the logical page number, low field is the intra-page address (offset);
Real memory addresses also have two fields: high field is the physical page number, low field is the intra-page address.
Virtual addresses (logical addresses) can be converted to physical addresses through a page table. The address mapping process is shown below.
In most systems, each process has one page table. Each entry in the page table corresponds to a virtual memory page, containing the main memory page address (physical page number) where that virtual page resides, and a valid bit indicating whether the logical page has been loaded into main memory.
During address translation, the logical page number is used as an offset within the page table to index the page table (treating the virtual page number as a page table array index) and find the corresponding physical page number. The physical page number serves as the high field of the real address, concatenated with the intra-page offset of the virtual address, forming the complete physical address.
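A minimal page-table walk matching this description; the page size and table contents are assumed for illustration:

```python
# Page-table walk: index by virtual page number, check the valid bit,
# concatenate the physical page number with the intra-page offset.
PAGE_BITS = 12                        # assumed 4 KB pages
page_table = {0: (True, 5), 1: (False, None), 2: (True, 9)}
# virtual page number -> (valid bit, physical page number); contents assumed

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError(f"page fault on virtual page {vpn}")
    return (ppn << PAGE_BITS) | offset   # high field ppn + intra-page offset

print(hex(translate(0x2ABC)))   # virtual page 2 -> physical page 9 -> 0x9abc
```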
The number of pages required by each process is not fixed, so page table length is variable. The common implementation stores the page table base address in a register while the page table itself resides in main memory. Since virtual address space can be very large, each process's page table can be very long. To save main memory space occupied by page tables, some systems store page tables in virtual memory, so page tables themselves undergo paging.
When a process runs, part of its page table is in main memory and part is in auxiliary memory.
Other systems use two-level page table structures. Each process has a page directory table, and each entry points to a page table. Therefore, if the page directory table length (number of entries) is m and each page table's maximum length (number of entries) is n, a process can have at most m x n pages.
In systems with large page tables, inverted page tables can also be used to implement reverse mapping from physical page numbers to logical page numbers.
2. Translation Lookaside Buffer (TLB)
Since page tables are usually in main memory, even if a logical page is already in main memory, at least two physical memory accesses are needed for one memory access, doubling virtual memory access time.
To avoid increasing the number of main memory accesses, the page table can be held at two levels, keeping its most active part in a high-speed memory to form a fast table.
This dedicated high-speed memory component for page table caching is usually called the Translation Lookaside Buffer (TLB).
The complete page table stored in main memory is called the slow table.
The TLB address mapping process is shown:
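A self-contained sketch of a TLB in front of the slow table; the fill policy here (insert on every miss, never evict) is a simplification, and all table contents are assumed:

```python
# TLB lookup before the slow (in-main-memory) page table.
PAGE_BITS = 12
page_table = {2: 9, 7: 4}   # valid virtual page -> physical page (assumed)
tlb = {}                    # the fast table: most active translations

def translate(vaddr):
    vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                       # TLB hit: no extra memory access
        ppn = tlb[vpn]
    else:                                # TLB miss: consult the slow table
        if vpn not in page_table:
            raise RuntimeError(f"page fault on virtual page {vpn}")
        ppn = page_table[vpn]
        tlb[vpn] = ppn                   # keep the active entry in the TLB
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x2ABC)))   # slow-table walk, fills the TLB -> 0x9abc
print(hex(translate(0x2DEF)))   # TLB hit, no page-table access  -> 0x9def
```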
3. Internal and External Page Tables
The page table is a translation table from virtual addresses to main memory physical addresses, usually called the internal page table.
External page table: Used for translation between virtual addresses and auxiliary memory addresses.
When a page fault occurs in main memory, the page-in operation must first locate the page in auxiliary memory, so the external page table's structure is closely tied to the auxiliary memory addressing mechanism.
For example, for magnetic disks, the auxiliary memory address includes disk drive number, head number, track number, and sector number.
Main advantages of page-based storage:
Relatively high main memory utilization
Relatively simple page tables
Relatively fast address translation
Relatively easy disk management
Main disadvantages:
Poor program modularity
Long page tables requiring large storage space
3.7.3 Segment-Based and Segment-Page Virtual Memory
1. Segment-Based Virtual Memory
A segment is a region divided according to the natural boundaries of a program, with dynamically changeable length.
Programmers divide subroutines, operands, and constants into different segments, and each program can have multiple segments of the same type.
In segment-based virtual memory systems, virtual addresses consist of a segment number and an intra-segment address (offset). Virtual-to-real address translation is implemented through a segment table.
Each program has a segment table, with each entry corresponding to one segment. Each entry contains at least three fields:
Valid bit: Indicates whether the segment has been loaded into real memory.
Segment base address: Specifies the starting address of the segment in real memory (when the segment has been loaded).
Segment length: Records the actual length of the segment. The segment length field prevents an access to the segment's address space from straying beyond the segment boundary and corrupting other segments. The segment table itself is a segment; it can be stored in auxiliary memory but generally resides in main memory.
Advantages:
(1) The logical independence of segments makes compilation, management, modification, and protection easier, and facilitates sharing among multiple programs.
(2) Segment length can be dynamically changed, allowing free scheduling for effective use of main memory space.
Disadvantages:
(1) Since segment lengths are not fixed, main memory space allocation is more complex.
(2) Many external fragments are easily left between segments, reducing storage space utilization.
(3) Since segment lengths are not necessarily powers of 2, the lowest bits of virtual and real addresses cannot simply serve as intra-segment offsets for direct concatenation with the segment number as in paging. Addition operations must be used to compute the physical address by summing the segment base address and intra-segment offset. Therefore, segment-based storage management requires more hardware support than page-based management.
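A minimal sketch of segment translation using the three segment-table fields described above (valid bit, base address, length); all values are illustrative:

```python
# Segment translation: bounds check, then base + offset (addition,
# not concatenation, since segment lengths need not be powers of 2).
segment_table = {0: (True, 0x4000, 0x0800), 1: (False, 0, 0)}
# segment number -> (valid bit, segment base address, segment length); assumed

def translate_segment(seg, offset):
    valid, base, length = segment_table[seg]
    if not valid:
        raise RuntimeError(f"segment {seg} not loaded in main memory")
    if offset >= length:                 # the segment-length field at work
        raise RuntimeError("offset beyond segment boundary")
    return base + offset                 # physical address by addition

print(hex(translate_segment(0, 0x10)))   # 0x4010
```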
2. Segment-Page Virtual Memory
Segment-page virtual memory combines segment-based and page-based virtual memory. Real memory is divided into equal-sized pages. Each program is first divided by logical structure into segments, then each segment is divided into pages matching real memory page size. Programs are loaded and evicted by page, but can be programmed, protected, and shared by segment.
[Example 1] Assume three programs with base numbers A, B, and C, whose base registers hold SA, SB, and SC respectively. Program A consists of 4 segments and Program C consists of 3 segments, so each program has its own segment table in main memory, with one page table per segment. Each segment-table entry gives the starting position of the corresponding page table, and each page-table entry holds the corresponding physical page number. The figure shows the logical-to-physical address translation in this segment-page virtual memory system. Describe the virtual-to-real address translation process.
Solution: The address translation process is as follows:
(1) The memory management unit finds the C-th entry in the segment table base register table based on base number C, obtaining Program C's segment table base address SC. Then, based on segment number S(=1), find the S-th entry in Program C's segment table to get the page table starting address b of segment S.
(2) Using the intra-segment logical page number P(=2) to index the page table, obtain the physical page number (10 in the figure).
(3) Concatenating the physical page number with the intra-page address offset yields the physical address. If the computer system has only one base register, the base number is unnecessary. When switching between multiple programs, the OS modifies the base register contents.
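The three lookup steps can be mirrored in a toy script; every table value below is illustrative, chosen to match the figure's physical page number 10:

```python
# Segment-page translation: base register -> segment table -> page
# table -> physical page, as in the Example 1 walk-through.
PAGE_BITS = 12
seg_table_base = {"C": "segtab_C"}                  # base number -> segment table
tables = {
    "segtab_C": ["ptab_C0", "ptab_C1", "ptab_C2"],  # Program C's 3 segments
    "ptab_C1": [3, 7, 10],                          # page table of segment S=1
}

def translate(base_no, seg, page, offset):
    seg_table = tables[seg_table_base[base_no]]     # step 1: find segment table
    page_table = tables[seg_table[seg]]             # step 1: page table start b
    ppn = page_table[page]                          # step 2: physical page no.
    return (ppn << PAGE_BITS) | offset              # step 3: concatenate

print(hex(translate("C", 1, 2, 0x0AB)))   # physical page 10 -> 0xa0ab
```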
As can be seen, the disadvantage of segment-page virtual memory is that multiple table lookups are needed during virtual-to-real address mapping, resulting in relatively high implementation complexity.
3.7.4 Virtual Memory Replacement Algorithms
When pages are loaded from auxiliary memory to main memory while main memory is full, main memory page replacement is also needed. Virtual memory replacement algorithms are similar to cache replacement algorithms: FIFO, LRU, LFU, etc.
Virtual memory replacement algorithms differ from cache replacement algorithms in:
(1) Cache replacement is entirely hardware-implemented, while virtual memory replacement has OS support.
(2) Virtual memory page faults have a much greater impact on system performance than cache misses, because page loading requires accessing auxiliary memory and task switching.
(3) Virtual memory page replacement has wide selection latitude; any page belonging to a process can be replaced.
Chapter Summary
Memory requirements are large capacity, high speed, and low cost. To resolve these three contradictions, computers adopt multi-level memory architecture, namely cache, main memory, and external memory. The CPU can directly access internal memory (cache, main memory) but cannot directly access external memory. Memory technical specifications include storage capacity, access time, memory cycle, and memory bandwidth.
Widely used SRAM and DRAM are both semiconductor random access memories. The former is faster than the latter but not as highly integrated. Both have advantages of small size, high reliability, and low cost, with the disadvantage that data is lost when power is cut.
Read-only memory and flash memory compensate for SRAM and DRAM's disadvantage by retaining stored data even after power-off. Flash memory in particular provides high performance, low power consumption, high reliability, and mobility, representing an entirely new memory architecture.
Dual-port memory and multi-module interleaved memory are parallel memory structures. The former uses spatial parallelism, the latter temporal parallelism. Both are extensively used in research and engineering.
Cache is a high-speed buffer memory, an important hardware technology for solving the speed mismatch between CPU and main memory, and has evolved into multi-level cache systems with separate instruction cache and data cache. The cache hit rate should approach 1. Main memory-to-cache address mapping has three methods: fully associative, direct, and set-associative. Set-associative is a compromise between the first two, moderately combining the advantages of both while minimizing their disadvantages. In terms of flexibility, hit rate, and hardware investment, it is relatively ideal and has been widely adopted.
User programs are compiled using virtual addresses (logical addresses) and stored in auxiliary memory. During execution, the address translation mechanism loads part of the program into real memory (physical storage space or main memory space) based on the real address space allocated at that time. The OS, with hardware support, performs virtual-to-real address translation, a process called program relocation. Each memory access first checks whether the part corresponding to the virtual address is in real memory: if yes, address translation is performed and main memory is accessed using the real address; otherwise, part of the program is scheduled from auxiliary memory into main memory using some algorithm, then accessed the same way. For applications, if main memory hit rate is high, virtual memory access time approaches main memory access time, and virtual memory size depends only on auxiliary memory size.
Virtual memory mechanisms must solve key problems including scheduling, address mapping, and replacement. Under OS control, hardware and system software solve these problems for users, greatly simplifying application programming.
In page-based virtual memory systems, virtual address space and main memory space are divided into equal-sized pages, and page tables convert virtual addresses to physical addresses. To avoid increasing main memory access count, the page table can be cached at two levels, storing the most active parts in the Translation Lookaside Buffer (TLB).
The disadvantage of paging is that page length is unrelated to program logical size, while segmentation can divide memory space into dynamically-sized storage regions according to program natural boundaries. In segment-based virtual memory systems, virtual addresses consist of segment number and intra-segment address (offset). Virtual-to-real address translation is implemented through segment tables.
Segment-page virtual memory combines segment-based and page-based virtual memory. Programs are loaded and evicted by page but can be programmed, protected, and shared by segment. Virtual memory also addresses storage protection. In virtual memory systems, page table protection, segment table protection, and key-based protection methods are commonly used for storage region protection. Access mode protection based on how main memory information is used can also be combined.