Computer organization and architecture form the foundational blueprint that defines how computer systems are designed, constructed, and operated. This field encompasses the study of the internal workings, structural components, and operational methodologies that enable computers to process information, execute programs, and communicate with external devices.
Computer architecture refers to the abstract specifications and attributes of a computing system that are visible to a programmer or compiler writer. These include the instruction set, data representation, addressing modes, and input/output mechanisms. Architecture defines the interface between hardware and software, establishing the rules and conventions that software must follow to communicate with hardware.
Computer organization, conversely, deals with the physical implementation and operational units that realize the architectural specifications. It encompasses the detailed hardware design, including datapaths, control units, memory hierarchies, and input/output subsystems. Organization addresses how architectural specifications are implemented efficiently, considering factors like speed, power consumption, cost, and reliability.
The scope of computer organization and architecture extends across multiple abstraction levels, from digital logic circuits and microarchitecture to system-level design and parallel processing. This comprehensive view enables designers to make informed trade-offs between performance, cost, power consumption, and other critical metrics.
Understanding computer organization and architecture is essential for several reasons. System designers use this knowledge to create efficient and powerful computing systems. Software developers benefit from understanding hardware capabilities and limitations, enabling them to write optimized code. System administrators require architectural knowledge to configure and maintain computing infrastructure effectively. Researchers push boundaries by exploring novel architectural paradigms and computing models.
The distinction between computer architecture and computer organization represents a fundamental concept in computing. While these terms are often used interchangeably in casual conversation, they refer to distinct aspects of system design with different levels of abstraction, visibility, and design considerations.
Computer architecture constitutes the abstract interface between hardware and software. It defines the functional behavior of a computing system as seen by a machine language programmer or compiler writer. Key architectural elements include the instruction set architecture (ISA), which specifies available instructions, their encoding, and their effects; addressing modes that determine how operands are accessed; register set configuration and usage conventions; data type representations including integer and floating-point formats; memory organization and addressing; and input/output mechanisms and interrupt handling.
The architecture establishes a contract between hardware and software. Once defined, software can be developed to run on any implementation that adheres to the architectural specification, regardless of the underlying organization. This abstraction enables software portability across different implementations of the same architecture.
Computer organization, sometimes called microarchitecture, encompasses the physical implementation details that realize the architectural specification. These implementation choices are typically invisible to the programmer but significantly impact performance, power consumption, and cost. Organizational elements include datapath design and width, control unit implementation (hardwired or microprogrammed), memory hierarchy structure including cache sizes and organizations, pipeline depth and hazard handling mechanisms, superscalar issue width and execution unit configuration, and bus structures and interconnection networks.
The relationship between architecture and organization can be understood through analogies. Architecture resembles a building blueprint specifying room layout, while organization corresponds to construction materials and methods. Architecture defines what a processor does, while organization determines how it does it.
Multiple organizations can implement the same architecture. For example, Intel's x86 architecture has been implemented through numerous microarchitectures including 8086, 286, 386, Pentium, Core, and Xeon designs. Each implementation maintains architectural compatibility while employing vastly different organizational approaches, enabling performance improvements without requiring software redesign.
Conversely, the same organizational principles can implement different architectures. RISC design philosophy has been applied to architectures including ARM, MIPS, PowerPC, and RISC-V, each with unique instruction sets but similar organizational characteristics.
The boundary between architecture and organization can blur in some contexts. Features like cache behavior, while technically organizational, may affect performance-optimized software. Pipeline characteristics can influence instruction scheduling decisions. Nevertheless, maintaining clear conceptual separation enables modular design and progressive refinement.
The mechanical era of computing spans from ancient calculating devices through early nineteenth-century analytical engines. This period established fundamental concepts that would later be implemented electronically.
The abacus, developed around 2400 BCE in Mesopotamia, represents one of the earliest computing aids. This simple mechanical device enabled rapid arithmetic through bead manipulation, establishing principles of positional representation and manual computation.
In the seventeenth century, Wilhelm Schickard constructed the first known mechanical calculator in 1623, capable of adding and subtracting six-digit numbers. Blaise Pascal's Pascaline (1642) improved on this design, becoming the first commercially manufactured calculator. Pascal's device used geared wheels to perform addition and subtraction through mechanical digit rotation.
Gottfried Wilhelm Leibniz developed the Stepped Reckoner in 1673, advancing mechanical calculation to include multiplication and division through repeated addition and subtraction. Leibniz also formalized binary arithmetic, establishing theoretical foundations that would prove essential for electronic computing.
The nineteenth century witnessed the most sophisticated mechanical computing devices. Charles Babbage designed the Difference Engine (1822) to automatically compute polynomial functions for mathematical tables. Although never fully completed during his lifetime, the Difference Engine demonstrated automatic computation through mechanical means.
Babbage's Analytical Engine (1837) represents the watershed achievement of mechanical computing. This design incorporated fundamental architectural elements still used today: an arithmetic logic unit (the "mill"), memory (the "store"), conditional branching, and program control through punched cards. Ada Lovelace recognized the Analytical Engine's potential beyond pure calculation, developing algorithms and establishing principles of programming.
Herman Hollerith's tabulating machines (1880s) applied electromechanical technology to data processing. Using punched cards for data storage and electrical sensing for tabulation, these systems accelerated the 1890 U.S. census and established foundations for the data processing industry.
The vacuum tube era (1940s-1950s) marked the transition from mechanical to electronic computing, enabling dramatically faster operation through electronic switching.
The Atanasoff-Berry Computer (ABC, 1937-1942) pioneered several electronic computing concepts. John Atanasoff and Clifford Berry built this machine at Iowa State College, implementing binary arithmetic, regenerative memory, and electronic switching. Although not a general-purpose computer, the ABC demonstrated the viability of electronic computation.
Colossus (1943-1944), designed by Tommy Flowers and operated by the codebreakers at Bletchley Park, represented the first programmable electronic digital computer. Built for cryptographic analysis during World War II, Colossus used vacuum tubes for logic and paper tape for input, successfully breaking German Lorenz cipher traffic. The project's secrecy delayed recognition of its contributions for decades.
The Harvard Mark I (1944), developed by Howard Aiken with IBM support, was a massive electromechanical machine built from relays, switches, and rotating shafts. Measuring 51 feet long and weighing roughly five tons, it demonstrated automatic computation but was quickly surpassed by fully electronic designs.
The Electronic Numerical Integrator and Computer (ENIAC, 1943-1946), developed by J. Presper Eckert and John Mauchly at the University of Pennsylvania, stands as the first general-purpose electronic computer. ENIAC contained 17,468 vacuum tubes, occupied 1,800 square feet, and consumed 150 kilowatts of power. Programming required physical rewiring and switch setting, since the machine predated the stored-program concept.
The Manchester Baby (1948), built at the University of Manchester, executed the first stored program in June 1948. This small machine demonstrated the practical feasibility of the stored-program concept, running a simple factoring program from electronic memory.
The transistor, invented at Bell Laboratories in 1947 by John Bardeen, Walter Brattain, and William Shockley, revolutionized computer design through smaller size, lower power consumption, and greater reliability than vacuum tubes.
The TX-0 (Transistorized Experimental computer, 1956) at MIT's Lincoln Laboratory demonstrated the potential of transistor-based computing. This 18-bit machine influenced subsequent minicomputer design and provided an early interactive computing environment.
The IBM 7090 (1959), a fully transistorized version of the vacuum tube 709, established transistor-based mainframe computing. This machine offered improved reliability and reduced power consumption, making commercial computing more practical.
The PDP-1 (Programmed Data Processor, 1960) from Digital Equipment Corporation initiated the minicomputer era. Smaller and more affordable than mainframes, the PDP series brought computing to laboratories and universities. The PDP-11 (1970) became particularly influential, establishing architectural concepts such as orthogonal instruction sets and memory-mapped I/O that shaped later processors, including the Motorola 68000 and DEC's VAX.
Transistor-based systems enabled new applications and usage models. Time-sharing systems allowed multiple users simultaneous access, interactive computing replaced batch processing, and real-time applications became feasible. These developments laid groundwork for personal computing.
The integrated circuit, independently developed by Jack Kilby at Texas Instruments (1958) and Robert Noyce at Fairchild Semiconductor (1959), combined multiple transistors on a single semiconductor substrate. This innovation dramatically reduced size, cost, and power consumption while improving reliability and performance.
The IBM System/360 (1964) launched the first major computer family of the integrated-circuit generation, though its modules actually used IBM's hybrid Solid Logic Technology rather than monolithic chips. This unified architecture across multiple models established principles of compatibility and upgradeability that remain fundamental. The System/360 popularized microprogramming, enabling complex instruction sets while simplifying implementation.
The Intel 4004 (1971), the first commercially available microprocessor, integrated a complete CPU on a single chip. Designed for Busicom calculators, this 4-bit processor contained 2,300 transistors and operated at 740 kHz. The 4004 demonstrated that complete processors could be manufactured economically, initiating the microprocessor revolution.
The 8-bit processors including Intel 8008 (1972) and 8080 (1974), Motorola 6800 (1974), and Zilog Z80 (1976) enabled early personal computers. These chips provided sufficient capability for general-purpose computing while remaining affordable for hobbyists and small systems.
Very Large Scale Integration (VLSI) and Ultra Large Scale Integration (ULSI) exponentially increased transistor counts, enabling increasingly sophisticated processor designs.
The 16-bit era brought significant advances. The Intel 8086 (1978) established the x86 architecture that dominates personal computing today. The Motorola 68000 (1979) powered early Apple Macintosh computers and many other systems with its clean, 32-bit internal architecture.
32-bit processors including the Intel 80386 (1985) and Motorola 68020 (1984) provided flat addressing and enhanced performance. The ARM architecture, developed by Acorn Computers (1985), introduced RISC principles to embedded and mobile computing, eventually dominating these markets.
The 64-bit transition enabled larger address spaces and improved performance. AMD's Opteron (2003) brought 64-bit computing to the x86 architecture, while Intel's Itanium pursued a different approach with explicitly parallel instruction computing (EPIC). ARMv8-A (2011) added 64-bit support to the ARM architecture.
Multi-core processors emerged as clock frequency scaling encountered physical limitations. Dual-core processors appeared in 2005, quickly evolving to many-core designs. Contemporary processors integrate billions of transistors, combining multiple CPU cores, graphics processors, memory controllers, and specialized accelerators on single chips.
Computer generations provide a historical framework for understanding technological evolution. While precise boundaries remain debated, this classification illuminates major transitions in computing technology.
First Generation (1940s-1950s): Vacuum Tubes characterized early electronic computers. Vacuum tubes served as switching elements, enabling electronic computation but consuming substantial power and generating significant heat. These machines used magnetic drums for memory, punched cards for input/output, and machine language programming. ENIAC, UNIVAC I, and IBM 701 exemplified this generation. Reliability challenges required frequent maintenance, and programming remained accessible only to specialists.
Second Generation (1950s-1960s): Transistors replaced vacuum tubes, dramatically improving reliability and reducing power consumption. Magnetic core memory provided faster, more reliable storage. High-level languages including FORTRAN and COBOL emerged, making programming more accessible. Operating systems appeared, managing resources and enabling batch processing. IBM 7090, CDC 1604, and PDP-1 represented this generation's diverse systems.
Third Generation (1960s-1970s): Integrated Circuits combined multiple transistors on single chips, further reducing size and cost while improving performance. IBM System/360 introduced family compatibility and microprogramming. Minicomputers like PDP-8 and PDP-11 brought computing to smaller organizations. Time-sharing operating systems enabled interactive use and resource sharing.
Fourth Generation (1970s-1990s): VLSI Microprocessors placed complete CPUs on single chips. The Intel 4004 initiated this era, followed by increasingly capable processors. Personal computers became widespread with Apple II, IBM PC, and subsequent systems. Graphical user interfaces, networking, and client-server computing transformed how people used computers. Workstations from Sun, HP, and others provided powerful desktop computing.
Fifth Generation (1990s-Present): ULSI and Parallel Processing continues today with billions of transistors on single chips. Multiple processor cores, graphics processing units, and specialized accelerators provide massive computational capability. Mobile computing, cloud services, artificial intelligence, and the Internet of Things represent contemporary applications. Emerging technologies including quantum computing and neuromorphic systems may define future generations.
The stored program concept, also called the von Neumann architecture, revolutionized computing by treating program instructions as data that could be stored in memory and manipulated like any other data. This seemingly simple insight enabled flexible, reprogrammable computing systems.
John von Neumann articulated this concept in his "First Draft of a Report on the EDVAC" (1945), building on discussions with Eckert, Mauchly, and others involved in ENIAC development. The report described a stored-program computer architecture that became the template for most subsequent systems.
Key implications of the stored program concept include:
Programmability: Instructions stored in memory can be easily modified, enabling different programs to run on the same hardware without physical reconfiguration. A computer can load a word processor, spreadsheet, or game from storage into memory and execute appropriate instructions.
Self-Modifying Code: Programs can modify their own instructions during execution, enabling adaptive algorithms and efficient implementations. While modern programming practices discourage self-modification due to complexity and security concerns, the capability remains available at the machine level.
Programs as Data: Software can be processed like any other data. Compilers can read source code as input and produce executable machine code as output. Debuggers can examine and modify running programs. Operating systems can load programs from disk into memory.
Unified Memory: Instructions and data share the same memory space, simplifying hardware design and enabling flexible memory allocation. However, this sharing also creates security vulnerabilities exploited through buffer overflow attacks.
Sequential Execution: The program counter maintains execution sequence, automatically advancing through memory to fetch and execute successive instructions. Branch instructions modify the program counter to implement loops, conditionals, and subroutine calls.
The stored program concept remains fundamental despite architectural variations. Harvard architectures separate instruction and data memory for performance while maintaining the conceptual unity of stored programs. Modern systems employ sophisticated memory hierarchies and protection mechanisms while preserving the essential insight that programs reside in memory as modifiable data.
The von Neumann architecture, named for John von Neumann who documented its principles, describes the fundamental organization of most computers. This architecture specifies a processing unit, memory, input/output, and the interconnections among these components.
Central Components:
The memory unit stores both instructions and data in a single address space. Memory consists of individually addressable locations, each capable of storing a fixed number of bits (typically one byte, or 8 bits, in modern systems). Memory operations support reading and writing data at specified addresses.
The arithmetic logic unit (ALU) performs computational operations including arithmetic (addition, subtraction, multiplication, division) and logical operations (AND, OR, NOT, XOR). The ALU operates on data fetched from memory or registers.
The control unit orchestrates system operation by sequencing instruction fetch, decode, and execution. It generates control signals directing data movement and ALU operations based on instruction interpretation.
Registers provide high-speed storage within the processor. The program counter (PC) holds the memory address of the next instruction. The instruction register (IR) contains the currently executing instruction. General-purpose registers temporarily hold operands and results.
Input/output mechanisms enable communication with external devices including keyboards, displays, storage, and networks. I/O may be memory-mapped or handled through specialized instructions.
The system bus connects components, carrying address, data, and control information. The address bus specifies memory or I/O locations, the data bus transfers information, and the control bus coordinates operations.
Operational Cycle:
The von Neumann machine executes programs through repeated instruction cycles:
- Fetch: The control unit places the program counter address on the address bus and initiates a memory read. Memory returns the instruction at that address, which is loaded into the instruction register. The program counter increments to point to the next instruction.
- Decode: The control unit interprets the instruction in the instruction register, determining the operation to perform and operand locations. This may involve instruction format analysis, opcode decoding, and addressing mode interpretation.
- Execute: The control unit activates appropriate components to perform the specified operation. This may involve ALU computation, data movement between registers and memory, or I/O operations.
- Store (if needed): Results may be written back to registers or memory.
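As an illustration, the cycle above can be sketched in a few lines for a hypothetical accumulator machine; the opcodes (LOAD, ADD, STORE, HALT) and the memory layout below are invented for the example, not drawn from any real instruction set.

```python
# Minimal fetch-decode-execute loop for a hypothetical accumulator machine.
# Memory holds both instructions and data in one address space (von Neumann
# style); each cell is either an (opcode, operand) pair or a plain integer.

LOAD, ADD, STORE, HALT = range(4)  # invented opcodes for illustration

def run(memory):
    pc = 0    # program counter
    acc = 0   # accumulator register
    while True:
        instr = memory[pc]        # fetch: read the instruction at the PC
        pc += 1                   # PC advances to the next instruction
        opcode, operand = instr   # decode: split into opcode and operand
        if opcode == LOAD:        # execute the decoded operation
            acc = memory[operand]
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:     # store: write the result back to memory
            memory[operand] = acc
        elif opcode == HALT:
            return memory

# Program: mem[7] = mem[5] + mem[6], i.e. 20 + 22
program = [(LOAD, 5), (ADD, 6), (STORE, 7), (HALT, 0), 0, 20, 22, 0]
print(run(program)[7])  # -> 42
```

Note that the program and its operands occupy the same list: the stored-program idea in miniature.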
Characteristics and Limitations:
The von Neumann architecture's simplicity and flexibility contributed to its widespread adoption. However, it exhibits inherent limitations:
The von Neumann bottleneck arises from shared memory access for instructions and data. Since the single memory and bus must service both instruction fetches and data accesses, performance can be constrained when both compete for bandwidth.
Sequential execution means instructions process one at a time, limiting inherent parallelism. While pipelining and multiple execution units mitigate this limitation, the fundamental sequential model persists.
Memory access latency creates performance challenges as processor speeds outpace memory improvements. Cache hierarchies partially address this gap through temporal and spatial locality exploitation.
The Harvard architecture, originating from the Harvard Mark I electromechanical computer, separates instruction and data memory into distinct address spaces with independent access paths. This organization addresses the von Neumann bottleneck by enabling simultaneous instruction fetch and data access.
Distinguishing Features:
Separate memory spaces maintain instructions in one memory system and data in another. Each memory has its own address space, bus connections, and access paths. Processors can fetch instructions while simultaneously reading or writing data.
Dedicated buses provide independent pathways for instruction and data transfers. This parallelism enables higher throughput by overlapping operations that would conflict in unified memory systems.
Specialized memory technologies can optimize each memory for its specific access patterns. Instruction memory may favor sequential access and large block transfers, while data memory may emphasize random access and low latency.
Design Variations:
Pure Harvard architecture maintains complete separation throughout the memory hierarchy, with distinct caches and memory systems for instructions and data. This approach maximizes parallelism but increases complexity and may limit flexibility.
Modified Harvard architecture, used in most modern processors, maintains separate instruction and data caches while sharing a unified main memory. This hybrid approach provides most parallelism benefits while simplifying memory management and enabling efficient use of physical memory.
Advantages:
Concurrent access enables simultaneous instruction fetch and data operations, increasing effective bandwidth and reducing structural hazards. Pipelined processors particularly benefit from this capability.
Protection mechanisms can enforce distinct access permissions for instruction and data memory, potentially improving security and reliability. Code regions can be protected from modification while data regions remain writable.
Optimized memory technologies allow instruction memory to use wider words or different timing parameters than data memory, tailoring each to its typical usage patterns.
Cache efficiency improves because instruction and data accesses follow different patterns. Separate caches avoid competition between code and data for cache space while enabling access-specific replacement policies.
Disadvantages:
Complexity increases from managing two memory systems with independent address spaces and access paths. This complexity extends to compilers, linkers, and operating systems that must coordinate both spaces.
Memory utilization may suffer when one memory space overflows while the other has available capacity, since unused space cannot be repurposed without hardware support.
Program modification becomes challenging because self-modifying code must write to instruction memory, which may be read-only or require special operations. Most modern systems discourage self-modifying code regardless of architecture.
Modified Harvard architecture represents the dominant contemporary approach, combining benefits of both von Neumann and Harvard designs. This organization maintains separate instruction and data caches while using unified main memory, achieving most Harvard advantages without its flexibility limitations.
Architectural Characteristics:
The cache hierarchy implements Harvard separation at levels closest to the processor. L1 instruction cache (I-cache) and L1 data cache (D-cache) operate independently, enabling concurrent access during instruction fetch and data operations. Higher cache levels (L2, L3) typically remain unified, sharing capacity between instructions and data.
Memory management unifies the physical address space while maintaining logical distinctions. Virtual memory mechanisms can enforce different permissions for code and data pages while mapping both to the same physical memory.
Bus structure may include separate internal pathways for instruction and data transfers between caches and processor, while external buses often remain unified for main memory access.
Operational Benefits:
Concurrent access continues through cache-level separation. Processors can simultaneously fetch instructions from I-cache while loading or storing data to D-cache, maintaining throughput advantages.
Capacity sharing at higher cache levels improves memory utilization. When instruction working set exceeds I-cache capacity but data working set is small, unified L2 can allocate more space to instructions, adapting to program behavior.
Simplified programming maintains the von Neumann model familiar to programmers. Self-modifying code, though discouraged, remains possible through cache coherence mechanisms that ensure instruction and data views remain consistent.
Implementation Considerations:
Cache coherence between instruction and data caches must be maintained when code modification occurs. Modified Harvard processors implement mechanisms to detect when data writes affect instruction memory, invalidating or updating corresponding instruction cache entries.
Memory protection can be enforced through page-level permissions. Code pages may be marked read-only and executable, while data pages permit reads and writes but disallow execution, enhancing security.
Performance monitoring must account for separate cache statistics when analyzing program behavior and optimizing performance.
Quantifying computer performance requires comprehensive metrics addressing different aspects of system capability. No single measure captures all relevant dimensions, so designers and users consider multiple metrics depending on application requirements.
Clock rate, measured in hertz (cycles per second), specifies the frequency of the clock signal that paces a processor's operations. Higher clock rates generally indicate faster instruction execution, assuming cycles per instruction remain constant.
Measurement units range from megahertz (MHz, millions of cycles per second) for simple microcontrollers to gigahertz (GHz, billions of cycles per second) for modern processors. Contemporary desktop processors operate at 3-5 GHz, while mobile processors may use lower frequencies for power efficiency.
Physical limitations constrain clock rate scaling. Higher frequencies increase power consumption (dynamic power ∝ capacitance × voltage² × frequency × activity factor) and generate more heat requiring dissipation. Clock distribution across large chips becomes challenging at high frequencies due to signal propagation delays and skew.
Diminishing returns characterize clock rate scaling beyond certain points. Pipeline depth must increase to maintain throughput at higher frequencies, but deeper pipelines incur greater branch misprediction penalties and complexity.
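As a rough illustration of the dynamic power relation, the sketch below compares two operating points; the capacitance, voltage, and activity values are purely illustrative assumptions, not measurements of any real chip.

```python
# Relative dynamic power from the scaling relation P = alpha * C * V^2 * f.
# All parameter values below are illustrative, chosen only to show that a
# modest frequency bump, which often requires a voltage bump, raises power
# faster than the frequency alone would suggest.

def dynamic_power(capacitance, voltage, frequency, activity):
    """Dynamic switching power (arbitrary units)."""
    return activity * capacitance * voltage**2 * frequency

base = dynamic_power(1.0, 1.2, 3.0e9, 0.5)  # 3 GHz at 1.2 V
fast = dynamic_power(1.0, 1.3, 4.0e9, 0.5)  # 4 GHz, assumed to need 1.3 V

print(f"{fast / base:.2f}x power for 1.33x frequency")  # -> 1.56x power for 1.33x frequency
```

The quadratic voltage term is why frequency scaling hit a power wall: the extra voltage needed for higher clocks multiplies the cost of every additional hertz.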
Cycles Per Instruction (CPI) and Instructions Per Cycle (IPC) provide complementary views of processor efficiency. CPI measures average clock cycles required per instruction executed, while IPC (the reciprocal) indicates average instructions completed per cycle.
Ideal CPI would be 1.0 for scalar processors executing one instruction per cycle, but various factors increase CPI: cache misses requiring memory access, branch mispredictions flushing pipeline stages, data hazards causing stalls, and structural hazards from resource contention.
Average CPI calculation considers instruction mix and associated penalties:
Average CPI = Σ (Instruction frequency × CPI for that instruction type)
For example, if a processor executes 30% loads (2 cycles each), 20% stores (2 cycles), 40% ALU operations (1 cycle), and 10% branches (3 cycles including misprediction penalty), average CPI = 0.3×2 + 0.2×2 + 0.4×1 + 0.1×3 = 0.6 + 0.4 + 0.4 + 0.3 = 1.7 cycles.
IPC measurement often provides more intuitive performance comparison, especially for superscalar processors executing multiple instructions per cycle. A 4-wide superscalar processor achieving 2.5 IPC completes 2.5 instructions per cycle on average, equivalent to 0.4 CPI.
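The worked example above can be checked with a short script; the instruction mix and cycle counts are taken directly from the text.

```python
# Average CPI as a weighted sum over the instruction mix:
#   Average CPI = sum(frequency_i * CPI_i) over instruction classes.

mix = {                   # class: (fraction of mix, cycles per instruction)
    "load":   (0.30, 2),
    "store":  (0.20, 2),
    "alu":    (0.40, 1),
    "branch": (0.10, 3),  # includes average misprediction penalty
}

avg_cpi = sum(frac * cycles for frac, cycles in mix.values())
ipc = 1 / avg_cpi         # IPC is simply the reciprocal of CPI

print(round(avg_cpi, 2))  # -> 1.7
print(round(ipc, 3))      # -> 0.588
```

Because the weights sum to 1.0, the result is a true average: this processor needs 1.7 cycles per instruction, or equivalently completes about 0.59 instructions per cycle.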
Million Instructions Per Second (MIPS) rates instruction execution throughput, while Million Floating-Point Operations Per Second (MFLOPS) specifically measures floating-point computational capability.
MIPS calculation depends on clock rate and average CPI:
MIPS = (Clock rate in MHz) / (Average CPI)
A 3 GHz processor averaging 1.5 CPI achieves 3000/1.5 = 2000 MIPS (2 GIPS).
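A one-line helper reproduces this calculation:

```python
# MIPS = (clock rate in Hz / average CPI) / 1e6, matching the text's formula
# (which states clock rate in MHz, so the 1e6 divisor is already folded in).

def mips(clock_hz, avg_cpi):
    """Million instructions per second for a given clock rate and CPI."""
    return clock_hz / avg_cpi / 1e6

print(mips(3e9, 1.5))  # -> 2000.0
```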
MIPS limitations include instruction set dependence (RISC processors typically achieve higher MIPS than CISC processors for equivalent workloads), program dependence (different instruction mixes yield varying MIPS), and comparability issues across architectures (MIPS doesn't account for instruction complexity differences).
MFLOPS measurement provides more focused comparison for scientific and engineering applications. Modern processors achieve hundreds of GFLOPS (billions of floating-point operations per second) through vector units and multiple cores. Peak theoretical MFLOPS depends on floating-point unit configuration, while achieved MFLOPS depends on workload characteristics and memory bandwidth.
Amdahl's Law, formulated by Gene Amdahl in 1967, quantifies the theoretical speedup achievable by improving part of a system. This fundamental principle guides parallel processing expectations and optimization efforts.
Mathematical formulation:
Speedup = 1 / [(1 - P) + P/S]
Where:
- P is the proportion of execution time affected by the improvement
- S is the speedup factor for that portion
Example: If 40% of program execution can be improved by a factor of 4, maximum speedup = 1 / [(0.6) + (0.4/4)] = 1 / [0.6 + 0.1] = 1 / 0.7 ≈ 1.43×.
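The formula and the example above translate directly into code; the second call illustrates the ceiling of 1/(1-P) when only half the program can be improved, no matter how large S grows.

```python
# Amdahl's Law: overall speedup when a fraction p of execution time
# is accelerated by a factor s. The remaining (1 - p) runs unchanged.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl_speedup(0.4, 4), 2))    # -> 1.43  (the text's example)
print(round(amdahl_speedup(0.5, 1e9), 2))  # -> 2.0   (p = 0.5 caps speedup at 2x)
```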
Key insights from Amdahl's Law include:
Diminishing returns characterize improvements to portions that don't dominate execution. Even infinite speedup of the improved portion yields maximum speedup of 1/(1-P). If P=0.5, maximum speedup is 2× regardless of how much the improved part accelerates.
Sequential bottlenecks fundamentally limit parallel speedup. The inherently sequential portion (1-P) determines maximum achievable improvement, emphasizing the importance of identifying and optimizing critical sections.
Optimization prioritization should focus on frequently executed code regions. Optimizing code executed only 1% of the time yields negligible overall improvement even with dramatic speedup.
Benchmarking provides standardized performance measurement through representative program execution. The Standard Performance Evaluation Corporation (SPEC) develops and maintains widely used benchmark suites.
SPEC CPU benchmarks measure processor, memory, and compiler performance through compute-intensive workloads. SPEC CPU 2017 includes integer and floating-point suites with applications ranging from compression and compilation to molecular dynamics and weather simulation.
Benchmark methodology establishes consistent measurement procedures:
Base metrics measure performance with standard compiler flags, representing typical user configuration.
Peak metrics allow aggressive optimization, demonstrating maximum achievable performance.
Speed metrics measure how quickly a single task completes (lower is better for response time).
Rate metrics measure throughput for multiple concurrent tasks (higher is better for capacity).
SPECviewperf evaluates graphics performance through professional application traces. SPECworkstation addresses workstation-class systems with diverse workloads. SPECpower measures energy efficiency across server loads.
Benchmarking considerations include:
Representativeness: Benchmarks should reflect target workloads. No single benchmark suits all purposes.
Repeatability: Controlled conditions enable consistent measurements for comparison across systems.
Relevance: Benchmarks must remain current with evolving applications and hardware.
Gaming prevention: Benchmarks should resist optimization specifically targeting benchmark behavior without improving real workloads.
Contemporary computer design reflects several converging trends shaped by technology evolution, application demands, and economic factors.
Multi-core scaling has replaced clock frequency scaling as the primary performance improvement path. Processor manufacturers integrate increasing core counts, from dual-core through many-core designs with dozens of cores. This trend exploits transistor abundance while managing power density and heat dissipation.
Specialized accelerators complement general-purpose cores for specific workloads. Graphics processing units (GPUs) excel at parallel graphics and compute tasks. Tensor processing units (TPUs) accelerate machine learning inference and training. Digital signal processors (DSPs) handle communications and media processing. Field-programmable gate arrays (FPGAs) enable customizable acceleration.
Heterogeneous computing combines different processor types in unified systems. ARM's big.LITTLE architecture pairs high-performance and power-efficient cores. AMD's Accelerated Processing Units (APUs) integrate CPU and GPU on single chips. Apple's M-series chips combine CPU, GPU, neural engine, and other accelerators.
Memory hierarchy evolution addresses the growing processor-memory performance gap. Larger caches, sophisticated prefetching, and near-memory processing attempt to hide memory latency. High-bandwidth memory (HBM) and stacked DRAM provide increased bandwidth for data-intensive applications.
Power-aware design has become paramount as power density limits performance scaling. Dynamic voltage and frequency scaling (DVFS) adapts operation to workload demands. Clock gating and power gating disable unused components. Dark silicon (inactive chip area due to power constraints) motivates specialized accelerator deployment.
Security integration addresses growing threats through hardware security features. Trusted execution environments, memory encryption, and side-channel mitigations protect sensitive data and computation.
Moore's Law, articulated by Gordon Moore in 1965, originally projected that transistor density would double roughly every year; Moore revised the period to about two years in 1975, and the figure is often quoted as 18-24 months. This empirical observation became a self-fulfilling prophecy driving semiconductor industry planning and investment.
Historical progression from 10-micrometer feature sizes in 1971 to 3-nanometer in 2022 represents a 3000× linear scaling factor and 9 million× area scaling. Transistor counts increased from 2,300 in Intel 4004 to over 50 billion in contemporary GPUs.
Scaling benefits included:
Performance improvement from faster transistors and reduced capacitance.
Power reduction per transistor enabling higher integration density.
Cost reduction per transistor through larger wafers and smaller features.
Dennard scaling (1974) observed that as transistors shrink, power density remains constant, enabling simultaneous frequency increases. This beneficial relationship ended around 2006 when current leakage and threshold voltage scaling limitations broke Dennard scaling.
Post-Moore era faces fundamental physical limits including:
Atomic scale barriers as feature sizes approach atomic dimensions (silicon atomic spacing ~0.5 nm). Quantum effects including tunneling become significant at these scales.
Economic challenges as fab costs exceed $10 billion for leading-edge facilities, limiting participation to few companies.
Power density constraints prevent full chip utilization, creating dark silicon and motivating specialized accelerators.
Continued innovation pursues alternative paths including:
3D integration stacking chips vertically for improved density and interconnect.
New materials such as III-V semiconductors, carbon nanotubes, and 2D materials.
Beyond-CMOS devices including spintronics, memristors, and quantum computing elements.
Architectural innovation exploiting parallelism, specialization, and heterogeneity to maintain performance scaling despite transistor scaling slowdown.
Digital computers represent information using binary digits (bits), but human communication traditionally uses decimal notation. Understanding number systems and conversions enables effective interaction with computing systems.
Decimal (base-10) uses ten digits (0-9) with positional weighting by powers of ten. The number 1234₁₀ represents 1×10³ + 2×10² + 3×10¹ + 4×10⁰.
Binary (base-2) uses two digits (0,1) with powers of two weighting. Binary 1011₂ represents 1×2³ + 0×2² + 1×2¹ + 1×2⁰ = 8 + 0 + 2 + 1 = 11₁₀.
Octal (base-8) uses eight digits (0-7) with powers of eight weighting. Octal 173₈ represents 1×8² + 7×8¹ + 3×8⁰ = 64 + 56 + 3 = 123₁₀. Octal conveniently groups binary digits in threes.
Hexadecimal (base-16) uses sixteen digits (0-9, A-F) with powers of sixteen weighting. A=10₁₀, B=11₁₀, C=12₁₀, D=13₁₀, E=14₁₀, F=15₁₀. Hexadecimal 7B₁₆ represents 7×16¹ + 11×16⁰ = 112 + 11 = 123₁₀. Hexadecimal conveniently groups binary digits in fours.
Binary to Decimal: Multiply each bit by its positional power of two and sum.
Example: 110101₂ = 1×2⁵ + 1×2⁴ + 0×2³ + 1×2² + 0×2¹ + 1×2⁰ = 32 + 16 + 0 + 4 + 0 + 1 = 53₁₀
Decimal to Binary: Repeated division by two, collecting remainders from least to most significant.
Example: Convert 53₁₀ to binary:
53 ÷ 2 = 26 remainder 1 (least significant)
26 ÷ 2 = 13 remainder 0
13 ÷ 2 = 6 remainder 1
6 ÷ 2 = 3 remainder 0
3 ÷ 2 = 1 remainder 1
1 ÷ 2 = 0 remainder 1 (most significant)
Reading remainders upward: 110101₂
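The repeated-division procedure maps onto a short Python sketch (`to_binary` is an illustrative name):

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by two."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))   # remainder is the next least significant bit
        n //= 2
    return "".join(reversed(bits))  # read remainders from last to first

# to_binary(53) reproduces the worked example: "110101"
```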
Binary to Octal: Group binary digits in threes from right, convert each group to octal digit.
Example: 110101₂ grouped as 110 101 = 6 5 = 65₈
Binary to Hexadecimal: Group binary digits in fours from right, convert each group to hex digit.
Example: 110101₂ grouped as 11 0101, pad leading group: 0011 0101 = 3 5 = 35₁₆
Octal to Binary: Convert each octal digit to three binary bits.
Example: 65₈ = 6=110, 5=101 → 110101₂
Hexadecimal to Binary: Convert each hex digit to four binary bits.
Example: 35₁₆ = 3=0011, 5=0101 → 00110101₂
Fractional Conversions extend these principles to fractional parts using multiplication instead of division.
Representing negative numbers in binary requires encoding sign information within fixed-width bit patterns. Several representations offer different trade-offs.
The most significant bit (MSB) indicates sign (0=positive, 1=negative), remaining bits represent magnitude.
Representation: For n bits, values range from -(2ⁿ⁻¹-1) to +(2ⁿ⁻¹-1). 8-bit example: +42 = 00101010, -42 = 10101010.
Advantages: Simple conceptual understanding, symmetric range, easy negation by flipping sign bit.
Disadvantages: Two representations for zero (+0=00000000, -0=10000000) complicates arithmetic, addition/subtraction requires separate sign and magnitude handling, comparator design more complex.
Negative numbers represented by bitwise complement (inverting all bits) of positive magnitude.
Representation: For n bits, values range from -(2ⁿ⁻¹-1) to +(2ⁿ⁻¹-1). 8-bit example: +42 = 00101010, -42 = 11010101 (complement of 00101010).
Advantages: Negation through bitwise NOT, addition can use end-around carry, simpler arithmetic than sign-magnitude.
Disadvantages: Dual zero representation (+0=00000000, -0=11111111), end-around carry complicates adder design, not used in modern systems.
The dominant signed representation in modern computers. Negative numbers obtained by complementing positive and adding 1.
Representation: For n bits, values range from -2ⁿ⁻¹ to +(2ⁿ⁻¹-1). 8-bit example: +42 = 00101010, -42 = 11010110 (complement 11010101 + 1 = 11010110).
Key properties:
Unique zero: Only one representation (00000000) for zero.
Natural arithmetic: Addition and subtraction use same hardware as unsigned, with automatic modulo 2ⁿ behavior.
Sign extension: Replicating sign bit preserves value when extending width.
Range asymmetry: More negative values than positive (-128 to +127 for 8-bit) due to zero occupying positive space.
Negation: Compute two's complement (invert and add 1). Note that most negative number (-128 for 8-bit) negates to itself (overflow condition).
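These properties can be checked with a minimal Python sketch of encoding and decoding (function names illustrative):

```python
def twos_complement(value: int, width: int) -> int:
    """Bit pattern of a signed integer in width-bit two's complement."""
    return value & ((1 << width) - 1)

def from_twos_complement(pattern: int, width: int) -> int:
    """Signed value of a width-bit two's-complement pattern."""
    sign_bit = 1 << (width - 1)
    return (pattern ^ sign_bit) - sign_bit
```

Note that `twos_complement(-128, 8)` and `twos_complement(128, 8)` produce the same pattern 10000000, illustrating that negating -128 overflows back to itself.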
Example calculations:
Add +42 and -42:
  00101010 (+42)
+ 11010110 (-42)
= 00000000 (carry out of 1 discarded in modulo arithmetic)

Add +42 and +42:
  00101010 (+42)
+ 00101010 (+42)
= 01010100 (+84, within range)

Add +100 and +50 (8-bit overflow):
  01100100 (+100)
+ 00110010 (+50)
= 10010110 (-106 due to overflow)
Boolean algebra, developed by George Boole in the 19th century, provides mathematical foundation for digital logic design. Variables take only two values (0/1, false/true) with operations defined by truth tables.
Basic operations:
AND (· or ∧): Output 1 only when all inputs are 1. 0·0=0, 0·1=0, 1·0=0, 1·1=1
OR (+ or ∨): Output 1 when any input is 1. 0+0=0, 0+1=1, 1+0=1, 1+1=1
NOT (¬ or ¯ or '): Inverts input. ¬0=1, ¬1=0
Derived operations:
NAND: NOT(AND) - universal gate
NOR: NOT(OR) - universal gate
XOR: Exclusive OR - output 1 when inputs differ
XNOR: Exclusive NOR - output 1 when inputs equal
Boolean laws enable expression simplification:
Identity: A·1 = A, A+0 = A
Null: A·0 = 0, A+1 = 1
Idempotent: A·A = A, A+A = A
Complement: A·Ā = 0, A+Ā = 1
Involution: ¬(¬A) = A
Commutative: A·B = B·A, A+B = B+A
Associative: (A·B)·C = A·(B·C), (A+B)+C = A+(B+C)
Distributive: A·(B+C) = A·B + A·C, A + B·C = (A+B)·(A+C)
DeMorgan's: ¬(A·B) = ¬A + ¬B, ¬(A+B) = ¬A·¬B
Absorption: A + A·B = A, A·(A+B) = A
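Because the variables take only two values, any of these laws can be verified exhaustively by enumerating all input combinations; a small Python check (illustrative):

```python
from itertools import product

def laws_hold() -> bool:
    """Exhaustively check De Morgan's and absorption laws over all Boolean inputs."""
    for a, b in product([0, 1], repeat=2):
        if (1 - (a & b)) != ((1 - a) | (1 - b)):       # ¬(A·B) = ¬A + ¬B
            return False
        if (1 - (a | b)) != ((1 - a) & (1 - b)):       # ¬(A+B) = ¬A·¬B
            return False
        if (a | (a & b)) != a or (a & (a | b)) != a:   # absorption laws
            return False
    return True
```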
Canonical forms express any Boolean function as:
Sum of Products (SOP): OR of AND terms (minterms)
Product of Sums (POS): AND of OR terms (maxterms)
Karnaugh maps provide graphical minimization for up to six variables, grouping adjacent 1s to eliminate variables.
Logic gates implement Boolean operations in hardware using transistors as switches. Standard gate symbols and truth tables define behavior.
Basic gates:
AND gate: Two or more inputs, output high only when all inputs high. Symbol: D-shaped body.
OR gate: Output high when any input high. Symbol: curved input side, pointed output.
NOT gate (inverter): Single input, output opposite. Symbol: triangle with bubble at output.
NAND gate: AND followed by NOT, output low only when all inputs high. Universal gate. Symbol: AND with output bubble.
NOR gate: OR followed by NOT, output high only when all inputs low. Universal gate. Symbol: OR with output bubble.
XOR gate: Output high when inputs differ. Symbol: OR with double-curved input side.
XNOR gate: Output high when inputs equal. Symbol: XOR with output bubble.
Buffer: Output equals input, used for signal restoration and fanout. Symbol: triangle.
Gate characteristics:
Propagation delay: Time from input change to output response, determines maximum operating frequency.
Fan-in: Number of inputs a gate can handle, limited by technology.
Fan-out: Number of gates a gate output can drive, limited by current capability.
Power consumption: Dynamic (switching) and static (leakage) components.
Noise margin: Tolerance to voltage variations while maintaining correct logic levels.
Gate-level design combines gates to implement complex functions, minimizing gate count, delay, or power as design goals dictate.
Combinational circuits produce outputs depending only on current inputs, without memory or feedback. Outputs are Boolean functions of inputs.
Half adder adds two bits, producing sum and carry: Sum = A ⊕ B Carry = A · B
Full adder adds three bits (two operands plus carry-in), producing sum and carry-out: Sum = A ⊕ B ⊕ Cin Cout = (A·B) + (Cin·(A⊕B))
Ripple-carry adder chains full adders for multi-bit addition. Simple but slow due to carry propagation through all bits.
Carry lookahead adder accelerates addition by computing carries in parallel using generate (G = A·B) and propagate (P = A⊕B) signals:
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
Carry select adder computes sums with both carry possibilities (0 and 1), selecting correct result when actual carry arrives.
Carry save adder produces partial sum and carry vectors for adding multiple numbers, used in multipliers.
Subtractor implemented using two's complement: A - B = A + (¬B + 1). Adding an initial carry-in of 1 to adder implements subtraction.
Multiplexer (MUX) selects one of several inputs to output based on select lines. 2ⁿ inputs require n select lines.
2:1 MUX: Output = (S·A) + (¬S·B)
4:1 MUX: Four inputs, two selects, implemented as two-level 2:1 MUX tree or directly with AND-OR logic.
Applications: Data routing, function implementation (MUX can implement any Boolean function by connecting inputs to constants or variables), resource sharing.
Demultiplexer (DEMUX) routes single input to one of several outputs based on select lines. Inverse of multiplexer.
1:4 DEMUX: Input appears on selected output, other outputs zero.
Decoder converts n-bit binary code to 2ⁿ one-hot outputs (exactly one output active). Each output corresponds to unique input combination.
3:8 decoder: Input 101 activates output 5 (assuming 0-indexed).
Applications: Memory address decoding, instruction decoding, demultiplexing, function implementation (decoder plus OR gates implements any Boolean function).
Priority encoder handles multiple active inputs by encoding highest-priority input. Used in interrupt controllers, arbitration circuits.
4:2 priority encoder with input priority I3>I2>I1>I0: Outputs indicate highest-priority active input, valid signal indicates any input active.
Magnitude comparator determines relationship between two binary numbers (equal, greater, less).
Equality comparator: XNOR each corresponding bit, AND all results.
Magnitude comparison: Compute A>B by finding most significant bit where A and B differ, checking if A has 1 and B has 0 at that position.
Iterative comparator cascades stages for multi-bit comparison, each stage passing comparison results to next.
Sequential circuits incorporate memory elements, producing outputs depending on current inputs and stored state (past inputs).
SR Latch (Set-Reset): Basic memory element using cross-coupled NOR or NAND gates.
- S=1, R=0: Set Q=1
- S=0, R=1: Reset Q=0
- S=0, R=0: Hold previous state
- S=1, R=1: Invalid (both outputs low for NOR implementation)
D Latch (Data/Transparent): Eliminates SR invalid state. When enable=1, Q follows D (transparent); when enable=0, Q holds last value.
Edge-triggered flip-flops sample inputs only at clock edges, eliminating transparency period.
D flip-flop: Q updates to D on clock edge (rising or falling). Most common storage element in synchronous design.
JK flip-flop: J=set, K=reset, J=K=1 toggles output. More versatile than D but more complex.
T flip-flop: Toggles output when T=1 at clock edge, useful for counters.
Flip-flop characteristics:
- Setup time: Data must be stable before clock edge
- Hold time: Data must remain stable after clock edge
- Clock-to-Q delay: Time from clock edge to output change
- Asynchronous inputs (preset/clear) override clock
Register: Collection of flip-flops sharing common clock, storing multi-bit data.
Parallel load register: Loads all bits simultaneously on clock edge.
Shift register: Shifts bits left or right each clock cycle, supporting serial-to-parallel conversion, delay lines, and arithmetic.
Universal shift register: Configurable for parallel load, shift left, shift right, and hold operations.
Register file: Array of registers with read/write ports, used in processor register banks.
Ripple counter: Cascaded flip-flops where each stage clocks next, simple but slow due to cumulative delays.
Synchronous counter: All flip-flops clocked simultaneously, faster but more complex.
Binary counter: Counts through binary sequence, used for timing and addressing.
BCD counter: Counts 0-9 decimal, wraps to 0, used in decimal applications.
Up/down counter: Counts in either direction based on control input.
Modulo-N counter: Counts 0 to N-1, wraps, used for frequency division.
Finite state machines (FSMs) model sequential systems with finite number of states, transitions between states based on inputs, and outputs depending on state (Moore) or state and inputs (Mealy).
FSM components:
- State register (flip-flops) storing current state
- Next state logic (combinational) determining next state
- Output logic (combinational) generating outputs
Moore machine: Outputs depend only on current state. Simpler design, outputs stable throughout state.
Mealy machine: Outputs depend on current state and inputs. May respond faster, potentially fewer states, but outputs may have glitches.
FSM design process:
- Identify states and transitions
- Draw state diagram
- Create state transition table
- Assign binary state encodings
- Derive next state and output equations
- Implement with flip-flops and logic gates
Common applications: Control units, protocol handlers, sequence detectors, vending machine controllers.
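As an illustration of the Moore structure, here is a hypothetical detector for the bit sequence "101" (state names, encoding, and the overlapping behavior are invented for the example):

```python
# Moore machine detecting "101" with overlap.
# States: S0 = start, S1 = saw 1, S2 = saw 10, S3 = saw 101 (output 1).
NEXT = {
    ("S0", "0"): "S0", ("S0", "1"): "S1",
    ("S1", "0"): "S2", ("S1", "1"): "S1",
    ("S2", "0"): "S0", ("S2", "1"): "S3",
    ("S3", "0"): "S2", ("S3", "1"): "S1",
}
OUTPUT = {"S0": 0, "S1": 0, "S2": 0, "S3": 1}  # Moore: output depends only on state

def run_fsm(bits: str) -> list:
    """Step the machine over an input bit string, returning the output after each step."""
    state = "S0"
    outputs = []
    for b in bits:
        state = NEXT[(state, b)]       # next state logic
        outputs.append(OUTPUT[state])  # output logic (state only)
    return outputs
```

Feeding "10101" raises the output twice because the second "101" reuses the trailing "1" of the first.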
Programmable logic devices (PLDs) enable custom logic implementation without full custom fabrication.
PROM (Programmable Read-Only Memory): Fixed AND array (address decoder), programmable OR array. Implements any Boolean function in sum-of-products form.
PLA (Programmable Logic Array): Both AND and OR arrays programmable, more flexible than PROM but more complex.
PAL (Programmable Array Logic): Programmable AND array, fixed OR array. Faster and simpler than PLA, widely used in historical designs.
GAL (Generic Array Logic): Electrically erasable version of PAL, reprogrammable.
CPLD (Complex Programmable Logic Device): Multiple PAL-like blocks interconnected by programmable switch matrix, suitable for medium complexity designs.
Field-Programmable Gate Arrays (FPGAs) provide massive programmable logic capacity through configurable logic blocks and programmable interconnects.
FPGA architecture:
Configurable Logic Blocks (CLBs) contain lookup tables (LUTs) implementing small Boolean functions, flip-flops for sequential logic, and fast carry chains for arithmetic.
Lookup tables (typically 4-6 inputs) implement any Boolean function of their inputs by storing truth table in SRAM cells.
Programmable interconnects route signals between logic blocks through switch matrices and routing channels. Routing resources dominate FPGA area and determine achievable utilization.
I/O blocks interface with external signals, supporting various voltage standards and signaling rates.
Embedded blocks enhance capability:
- Block RAM for on-chip storage
- DSP slices for arithmetic (multiply-accumulate)
- Clock management (PLLs, DCMs)
- High-speed transceivers
- Processor cores (hard or soft)
Configuration technologies:
- SRAM-based: Volatile, reconfigurable, most common
- Flash-based: Non-volatile, less flexible
- Antifuse: One-time programmable, radiation-hardened
Design flow:
- Hardware description (VHDL/Verilog)
- Synthesis to gate-level netlist
- Technology mapping to LUTs and flip-flops
- Placement (assigning logic to specific locations)
- Routing (connecting placed elements)
- Bitstream generation and device programming
Applications: Rapid prototyping, low-volume production, hardware acceleration, telecommunications, signal processing, aerospace systems.
Fixed-point numbers represent real values with constant number of bits before and after binary point. Position of binary point is implicit, not stored.
Integer fixed-point: Binary point after least significant bit, representing only integers.
Fractional fixed-point: Binary point after most significant bit, representing values in [0,1) or symmetric ranges.
General fixed-point: Binary point at specified position, with integer part (I bits) and fractional part (F bits). Value = integer × 2⁻ᶠ.
Range and precision: I integer bits provide range up to 2ᴵ, F fractional bits provide precision of 2⁻ᶠ. Trade-off between range and precision.
Arithmetic operations: Addition/subtraction require operands aligned at same binary point position. Multiplication increases bit width (I₁+I₂ integer bits, F₁+F₂ fractional bits). Division may require careful scaling.
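The scaling rule for multiplication can be sketched in Python, assuming a hypothetical Q4.4 format (4 integer bits, 4 fractional bits); names are illustrative:

```python
F = 4  # fractional bits in the assumed Q4.4 format

def to_fixed(x: float) -> int:
    """Scale a real value into the fixed-point integer representation."""
    return round(x * (1 << F))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q4.4 values; the raw product has 2F fractional bits,
    so shift right by F to rescale back to Q4.4."""
    return (a * b) >> F

def to_float(x: int) -> float:
    """Interpret a Q4.4 integer as a real value."""
    return x / (1 << F)

# 2.5 * 1.5 = 3.75, exactly representable in Q4.4
```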
Applications: Digital signal processing, embedded systems where floating-point hardware too expensive or power-hungry, financial calculations requiring exact decimal representation.
Floating-point representation approximates real numbers with wide dynamic range by using scientific notation in binary: value = sign × significand × 2ᵉˣᵖᵒⁿᵉⁿᵗ.
IEEE 754 standardizes floating-point representation and arithmetic, ensuring portability across implementations.
Basic formats:
Single precision (32-bit):
- 1 bit sign (0 positive, 1 negative)
- 8 bits exponent (biased by 127)
- 23 bits significand (fraction part, with implicit leading 1)
Value = (-1)ˢⁱᵍⁿ × 1.fraction × 2⁽ᵉˣᵖᵒⁿᵉⁿᵗ⁻¹²⁷⁾
Double precision (64-bit):
- 1 bit sign
- 11 bits exponent (bias 1023)
- 52 bits significand
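The single-precision field layout can be unpacked with Python's standard `struct` module (the helper name is illustrative):

```python
import struct

def decode_float32(x: float):
    """Unpack the sign, biased exponent, and fraction fields of a float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret 4 bytes as uint32
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # biased by 127
    fraction = bits & ((1 << 23) - 1)   # implicit leading 1 for normal numbers
    return sign, exponent, fraction
```

For example, `decode_float32(1.0)` gives (0, 127, 0): the biased exponent 127 encodes 2⁰, and the all-zero fraction with the implicit leading 1 gives significand 1.0.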
Special values:
Zero: exponent and fraction all zero. Two representations (+0 and -0) for signed zero.
Denormalized numbers: exponent zero, fraction non-zero. Represent very small values near zero, gradually underflowing.
Infinity: exponent all ones, fraction zero. Result of overflow or division by zero.
NaN (Not a Number): exponent all ones, fraction non-zero. Result of invalid operations (0/0, ∞-∞, sqrt(-1)).
Range and precision:
Single precision: ~1.4×10⁻⁴⁵ to 3.4×10³⁸, approximately 7 decimal digits precision.
Double precision: ~5.0×10⁻³²⁴ to 1.8×10³⁰⁸, approximately 16 decimal digits precision.
Rounding modes control how inexact results are represented:
- Round to nearest (ties to even) - default
- Round toward +∞
- Round toward -∞
- Round toward zero
Exceptions flag exceptional conditions: inexact, underflow, overflow, divide by zero, invalid operation.
Character encoding maps characters to binary codes for storage and transmission.
American Standard Code for Information Interchange (ASCII) uses 7 bits to represent 128 characters:
- Control characters (0-31, 127)
- Printable characters (32-126): digits, letters, punctuation
Extended ASCII uses 8 bits (256 characters) adding accented letters, line drawing, and symbols, but numerous incompatible extensions exist.
Unicode provides universal character encoding covering all world's writing systems, with over 140,000 characters.
Encoding forms:
UTF-32: Fixed 32-bit per character, simple but space-inefficient.
UTF-16: Variable length (16 or 32 bits per character), common in Windows and Java.
UTF-8: Variable length (8-32 bits per character), backward compatible with ASCII, dominant on web and in Unix-like systems.
UTF-8 encoding:
- ASCII characters (U+0000 to U+007F): single byte 0xxxxxxx
- U+0080 to U+07FF: 2 bytes 110xxxxx 10xxxxxx
- U+0800 to U+FFFF: 3 bytes 1110xxxx 10xxxxxx 10xxxxxx
- U+10000 to U+10FFFF: 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
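The byte patterns above translate into a small encoder sketch (function name illustrative); each branch masks out 6-bit groups and prefixes them per the table:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 following the byte patterns above."""
    if cp <= 0x7F:                       # single byte 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

The results match Python's built-in encoder, e.g. for U+20AC (the euro sign) and a 4-byte emoji code point.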
Digital systems must detect and correct errors from noise, interference, or hardware faults.
Simple parity adds one bit to make total number of 1s even (even parity) or odd (odd parity). Detects single-bit errors, fails on even number of errors.
Two-dimensional parity arranges data in matrix, adding parity for rows and columns. Can detect and correct single-bit errors.
Hamming codes provide error correction capability through carefully placed parity bits covering overlapping subsets of data bits.
Hamming(7,4) code: 4 data bits, 3 parity bits. Parity bits at positions 1,2,4 (powers of two). Each parity bit covers data bits with that bit set in position number:
P1 covers bits 1,3,5,7 (positions with LSB=1)
P2 covers bits 2,3,6,7 (positions with second bit=1)
P4 covers bits 4,5,6,7 (positions with third bit=1)
Syndrome calculation: Recompute parity after transmission, bits where computed differs from received indicate error position.
Extended Hamming adds overall parity bit for double-error detection.
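The covering sets translate into a short Python sketch of Hamming(7,4) with even parity (function names illustrative; codeword positions are 1-based as in the text):

```python
def hamming74_encode(d: list) -> list:
    """Encode 4 data bits [d1,d2,d3,d4] into a 7-bit codeword (positions 1..7)."""
    c = [0] * 8                       # index 0 unused; positions 1..7
    c[3], c[5], c[6], c[7] = d        # data bits at non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]         # P1 covers positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]         # P2 covers positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]         # P4 covers positions 4,5,6,7
    return c[1:]

def hamming74_syndrome(code: list) -> int:
    """Return the 1-based position of a single-bit error, or 0 if none."""
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s4 * 4 + s2 * 2 + s1       # syndrome bits spell out the error position
```

Flipping any single codeword bit makes the syndrome equal that bit's position, so the error can be corrected by flipping it back.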
Cyclic Redundancy Check treats data as polynomial, computes remainder after division by generator polynomial. Highly effective for burst error detection.
CRC calculation: Data bits form polynomial coefficients, divided modulo-2 by generator polynomial, remainder is CRC.
Common generators:
- CRC-16: x¹⁶ + x¹⁵ + x² + 1
- CRC-32: x³² + x²⁶ + x²³ + x²² + x¹⁶ + x¹² + x¹¹ + x¹⁰ + x⁸ + x⁷ + x⁵ + x⁴ + x² + x + 1
Properties: Detects all single-bit errors, all double-bit errors (with appropriate generator), all odd number of errors, burst errors up to polynomial degree.
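A bit-serial CRC divider can be sketched in Python (names illustrative; no reflection, initial value, or final XOR, which deployed CRC variants often add):

```python
def crc_remainder(data: bytes, poly: int, width: int) -> int:
    """Remainder of the message (times x^width) modulo the generator polynomial.

    `poly` holds the generator's low bits; the leading x^width term is implicit.
    """
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):                 # message bits, MSB first
            top = (reg >> (width - 1)) & 1         # bit about to shift out
            reg = ((reg << 1) | ((byte >> i) & 1)) & mask
            if top:
                reg ^= poly                        # modulo-2 subtract the generator
    for _ in range(width):                         # append width zero bits
        top = (reg >> (width - 1)) & 1
        reg = (reg << 1) & mask
        if top:
            reg ^= poly
    return reg
```

With `poly = 0x8005` (x¹⁵ + x² + 1 plus the implicit x¹⁶ term) this corresponds to the CRC-16 generator above; appending the computed remainder to the message makes the remainder of the whole zero, which is the receiver's check.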
Addition/Subtraction: Two's complement addition uses same hardware for signed and unsigned, with overflow detection depending on interpretation.
Overflow detection for signed addition:
- Adding two positives yields negative → overflow
- Adding two negatives yields positive → overflow
- Otherwise no overflow
Hardware detects overflow when carry into sign bit ≠ carry out of sign bit.
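These rules can be sketched in Python for 8-bit operands (names illustrative); the overflow test checks that the operands share a sign the result lacks:

```python
def add_with_flags(a: int, b: int, width: int = 8):
    """Add two width-bit patterns; report carry (unsigned) and overflow (signed) flags."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask)
    result = total & mask
    carry = total > mask                               # carry out of MSB: unsigned overflow
    sign = 1 << (width - 1)
    # signed overflow: operands have the same sign and the result's sign differs
    overflow = bool(~(a ^ b) & (a ^ result) & sign)
    return result, carry, overflow

# 100 + 50 sets overflow (result pattern reads as -106 signed);
# 42 + (-42) sets carry but not overflow.
```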
Multiplication: n-bit × n-bit produces 2n-bit product.
Array multiplier implements grade-school algorithm with AND gates for partial products and adder array for summation. Regular structure but O(n²) gates.
Wallace tree multiplier reduces partial products using carry-save adders in tree structure, then final carry-propagate adder. Faster than array but irregular.
Booth multiplication (covered separately) handles signed numbers efficiently.
Division: More complex than multiplication, typically implemented through iterative algorithms.
Restoring division: Subtract divisor, if negative restore dividend, shift and continue.
Non-restoring division: Allow negative remainders, correct later, eliminating restore steps.
Booth's algorithm efficiently multiplies signed two's complement numbers by encoding multiplier to reduce partial products.
Basic Booth: Examines multiplier bits in pairs with overlap, recoding each pair to operations:
- 00: shift
- 01: add multiplicand
- 10: subtract multiplicand
- 11: shift
Radix-4 Booth (modified Booth) examines 3 bits at a time with 1-bit overlap, reducing partial products by half. Recoding table:
000: 0 × multiplicand
001: +1 × multiplicand
010: +1 × multiplicand
011: +2 × multiplicand
100: -2 × multiplicand
101: -1 × multiplicand
110: -1 × multiplicand
111: 0 × multiplicand
Advantages: Handles signed numbers naturally, reduces partial products for higher radix, works with two's complement.
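A minimal Python sketch of basic Booth recoding (function name illustrative); it scans the multiplier LSB-first against the previous bit, with an implicit 0 appended below the LSB:

```python
def booth_multiply(m: int, r: int, width: int = 8) -> int:
    """Basic Booth multiplication of two signed width-bit integers (a sketch)."""
    mask = (1 << width) - 1
    r_bits = r & mask                 # two's-complement pattern of the multiplier
    product = 0
    prev = 0                          # implicit 0 appended below the LSB
    for i in range(width):
        cur = (r_bits >> i) & 1
        if (cur, prev) == (1, 0):     # pair 10: subtract multiplicand, weight 2^i
            product -= m << i
        elif (cur, prev) == (0, 1):   # pair 01: add multiplicand, weight 2^i
            product += m << i
        # pairs 00 and 11: shift only
        prev = cur
    return product
```

Runs of 1s in the multiplier collapse into one subtraction and one addition, which is where Booth saves partial products.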
Restoring division algorithm:
Initialize remainder = 0
For i = n-1 down to 0:
    Shift remainder left, bringing in dividend bit i
    If remainder ≥ divisor:
        remainder = remainder - divisor
        quotient bit i = 1
    Else:
        quotient bit i = 0
Non-restoring division:
Initialize remainder = 0
For i = n-1 down to 0:
    Shift remainder left, bringing in dividend bit i
    If remainder ≥ 0: remainder = remainder - divisor
    Else: remainder = remainder + divisor
    Quotient bit i = 1 if remainder is now ≥ 0, else 0
Final correction step if remainder negative.
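A Python sketch of restoring division on unsigned operands, using the shift-in formulation where one dividend bit enters the remainder per step (names illustrative):

```python
def restoring_divide(dividend: int, divisor: int, width: int = 8):
    """Restoring division of unsigned integers; returns (quotient, remainder)."""
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down next bit
        remainder -= divisor                                  # trial subtraction
        if remainder < 0:
            remainder += divisor                              # restore
            quotient = quotient << 1                          # quotient bit 0
        else:
            quotient = (quotient << 1) | 1                    # quotient bit 1
    return quotient, remainder
```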
SRT division uses redundant quotient digit set (-1,0,1) and lookup table for quotient selection, enabling higher radix and faster operation.
Floating-point operations require handling significands and exponents separately, with normalization steps.
Addition/Subtraction:
- Align significands by shifting smaller exponent number right
- Add/subtract significands (including implicit leading 1)
- Normalize result (shift left until leading 1, adjust exponent)
- Round to fit precision
Multiplication:
- Add exponents (subtract bias once)
- Multiply significands (including implicit 1)
- Normalize product
- Round
- Determine sign (XOR of operand signs)
Division:
- Subtract exponents (add bias)
- Divide significands
- Normalize
- Round
- Determine sign
Challenges:
- Gradual underflow with denormalized numbers
- Exact rounding requires extra precision (guard, round, sticky bits)
- Exception handling for special values
- Fused multiply-add for accuracy and performance
ALU combines arithmetic and logical operations in single functional unit.
Typical operations:
- Arithmetic: ADD, SUB, ADC (add with carry), SBB (subtract with borrow)
- Logical: AND, OR, XOR, NOT
- Shift: logical left/right, arithmetic right shift
- Comparison: set flags (zero, carry, overflow, negative)
Flag generation:
- Zero: result all zeros
- Carry: unsigned overflow (carry out from MSB)
- Overflow: signed overflow (carry into MSB ≠ carry out)
- Negative: MSB of result (signed interpretation)
- Parity: XOR of result bits (even/odd)
Design approaches:
- Single ALU performing all operations with multiplexer selection
- Multiple functional units with result selection
- Pipelined ALU for higher throughput
Carry lookahead accelerates addition by computing carries in parallel using generate (G) and propagate (P) signals.
Definitions:
G_i = A_i · B_i (generate carry from this position)
P_i = A_i ⊕ B_i (propagate carry through this position)
Carry computation: C_i+1 = G_i + P_i · C_i
Expanding for 4-bit adder:
C1 = G0 + P0·C0
C2 = G1 + P1·G0 + P1·P0·C0
C3 = G2 + P2·G1 + P2·P1·G0 + P2·P1·P0·C0
C4 = G3 + P3·G2 + P3·P2·G1 + P3·P2·P1·G0 + P3·P2·P1·P0·C0
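These equations can be checked with a small Python sketch (names illustrative). It evaluates the recurrence C(i+1) = Gi + Pi·Ci; hardware gets its speed by instead evaluating the expanded forms for all carries in parallel:

```python
def cla_carries(a_bits, b_bits, c0):
    """All carries of a 4-bit carry-lookahead adder from G/P signals.

    a_bits, b_bits: lists of 4 bits, index 0 least significant.
    Returns [C1, C2, C3, C4].
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate signals
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate signals
    carries = []
    c = c0
    for i in range(4):
        c = g[i] | (p[i] & c)   # C(i+1) = Gi + Pi·Ci
        carries.append(c)
    return carries

# 0111 + 0001: the carry ripples through three positions, so carries are [1, 1, 1, 0]
```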
Hierarchical carry lookahead combines blocks to extend to wider words while managing fan-in.
Carry skip adder detects when all P_i in block are 1, allowing carry to skip block.
Overflow occurs when result exceeds representable range.
Unsigned overflow: Carry out from MSB indicates result too large for given width.
Signed overflow: Occurs when adding numbers with same sign yields opposite sign, or subtracting numbers with opposite signs yields sign equal to subtrahend.
Detection methods:
- Compare carry into and out of sign bit
- Check sign of operands and result
- Use saturation arithmetic (clamp at maximum/minimum)
Saturation arithmetic useful in multimedia and DSP, preventing wrap-around by limiting result to representable range.
Pipelined multipliers overlap partial product accumulation across multiple cycles.
Floating-point fused multiply-add (FMA) computes A×B+C with single rounding, improving accuracy and performance for dot products and polynomial evaluation.
Multiplier accumulation (MAC) units combine multiplication and addition for DSP applications.
Residue Number Systems represent numbers using multiple moduli, enabling parallel arithmetic without carries between moduli.
Redundant representations (carry-save, signed-digit) eliminate carry propagation during intermediate computations.
Reduced Instruction Set Computer (RISC) philosophy emphasizes simple, regular instructions that execute in single cycle.
RISC characteristics:
- Fixed instruction length (typically 32 bits)
- Load-store architecture (only loads/stores access memory)
- Large register file (32 or more general-purpose registers)
- Simple addressing modes
- Few instruction formats
- Hardwired control (minimal microcode)
- Pipelined execution optimized
Advantages: Simpler design, easier pipelining, compiler optimization opportunities, lower power consumption.
Examples: ARM, MIPS, RISC-V, PowerPC, SPARC
Complex Instruction Set Computer (CISC) provides powerful instructions that perform multiple operations.
CISC characteristics:
- Variable instruction length
- Memory operands allowed in many instructions
- Fewer registers (typically 8-16)
- Many addressing modes
- Complex instruction formats
- Microprogrammed control common
- Instructions may take multiple cycles
Advantages: Smaller code size, simpler compilers (historically), strong backward compatibility.
Examples: x86, 68000, VAX
Modern convergence: Contemporary x86 processors decode CISC instructions into RISC-like micro-operations for internal execution.
Instruction format defines how instruction bits encode operation, operands, and addressing information.
Common fields:
- Opcode: Specifies operation to perform
- Register specifiers: Source and destination registers
- Address/displacement: Memory address or offset
- Immediate constant: Constant value embedded in instruction
Format types:
Three-address: op dest, src1, src2 (RISC)
Example: ADD R1, R2, R3 (R1 = R2 + R3)
Two-address: op dest, src (dest also a source)
Example: ADD R1, R2 (R1 = R1 + R2)
One-address: accumulator machine
Example: ADD X (ACC = ACC + mem[X])
Zero-address: stack machine
Example: ADD (pop top two, push result)
Variable-length formats (CISC) pack different information for different instructions, saving code space but complicating decoding.
Fixed-length formats (RISC) simplify decoding and pipelining but may waste space for simple instructions.
Addressing modes specify how to compute effective address of operands.
Immediate: Operand value embedded in instruction
Example: ADD R1, #42 (R1 = R1 + 42)
Register direct: Operand in a register
Example: ADD R1, R2
Register indirect: Register contains address of operand
Example: LOAD R1, (R2) (R1 = mem[R2])
Displacement: Address = register + offset
Example: LOAD R1, 8(R2) (R1 = mem[R2+8])
Indexed: Address = base register + index register
Example: LOAD R1, (R2+R3)
Absolute/direct: Address in instruction
Example: LOAD R1, (1000)
PC-relative: Address = PC + offset; common for branches and position-independent code
Auto-increment/decrement: Register updated after/before use
Example: LOAD R1, (R2)+ (R1 = mem[R2]; R2 = R2 + 4)
Scaled index: Address = base + (index × scale)
Example: LOAD R1, (R2 + R3×4) for array access
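The modes above can be summarized as one illustrative dispatch function (mode and field names are hypothetical, chosen for this sketch rather than taken from any ISA):

```python
def effective_address(mode, regs, pc, **f):
    """Compute an effective address for common addressing modes (illustrative)."""
    if mode == "register_indirect":      # LOAD R1, (R2)
        return regs[f["base"]]
    if mode == "displacement":           # LOAD R1, 8(R2)
        return regs[f["base"]] + f["offset"]
    if mode == "indexed":                # LOAD R1, (R2+R3)
        return regs[f["base"]] + regs[f["index"]]
    if mode == "scaled_index":           # LOAD R1, (R2 + R3*4)
        return regs[f["base"]] + regs[f["index"]] * f["scale"]
    if mode == "pc_relative":            # branch target
        return pc + f["offset"]
    if mode == "absolute":               # LOAD R1, (1000)
        return f["addr"]
    raise ValueError(f"unknown mode: {mode}")
```

For instance, with R2 = 100 and R3 = 3, scaled-index access with scale 4 computes address 112, which is exactly the array-element calculation the mode exists to accelerate.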
Data movement: LOAD, STORE, MOVE
Arithmetic: ADD, SUB, MUL, DIV
Logical: AND, OR, XOR, NOT
Shift/rotate: SHL, SHR, SAR, ROL, ROR
Control transfer:
- Conditional branches (BEQ, BNE, etc.)
- Unconditional jumps (JMP)
- Subroutine call/return (CALL, RET)
Compare and test: CMP, TEST (set flags)
System instructions: HALT, interrupts, privileged operations
Floating-point: FADD, FMUL, etc.
SIMD/vector: Packed operations on multiple data elements
Stack machines: Operations use top of stack implicitly
- Instructions shorter (no operand specifiers)
- Simple compilers
- Stack depth limits performance
- Examples: Java Virtual Machine, Burroughs B5000
Accumulator machines: One implicit accumulator register
- Simple hardware
- Bottleneck for performance
- Examples: Early microprocessors
General-purpose register machines: Explicit register operands
- Compiler flexibility for optimization
- Register allocation critical
- Most common modern approach
x86 architecture, originating with 8086 in 1978, dominates personal computing and servers through backward compatibility and evolutionary enhancement.
Historical development:
- 8086 (1978): 16-bit, segmented addressing
- 80386 (1985): 32-bit, paging, protected mode
- Pentium (1993): Superscalar execution
- x86-64 (2003): 64-bit extension by AMD, adopted by Intel
Register set evolved from 8 general-purpose registers (AX, BX, CX, DX, SI, DI, BP, SP) with various specializations. 64-bit extension adds 8 more registers (R8-R15).
Addressing modes include register indirect, displacement, base+index, scaled index, and combinations.
Instruction formats are variable length (1-15 bytes) with complex encoding. Prefix bytes modify operation (repeat, lock, operand/address size).
CISC characteristics: Memory operands in arithmetic, string instructions, stack-oriented CALL/RET, condition codes.
Modern implementation: Decode to RISC-like micro-ops, out-of-order execution, register renaming.
ARM architecture, developed by Acorn Computers in the 1980s, dominates mobile and embedded systems through power efficiency and clean design.
Architecture versions:
- ARMv4/v5: Classic ARM
- ARMv6: SIMD extensions
- ARMv7: Thumb-2, NEON
- ARMv8: 64-bit (AArch64)
Key features:
- 16 or 32 general-purpose registers
- Load-store architecture
- Conditional execution on most instructions
- Barrel shifter integrated with arithmetic
- Thumb instruction set (16-bit) for code density
Conditional execution: Instructions execute only if condition flags (from previous compare) satisfied, reducing branch instructions.
Addressing modes: Register indirect with optional shift, pre/post-increment/decrement, PC-relative.
Thumb-2: Mix of 16 and 32-bit instructions balancing code density and performance.
RISC-V, developed at UC Berkeley, provides open, free ISA specification enabling academic and commercial implementation without licensing.
Base integer instruction sets:
- RV32I: 32-bit, 40 instructions
- RV64I: 64-bit extension
- RV128I: 128-bit (future)
Standard extensions:
- M: Integer multiplication and division
- F: Single-precision floating-point
- D: Double-precision floating-point
- A: Atomic operations
- C: Compressed (16-bit) instructions
- V: Vector operations
Modular design: Implementations can choose subset of extensions, enabling ultra-small embedded to high-performance applications.
Privileged architecture: Machine, supervisor, and user modes with control and status registers.
Open ecosystem: Free specification, multiple open-source implementations, growing software support.
CPU (Central Processing Unit) orchestrates instruction execution through coordinated operation of datapath and control.
Major components:
- Datapath: Registers, ALU, buses for data movement
- Control unit: Generates control signals sequencing operations
- Memory interface: Connects to cache and main memory
- I/O interface: Communicates with peripheral devices
Functional view: CPU repeatedly executes instruction cycle: fetch instruction from memory, decode, execute, possibly write results.
Registers provide fastest storage, directly accessible to instructions.
Programmer-visible registers:
- General-purpose registers: Hold operands and results
- Special-purpose registers: PC (program counter), SP (stack pointer), status/flag register
Internal registers (not programmer-visible):
- Instruction register (IR): Holds current instruction
- Memory data register (MDR): Data to/from memory
- Memory address register (MAR): Address for memory access
- Temporary registers: Hold intermediate values
Register file implementation:
- Multi-ported SRAM array for simultaneous reads/writes
- Read ports for operand fetch
- Write port for result storage
- Bypass paths to avoid read-after-write hazards
Datapath connects registers, ALU, and memory interface through buses, enabling data movement and transformation.
Single-bus datapath: All components connect to common bus. Simple but limited to one transfer per cycle.
Multi-bus datapath: Multiple buses enable parallel transfers. Common configurations:
- Two input buses to ALU, one output bus
- Dedicated buses for register file read ports
RISC datapath elements:
- Register file with multiple read/write ports
- ALU with operand registers
- Shifter (often integrated with ALU)
- Memory address calculator
- PC incrementer and branch adder
Datapath timing: Clock cycle determines when state elements update. Combinational logic between registers must settle within cycle.
Control unit generates signals directing datapath operations based on instruction decoding and sequencing.
Hardwired control: Control signals generated directly by combinational logic from the instruction opcode and current state.
Design process:
- Define control signals needed for each instruction
- Create state diagram for instruction execution
- Derive Boolean equations for each control signal
- Implement with logic gates
Advantages: Fast operation, suitable for RISC with simple instruction sets.
Disadvantages: Complex for large instruction sets, difficult to modify, design effort increases with instruction count.
Microprogrammed control: Control signals stored in control memory (microcode), sequenced by a microprogram counter.
Microinstruction format:
- Control signals for datapath (horizontal microcode)
- Next microaddress generation
- Condition select for branching
Microprogram sequencing:
- Sequential execution by default
- Conditional branches based on status flags
- Subroutine calls for common sequences
- Return to microcode dispatcher after instruction completion
Advantages: Flexible, easier to modify (change microcode), handles complex instructions naturally.
Disadvantages: Slower than hardwired (control memory access), microcode design complexity.
Contemporary use: Modern CISC processors often use combination: hardwired for common operations, microcode for complex instructions.
Instruction execution proceeds through sequential phases:
Fetch:
- Place PC on address bus
- Read instruction from memory into IR
- Increment PC (or compute next PC for branches)
Decode:
- Interpret opcode and addressing modes
- Read source registers from register file
- Compute effective addresses if needed
- Generate control signals for execute phase
Execute:
- Perform ALU operation
- Calculate branch target
- Access memory (load/store)
- Update condition flags
Write-back:
- Write result to destination register
- Update PC for taken branches
- Handle exceptions if occurred
Interrupt handling interleaved between instructions or at instruction boundaries.
Interrupts alter normal instruction flow to respond to asynchronous events.
Interrupt types:
- Hardware interrupts: External devices requesting service
- Software interrupts/traps: Programmed requests (system calls)
- Exceptions: Error conditions (divide by zero, page fault)
Interrupt handling process:
- Complete current instruction (or abort if precise exceptions)
- Save PC and processor state
- Identify interrupt source (vectored or polled)
- Jump to interrupt service routine (ISR)
- Execute ISR
- Restore state and return to interrupted program
Interrupt priorities: Higher-priority interrupts can preempt lower-priority handlers.
Interrupt masking: Disable interrupts during critical sections.
Exceptions are synchronous events caused by instruction execution.
Precise exceptions: Machine state saved such that interrupted instruction can be restarted after handling. Critical for virtual memory and debugging.
Imprecise exceptions: State not fully recoverable, used in some high-performance machines where precise handling too expensive.
Exception handling sequence:
- Detect exception condition during execution
- Abort or complete current instruction appropriately
- Save exception cause and program counter
- Vector to exception handler
- Handler resolves condition (e.g., page fault brings in page)
- Return to faulting instruction for retry
Pipelining overlaps instruction execution by dividing processing into stages, each handling different instruction simultaneously. Analogous to assembly line: while one instruction executes, next decodes, next fetches.
Throughput improvement: Ideal pipeline with n stages approaches n× throughput of single-cycle implementation, though latency for individual instruction increases slightly.
Pipeline depth trade-offs: Deeper pipelines enable higher clock frequencies but increase hazards and branch misprediction penalties.
Classic RISC pipeline (5 stages):
IF (Instruction Fetch):
- Access instruction cache using PC
- Update PC for sequential execution
ID (Instruction Decode):
- Decode instruction
- Read register operands
- Generate control signals
- Compute branch target (simplified)
EX (Execute):
- ALU operation
- Effective address calculation for memory access
- Condition evaluation for branches
MEM (Memory Access):
- Load/store to data cache
- Complete branch resolution
WB (Write Back):
- Write result to register file
Pipeline registers between stages hold intermediate results and control information.
Hazards prevent next instruction from executing in following clock cycle, causing stalls.
Structural hazards: Hardware resources insufficient to support every combination of instructions simultaneously in the pipeline.
Examples:
- Single memory for instructions and data (von Neumann bottleneck)
- Single register file write port when two instructions need write-back
- Floating-point unit shared among multiple pipelines
Resolution:
- Add hardware (separate caches, more ports)
- Stall pipeline until resource available
Data hazards: An instruction depends on a result of a previous instruction that is not yet available.
Types:
Read After Write (RAW): True dependency
I1: ADD R1, R2, R3
I2: SUB R4, R1, R5 (needs R1 from I1)
Write After Read (WAR): Anti-dependency
I1: ADD R1, R2, R3
I2: SUB R2, R4, R5 (must not write R2 before I1 reads it)
Write After Write (WAW): Output dependency
I1: ADD R1, R2, R3
I2: SUB R1, R4, R5 (both write R1)
Resolution:
- Forwarding/bypass (pass result directly from EX to next EX)
- Stalling (insert bubbles) when forwarding insufficient
- Register renaming for WAR/WAW
Control hazards: Branch instructions change the PC, potentially fetching wrong-path instructions.
Branch penalty: Wrong-path instructions must be flushed.
Resolution:
- Branch prediction (guess direction)
- Delayed branches (execute instruction in branch delay slot)
- Predicated execution (conditional instructions)
Hazard detection unit monitors instructions in pipeline, comparing register specifiers to detect dependencies.
Detection logic for a RAW stall (simplified, checking both source registers):
if (ID/EX.RegisterRd != 0) and (ID/EX.RegisterRd == IF/ID.RegisterRs1 or ID/EX.RegisterRd == IF/ID.RegisterRs2) then stall
Stall insertion:
- Freeze PC (stop IF)
- Insert bubble in ID/EX register (set control to 0)
- Continue after hazard resolved
Forwarding paths route results from later stages to earlier stage inputs:
- EX/MEM result to EX input (most common)
- MEM/WB result to EX input
- MEM result to MEM input (for load-use)
Load-use hazard: Load instruction result needed by the next instruction:
I1: LW R1, 0(R2)
I2: ADD R3, R1, R4
Result from load available after MEM stage, but I2 needs it in EX. Requires one stall cycle even with forwarding.
Forwarding implementation:
- Multiplexers at ALU inputs select from:
- Register file output
- EX/MEM result
- MEM/WB result
- Control logic selects appropriate source based on hazard detection
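The selection priority can be sketched as a small function, modeling the pipeline registers as dicts with reg_write and rd fields (field names invented for this sketch; R0 is assumed hardwired to zero, as in MIPS/RISC-V):

```python
def forward_select(ex_mem, mem_wb, rs):
    """Pick the ALU operand source for source register rs.
    The EX/MEM result takes priority because it is the newest value."""
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return "EX/MEM"      # forward result just computed in EX
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return "MEM/WB"      # forward result about to be written back
    return "REGFILE"         # no hazard: read the register file normally
```

The priority order matters: if both pipeline registers hold a pending write to rs, the EX/MEM copy is the most recent in program order.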
Branch prediction guesses branch direction to avoid stalls.
Static prediction:
- Always predict taken/not-taken
- Backward taken, forward not-taken (loop heuristic)
- Compiler hint bit
Dynamic prediction uses branch history:
1-bit predictor: Remember last outcome, predict same next time. Fails for loops (miss on first and last iteration).
2-bit saturating counter: Four states (strongly not-taken, weakly not-taken, weakly taken, strongly taken). Better for loops, requires 2 mispredictions to change.
Branch history table (BHT): Array of 2-bit counters indexed by branch address lower bits.
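A 2-bit saturating counter is small enough to simulate directly; the loop below replays two runs of a 10-iteration loop branch (a sketch of one BHT entry, not any particular machine's predictor):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 2                       # start in weakly-taken
    def predict(self):
        return self.state >= 2
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# Two executions of a loop branch: taken 9 times, then not-taken at exit.
outcomes = [True] * 9 + [False] + [True] * 9 + [False]
hits = sum(p.predict() == t or p.update(t) for t in outcomes if (p.update(t) is None) is False) if False else 0
hits = 0
for t in outcomes:
    hits += (p.predict() == t)
    p.update(t)
```

The predictor mispredicts only the two loop exits (18 of 20 correct); a 1-bit predictor would additionally mispredict the first iteration of the second run.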
Correlating predictors combine branch history with branch address:
- (m,n) predictor: m-bit global history, n-bit counter per history pattern
- Captures correlation between consecutive branches
Tournament predictor: Combines multiple predictors, selects best based on recent performance (Alpha 21264).
Branch target buffer (BTB): Cache of previously taken branches storing target address, avoiding target computation delay.
Return address stack (RAS): Special predictor for subroutine returns, pushing return address on call, popping on return.
Superscalar processors issue multiple instructions per cycle, exploiting instruction-level parallelism (ILP).
Issue width: Number of instructions that can be issued simultaneously (2-way, 4-way, etc.)
In-order superscalar:
- Fetch multiple instructions
- Check dependencies among them
- Issue independent instructions to functional units
- Maintain program order for dependent instructions
Out-of-order superscalar:
- Decode instructions into reservation stations
- Execute when operands ready (register renaming handles name dependencies)
- Reorder completion for precise exceptions
Functional units:
- Multiple ALUs
- Multiple load/store units
- Floating-point units
- Branch units
Resource requirements:
- Multi-ported register file
- Multiple memory access ports
- Complex bypass network
- Instruction queue/dispatch logic
Out-of-order execution maximizes ILP by executing instructions as soon as operands available, regardless of original program order.
Key components:
Register renaming eliminates WAR/WAW hazards by mapping architectural registers to physical registers. Each write creates new physical register mapping.
Reorder buffer (ROB) maintains program order for completion:
- Instructions enter ROB in order
- Results written to ROB when execution completes
- Instructions commit to architectural state in order
- Enables precise exceptions
Reservation stations buffer instructions waiting for operands:
- Hold instruction and operands (or tags for pending operands)
- Monitor common data bus for results
- Issue to functional unit when all operands ready
Issue logic selects ready instructions for execution:
- Wakeup: Identify instructions whose operands become ready
- Select: Choose among ready instructions for available units
Common data bus (CDB) broadcasts results to all reservation stations and ROB, enabling wakeup.
Speculative execution allows execution beyond unresolved branches:
- State maintained in ROB
- Flush incorrect path on misprediction
- Branch prediction critical for performance
Memory disambiguation handles load/store ordering:
- Loads may execute before preceding stores if addresses known
- Store forwarding bypasses store value to dependent load
- Mispredicted memory order requires recovery
Memory hierarchy exploits locality to approximate speed of fastest memory at cost of slowest.
Temporal locality: Recently accessed items likely accessed again soon (loops, reuse).
Spatial locality: Items near recently accessed likely accessed soon (arrays, sequential code).
Hierarchy levels (fast to slow, small to large, expensive to cheap):
- Registers
- L1 cache (typically 16-64KB)
- L2 cache (256KB-1MB)
- L3 cache (2-32MB)
- Main memory (DRAM, 4GB-1TB)
- Solid-state disk (SSD, 128GB-4TB)
- Magnetic disk (HDD, 500GB-20TB)
Principle: Copy frequently used data from slower to faster levels automatically (hardware managed caches) or under software control (virtual memory).
Cache stores subset of main memory contents for faster access. Cache operation transparent to programmer.
Cache organization parameters:
- Capacity (C): total data bytes
- Block/line size (B): bytes transferred on miss
- Associativity (A): number of possible locations per address
Address breakdown:
- Block offset: selects byte within block (log₂ B bits)
- Index: selects cache set (log₂ (number of sets) bits)
- Tag: compared to identify block
Direct mapping: Each memory block maps to exactly one cache location.
Mapping: (block address) mod (number of cache blocks)
Advantages: Simple, fast access (single tag comparison), low hardware cost.
Disadvantages: Conflict misses when multiple frequently used blocks map to same location.
Example: 64KB cache, 16-byte blocks, 4096 blocks. Memory address bits: 4 offset, 12 index, remaining tag.
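The address breakdown for this example can be verified with a short helper (function name hypothetical; defaults match the 64KB, 16-byte-block cache above):

```python
def split_address(addr, block_size=16, num_blocks=4096):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache."""
    offset_bits = block_size.bit_length() - 1    # log2(16)  = 4
    index_bits = num_blocks.bit_length() - 1     # log2(4096) = 12
    offset = addr & (block_size - 1)             # byte within block
    index = (addr >> offset_bits) & (num_blocks - 1)   # which cache line
    tag = addr >> (offset_bits + index_bits)     # remaining high bits
    return tag, index, offset
```

For address 0x12345678 this gives tag 0x1234, index 0x567, offset 0x8; two addresses with equal index but different tags conflict in a direct-mapped cache.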
Fully associative: Any memory block can occupy any cache location.
Advantages: No conflict misses, optimal utilization.
Disadvantages: Complex hardware (compare tag with all entries), expensive for large caches.
Replacement policy needed: Choose which block to evict.
Set associative: A compromise in which the cache is divided into sets, each containing multiple blocks. A memory block maps to a specific set and can occupy any block within that set.
n-way set associative: Each set has n blocks.
Address mapping: (block address) mod (number of sets) determines set.
Advantages: Reduces conflict misses compared to direct-mapped, simpler than fully associative.
Trade-off: Higher associativity reduces misses but increases access time and complexity.
Replacement policies: When a miss occurs in an associative cache, a block within the set must be chosen for eviction.
Random: Simple hardware, unpredictable performance.
FIFO (First-In, First-Out): Evict oldest block. Simple, but may evict frequently used blocks.
LRU (Least Recently Used): Evict block unused longest. Good locality exploitation, complex for high associativity.
Pseudo-LRU: Approximate LRU with less hardware (e.g., NRU - Not Recently Used).
LFU (Least Frequently Used): Evict least accessed block. May keep old blocks with high frequency counts.
Belady's optimal: Evict block used furthest in future. Theoretical maximum, unrealizable.
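LRU behavior for a single set can be modeled with an ordered map (a software sketch; real hardware uses per-way age bits or pseudo-LRU trees rather than a list):

```python
from collections import OrderedDict

def simulate_lru_set(ways, accesses):
    """Simulate one n-way set with LRU replacement; returns the hit count."""
    cache = OrderedDict()          # tag -> None; most recently used at the end
    hits = 0
    for tag in accesses:
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)         # touch: mark as most recently used
        else:
            if len(cache) == ways:
                cache.popitem(last=False)  # evict least recently used
            cache[tag] = None
    return hits
```

The trace A, B, A, C, B in a 2-way set hits only once: C evicts B (the LRU block), so the final B misses again.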
Write-through: Write to cache and memory simultaneously.
- Simple, memory always consistent
- High bandwidth usage, slow
Write-back: Write only to cache, mark dirty, write to memory when evicted.
- Reduces memory traffic
- Complex (dirty bits, writeback on eviction)
Write allocate: On write miss, load block into cache, then write.
No-write allocate: Write directly to memory, bypass cache.
Common combinations:
- Write-through with no-write allocate
- Write-back with write allocate
Multiple cache levels hide main memory latency.
Inclusive cache: Higher level contains subset of lower level. Simplifies coherence but wastes capacity.
Exclusive cache: Block present in at most one level. Better capacity utilization but more complex.
Typical hierarchy:
- L1: Small, fast, separate instruction/data
- L2: Medium, unified
- L3: Large, unified, shared among cores
Miss penalties: L1 miss (~10 cycles) to L2, L2 miss (~30 cycles) to L3, L3 miss (~200 cycles) to memory.
In multi-core systems, caches may contain stale copies when one core modifies shared data.
Coherence protocols maintain consistency:
Snooping protocols: All caches monitor bus transactions.
- MSI protocol: Modified, Shared, Invalid states
- MESI (Illinois) protocol: adds Exclusive state (clean, only in this cache)
- MOESI: adds Owned state (modified but shared)
Directory-based protocols: Central directory tracks cache copies.
- Scalable to many cores
- Directory stores presence bits
- Messages sent only to relevant cores
False sharing: Different cores access different data in same cache block, causing unnecessary coherence traffic.
Virtual memory provides illusion of large, contiguous address space using physical memory and disk storage.
Page: Fixed-size unit of memory (typically 4KB)
Page table: Maps virtual pages to physical frames
Page fault: Access to page not in physical memory triggers OS intervention to load from disk
Page table entry (PTE) contains:
- Physical frame number
- Present bit (in memory?)
- Dirty bit (modified?)
- Accessed bit (referenced?)
- Permission bits (read/write/execute)
- Caching attributes
Multi-level page tables reduce memory overhead for sparse address spaces:
- Page directory points to page tables
- Only allocate page tables for used regions
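A two-level walk can be sketched as follows, assuming an x86-32-style 10/10/12 bit split and tables modeled as nested dicts (a simplification: real PTEs also carry present, dirty, and permission bits):

```python
PAGE_BITS = 12    # 4 KB pages
LEVEL_BITS = 10   # 10-bit directory index, 10-bit table index

def walk(page_directory, vaddr):
    """Two-level translation; returns a physical address, or None on page fault."""
    dir_idx = (vaddr >> (PAGE_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    tbl_idx = (vaddr >> PAGE_BITS) & ((1 << LEVEL_BITS) - 1)
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    table = page_directory.get(dir_idx)
    if table is None:
        return None               # directory entry absent -> page fault
    frame = table.get(tbl_idx)
    if frame is None:
        return None               # PTE absent -> page fault
    return (frame << PAGE_BITS) | offset
```

The sparseness benefit is visible in the data structure itself: only directories and tables for used regions exist at all.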
Segmentation: Variable-sized memory regions with base and limit registers.
Segmentation advantages:
- Natural protection boundaries
- Supports sparse address spaces
- Simplifies sharing
Segmentation disadvantages:
- External fragmentation
- Complex address translation
- Limited segment count
Many systems combine paging and segmentation (x86 protected mode).
When memory full, select page to evict to disk.
FIFO: Evict oldest page. Simple but may evict frequently used pages (Belady's anomaly: more frames can increase misses).
Second chance (clock algorithm): FIFO with reference bit. Give pages second chance if referenced recently.
LRU: Evict least recently used. Good performance, expensive to implement exactly.
Approximate LRU: Use reference bits and aging.
Working set: Pages currently in use. Thrashing occurs when working set exceeds available memory.
Translation lookaside buffer (TLB): Small, fast cache for page table entries, avoiding a memory access for each translation.
TLB operation:
- Virtual address presented to TLB
- If hit, physical frame retrieved directly
- If miss, hardware or software walks page tables, loads entry into TLB
TLB coverage: TLB entries × page size. Large pages (huge pages) increase coverage.
TLB misses expensive (10-100 cycles). TLB miss handlers:
- Hardware walk: dedicated state machine
- Software walk: OS routine (simpler hardware, more flexible)
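The TLB-first lookup path can be sketched as a dict-based model (single-level walk for brevity; a real handler also manages TLB evictions and permission checks):

```python
def translate(tlb, page_table, vaddr, page_size=4096):
    """TLB-first translation; on a miss, walk the page table and fill the TLB."""
    vpn, offset = divmod(vaddr, page_size)
    if vpn in tlb:
        return tlb[vpn] * page_size + offset, "hit"
    frame = page_table[vpn]        # simplified one-level walk
    tlb[vpn] = frame               # fill the TLB for subsequent accesses
    return frame * page_size + offset, "miss"
```

Repeating an access to the same page turns the expensive walk into a fast hit, which is the entire point of the structure.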
Protection mechanisms prevent unauthorized access.
Protection bits per page/segment:
- Read/write/execute permissions
- User/supervisor mode distinction
Protection faults trap to OS when violations occur.
ASLR (Address Space Layout Randomization): Randomize memory layout to thwart exploits.
Hardware component performing address translation and protection checks.
MMU components:
- TLB
- Page table walker
- Protection logic
- Cache control attributes
Address translation flow:
- Virtual address from CPU
- TLB lookup
- If miss, walk page tables (hardware or software)
- Check permissions
- Generate physical address for cache/memory
Hard disk drives (HDDs): Traditional rotating storage using magnetic media.
Components:
- Platters: Circular disks coated with magnetic material
- Spindle: Rotates platters (5400-15000 RPM)
- Heads: Read/write data, one per surface
- Actuator: Moves heads across platters
Performance parameters:
- Seek time: Move heads to cylinder (3-15 ms)
- Rotational latency: Wait for sector under head (half rotation time)
- Transfer time: Read/write data (depends on density and rotation)
Addressing: CHS (cylinder, head, sector) or LBA (logical block addressing)
Redundant Array of Independent Disks combines multiple disks for performance and reliability.
RAID levels:
RAID 0 (striping): Data striped across disks, no redundancy. Best performance, no fault tolerance.
RAID 1 (mirroring): Data duplicated on two disks. Good read performance, 50% capacity overhead.
RAID 5: Block-level striping with distributed parity. Tolerates single disk failure, good read performance, write penalty (read-modify-write for parity).
RAID 6: Dual parity, tolerates two failures. More write overhead.
RAID 10 (1+0): Mirrored stripes. Combines RAID 0 performance with RAID 1 redundancy.
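Parity-based reconstruction is plain XOR; the sketch below keeps parity on a single disk for clarity, whereas real RAID 5 rotates parity across all disks:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (RAID parity computation)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# Data striped across three disks; parity stored on a fourth.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = xor_blocks(data)
# Disk 1 fails: XOR of the survivors and parity recovers its contents.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, the same operation both generates parity and rebuilds any single lost block, which is why RAID 5 tolerates exactly one failure.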
Solid-state drives (SSDs): Flash-based storage with no moving parts.
NAND flash types:
- SLC (Single-Level Cell): 1 bit/cell, fastest, most durable
- MLC (Multi-Level Cell): 2 bits/cell, good balance
- TLC (Triple-Level Cell): 3 bits/cell, cheaper, slower
- QLC (Quad-Level Cell): 4 bits/cell, highest capacity, lowest endurance
Characteristics:
- No seek time or rotational latency
- Fast random access
- Limited write cycles (wear leveling required)
- Read/write asymmetry (reads faster)
- Erase before write (garbage collection)
Performance metrics: Sequential/random read/write, IOPS, latency
SATA (Serial ATA): Traditional HDD/SSD interface
- Up to 600 MB/s (SATA 3.0)
- AHCI protocol (designed for HDDs)
- Command queue depth limited
NVMe (Non-Volatile Memory Express): Designed for SSDs
- Direct PCIe connection (multiple lanes)
- Up to 8 GB/s (PCIe 4.0 x4)
- High queue depth (64K queues, 64K commands/queue)
- Low latency, parallelism
Form factors: 2.5-inch, M.2, U.2, add-in cards
File systems organize and manage storage.
Key concepts:
- Files: Named data containers
- Directories: Hierarchical organization
- Metadata: File attributes (size, permissions, timestamps)
- Allocation: How file data stored on disk
Common structures:
- FAT: File allocation table (simple, legacy)
- NTFS: Journaling, security, compression
- ext4: Linux standard, extents, journaling
- APFS: Apple's modern file system
Journaling: Records pending changes for recovery after crash.
I/O interfaces connect peripheral devices to computer system.
Interface functions:
- Data transfer
- Status reporting
- Command interpretation
- Address decoding
- Protocol conversion
I/O addressing:
- Memory-mapped I/O: Devices appear in memory address space
- Isolated I/O: Separate I/O address space, special instructions
Programmed I/O: The CPU directly controls data transfers, checking device status and moving each byte.
Process:
- CPU reads device status
- If ready, CPU reads/writes data
- Repeat for next byte
Advantages: Simple, no special hardware.
Disadvantages: CPU busy during transfers, poor for high-speed devices.
Interrupt-driven I/O: The device interrupts the CPU when ready, allowing overlap of I/O and computation.
Process:
- CPU initiates I/O operation
- CPU continues other work
- Device interrupts when ready
- ISR handles data transfer
- Return to interrupted program
Interrupt handling overhead: Context save/restore, ISR execution.
DMA controller transfers data directly between memory and device without CPU involvement.
DMA operation:
- CPU programs DMA controller (source, destination, count)
- DMA controller transfers data
- DMA interrupts CPU when complete
Cycle stealing: DMA controller takes bus cycles when CPU not using them.
Scatter-gather DMA: Handles non-contiguous memory regions using descriptor lists.
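A scatter operation over a descriptor list can be modeled in a few lines (a software sketch of what the DMA engine does without CPU involvement; the (address, length) descriptor format is invented here):

```python
def sg_dma_copy(memory, descriptors, src):
    """Scatter a contiguous source buffer into non-contiguous memory regions,
    walking a list of (address, length) descriptors as a DMA engine would."""
    pos = 0
    for addr, length in descriptors:
        memory[addr:addr + length] = src[pos:pos + length]
        pos += length
    return pos    # total bytes transferred; engine would raise an interrupt here
```

A 5-byte buffer scattered by descriptors [(0, 2), (8, 3)] lands in two separate regions with no intermediate copy, which is what lets the OS hand the device page-sized fragments of a virtually contiguous buffer.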
PCI Express (PCIe): Serial point-to-point interconnect dominating modern systems.
Architecture:
- Root complex connects CPU/memory
- Switches create fan-out
- Endpoints are devices
- Lanes (1, 2, 4, 8, 16) per link
Transaction types: Memory read/write, I/O, configuration, messages
Performance: Each lane ~2 GB/s per direction (PCIe 4.0); aggregate bandwidth scales with lane count
Features: Hot plug, power management, error handling
Universal Serial Bus for peripheral connection.
Versions:
- USB 2.0: 480 Mbps
- USB 3.0: 5 Gbps (SuperSpeed)
- USB 3.1: 10 Gbps
- USB 3.2: 20 Gbps
- USB4: 40 Gbps (Thunderbolt compatible)
Topology: Tiered star with root hub, hubs, devices
Transfer types:
- Control: Configuration
- Bulk: Large data transfers
- Interrupt: Periodic polling
- Isochronous: Real-time (audio/video)
Memory-mapped I/O: Devices appear as memory locations, accessed with normal load/store instructions.
Advantages:
- No special I/O instructions
- All memory access methods available (caching, protection)
- Unified programming model
Considerations:
- Address space consumption
- Caching must be disabled for device memory
- Memory ordering constraints
Classification of parallel architectures based on instruction and data streams.
SISD (Single Instruction, Single Data): Traditional uniprocessor
- One instruction stream, one data stream
- Example: Single-core processor
SIMD (Single Instruction, Multiple Data): Same operation on multiple data elements
- Vector processors, GPU cores
- Example: Intel MMX/SSE/AVX, ARM NEON
MISD (Multiple Instruction, Single Data): Rare
- Multiple instructions on same data
- Example: Some fault-tolerant systems
MIMD (Multiple Instruction, Multiple Data): Most parallel systems
- Multiple processors executing different instructions on different data
- Examples: Multi-core CPUs, clusters
Vector processors: Operate on vectors of data
- Vector registers hold multiple elements
- Vector instructions specify operation on whole vector
- Pipelined functional units
SIMD extensions in general-purpose CPUs:
- MMX (64-bit integers)
- SSE (128-bit floating-point)
- AVX (256-bit), AVX-512 (512-bit)
- Auto-vectorization by compilers
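Lane-wise arithmetic can be modeled by slicing a wide word, as in an MMX-style packed 16-bit add (an illustrative model written for this note, not an intrinsic):

```python
def packed_add16(x, y):
    """Add four 16-bit lanes packed in 64-bit words, wrapping per lane.
    No carries cross lane boundaries -- the defining SIMD property."""
    result = 0
    for lane in range(4):
        a = (x >> (16 * lane)) & 0xFFFF
        b = (y >> (16 * lane)) & 0xFFFF
        result |= ((a + b) & 0xFFFF) << (16 * lane)
    return result
```

Note that a lane overflowing (0xFFFF + 1) wraps to 0x0000 without disturbing its neighbors; hardware performs all four lane additions in one instruction.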
GPU SIMT (Single Instruction, Multiple Threads):
- Warps/wavefronts of threads execute same instruction
- Hardware handles divergence
Multiprocessors: Multiple processors executing independently.
Shared memory: All processors access common address space
- UMA (Uniform Memory Access): Equal access time
- NUMA (Non-Uniform Memory Access): Access time depends on location
Distributed memory: Each processor has private memory, communicate via messages
Graphics Processing Units evolved into general-purpose parallel processors.
Architecture:
- Many streaming multiprocessors (SMs)
- Each SM has many CUDA cores
- Hierarchical memory: registers, shared memory, L1/L2 cache, global memory
- Warp scheduler: Groups 32 threads executing together
Programming models:
- CUDA (NVIDIA)
- OpenCL (open standard)
- ROCm (AMD)
- SYCL (C++ abstraction)
Applications: Scientific computing, machine learning, image processing, cryptocurrency mining
Multicore processors: Multiple CPU cores on a single chip sharing resources.
Homogeneous: All cores identical (most desktop/server CPUs)
Heterogeneous: Different core types
- ARM big.LITTLE: Performance cores + efficiency cores
- Intel hybrid architecture (P-cores + E-cores)
Shared resources:
- Last-level cache
- Memory controller
- I/O interfaces
- Power management
Exploiting parallelism through multiple threads.
Multithreading types:
- Coarse-grained: Switch threads on long latency events (cache misses)
- Fine-grained: Interleave threads each cycle
- Simultaneous (SMT/Hyper-Threading): Execute multiple threads on same core
SMT implementation:
- Duplicate architectural state (registers, PC)
- Share execution resources
- Issue from multiple threads each cycle
OpenMP (Open Multi-Processing):
- Directive-based shared-memory parallel programming
- #pragma omp parallel for parallelizes loops
- Thread team executes parallel regions
- Worksharing constructs distribute iterations
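A minimal OpenMP sketch of the constructs above (function and data are illustrative). Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result:

```c
#include <stddef.h>

/* Sum of squares with a worksharing loop. The pragma distributes
 * iterations across the thread team; reduction(+:sum) gives each thread
 * a private partial sum that is combined when the region ends. */
double sum_of_squares(const double *x, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)  /* signed index for older OpenMP */
        sum += x[i] * x[i];
    return sum;
}
```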
CUDA (Compute Unified Device Architecture):
- GPU programming model
- Kernels: Functions executed on GPU
- Thread hierarchy: Grid → blocks → threads
- Memory hierarchy: Global, shared, local, constant
- Synchronization primitives
Multiple processors share single address space.
UMA (Uniform Memory Access):
- All memory equally distant from all processors
- Bus-based or crossbar interconnect
- Limited scalability (bus saturation)
NUMA (Non-Uniform Memory Access):
- Memory divided into local and remote
- Access local memory faster than remote
- Directory-based coherence
- Scalable to many processors
Each processor has private memory, communicate via messages.
Message passing:
- Send/receive operations
- Synchronous or asynchronous
- Collective operations (broadcast, reduce, barrier)
MPI (Message Passing Interface):
- Standard library for distributed computing
- Point-to-point and collective communication
- Widely used in HPC
Connect processors and memory in multiprocessor systems.
Topologies:
- Bus: Simple, limited scalability
- Crossbar: Full connectivity, expensive
- Ring: Simple, moderate scalability
- Mesh: 2D grid, good locality
- Torus: Mesh with wrap-around
- Hypercube: Logarithmic diameter
- Tree: Hierarchical, root bottleneck
- Fat tree: More bandwidth near root
Switching:
- Circuit switching: Path established for duration
- Packet switching: Messages divided into packets
- Wormhole routing: Flits forwarded as soon as the header arrives, without buffering whole packets
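The hypercube's logarithmic diameter follows from its addressing: nodes are d-bit IDs, neighbors differ in exactly one bit, so the minimal path length is the Hamming distance of the two IDs. A small sketch:

```c
/* Hypercube distance: XOR the node IDs and count the set bits.
 * In a 2^d-node hypercube the worst case (all bits differ) is d hops,
 * giving the logarithmic diameter noted above. */
int hypercube_hops(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;   /* bits where the IDs disagree */
    int hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```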
Maintain consistency across multiple caches.
Snooping protocols (bus-based):
- All caches monitor bus transactions
- MESI protocol common
- Write-invalidate: Other copies invalidated on write
- Write-update: Other copies updated (more traffic)
Directory protocols (scalable):
- Directory tracks which caches have each block
- Messages sent only to relevant caches
- Scalable to many processors
Coherence states:
- Modified: Dirty, exclusive
- Exclusive: Clean, only this cache
- Shared: Clean, possibly multiple caches
- Invalid: Not present or stale
- Owned: Modified but shared (MOESI)
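The MESI transitions can be sketched as a next-state function for one cache line. This is a simplification: a real protocol also transfers data, writes back dirty lines, and asserts bus signals, which are omitted here.

```c
/* Simplified MESI next-state function for a single cache line.
 * Events: the local processor's read/write, and snooped bus reads/writes
 * issued by other caches. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } MesiEvent;

MesiState mesi_next(MesiState s, MesiEvent e, int other_caches_have_copy) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)  /* read miss: E if sole copy, else S */
            return other_caches_have_copy ? SHARED : EXCLUSIVE;
        return s;          /* hits in S/E/M keep their state */
    case LOCAL_WRITE:
        return MODIFIED;   /* write-invalidate: other copies invalidated */
    case SNOOP_READ:       /* another cache reads: demote to Shared */
        return (s == INVALID) ? INVALID : SHARED;
    case SNOOP_WRITE:      /* another cache writes: this copy is stale */
        return INVALID;
    }
    return s;
}
```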
Non-Uniform Memory Access with locality-aware optimization.
Cache-coherent NUMA (ccNUMA):
- Hardware maintains coherence across nodes
- Remote access slower than local
- Operating system attempts to allocate local memory
Programming considerations:
- Data placement affects performance
- Thread migration to data location
- First-touch policy (allocate memory where first accessed)
Execute instructions before knowing they're needed (past branches).
Branch speculation: Execute along predicted path
- State maintained for recovery
- Flush incorrect path on misprediction
Memory speculation: Loads before preceding stores
- Load queue tracks speculative loads
- Detect conflicts with later stores
Recovery mechanisms:
- Reorder buffer commits in order
- Checkpoint and rollback
- Selective replay of dependent instructions
Intel's simultaneous multithreading implementation.
Features:
- Two logical cores per physical core
- Duplicate architectural state
- Share execution resources
- Hardware thread scheduling
Benefits:
- Better resource utilization
- Improved throughput for multi-threaded workloads
- Latency hiding (one thread executes while another waits on memory)
Limitations:
- Resource contention can hurt single-thread performance
- Side-channel vulnerabilities between threads sharing a core (e.g., L1TF, PortSmash)
Operations on arrays of data.
Vector architectures:
- Vector registers (multiple elements)
- Vector functional units (pipelined)
- Vector length register
- Vector mask for conditional execution
Vector chaining: Forward results between vector operations
Strip mining: Handle vectors longer than hardware length
Cray-style vector processors: Historical supercomputers
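Strip mining can be shown in scalar C (the inner loop stands in for one vector instruction; `VLEN` is an assumed hardware vector length):

```c
#include <stddef.h>

#define VLEN 8  /* assumed hardware vector length, in elements */

/* Strip mining: split a loop over n elements into chunks of at most VLEN
 * so each chunk fits the vector registers. The final partial chunk uses
 * the same code with a shorter length, as a vector length register would. */
void vadd_strip_mined(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += VLEN) {
        size_t len = (n - i < VLEN) ? n - i : VLEN;
        for (size_t j = 0; j < len; j++)  /* one "vector operation" */
            c[i + j] = a[i + j] + b[i + j];
    }
}
```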
Compiler exposes parallelism by grouping independent operations.
VLIW characteristics:
- Long instruction packet with multiple operations
- Compiler schedules and detects hazards
- Hardware simpler (no dynamic scheduling)
- Fixed resource model
Challenges:
- Code size increase
- Variable latency (caches) hard to predict
- Binary compatibility across implementations
Examples: Intel Itanium (EPIC), some DSPs
Hardware that can adapt to application requirements.
FPGAs: Field-programmable gate arrays
- Configurable logic blocks
- Programmable interconnects
- Reconfigurable at runtime
Coarse-grained reconfigurable arrays (CGRAs):
- Larger functional blocks
- Faster reconfiguration
- Less flexible than FPGAs
Dynamic reconfiguration: Change hardware during execution
Applications: Acceleration for specific algorithms, prototyping, adaptive systems
Fundamental relationship linking performance to key parameters:
CPU time = Instruction count × Cycles per instruction × Cycle time
Or: CPU time = (Instructions × CPI) / Clock rate
Instruction count: Depends on ISA and compiler
CPI: Depends on microarchitecture, memory system, workload
Cycle time: Depends on technology, pipeline depth, critical path
Optimization trade-offs:
- Reduce instruction count (complex instructions) may increase CPI
- Reduce CPI (simpler instructions) may increase instruction count
- Increase clock rate may increase CPI (deeper pipelines, more stalls)
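A worked instance of the performance equation (the numbers are illustrative): 2×10⁹ instructions at CPI 1.5 on a 2 GHz clock take 1.5 seconds.

```c
/* CPU time = instruction count x CPI x cycle time
 *          = (instructions x CPI) / clock rate */
double cpu_time_seconds(double insts, double cpi, double clock_hz) {
    return insts * cpi / clock_hz;
}
```

Halving CPI or doubling the clock rate each halves CPU time, which is why the trade-offs above matter: a change that improves one factor while worsening another may not help overall.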
Benchmark suites:
- SPEC CPU: Compute-intensive
- SPECjbb: Java business
- SPECpower: Energy efficiency
- LINPACK: Dense linear algebra (Top500)
- STREAM: Memory bandwidth
- PARSEC: Multithreaded
Measurement methodology:
- Control environmental factors
- Multiple runs for statistical validity
- Report mean, median, variance
- Specify compiler and flags
Reporting metrics:
- Execution time (lower better)
- Throughput (higher better)
- Speedup relative to baseline
- Normalized results (geometric mean for ratios)
Profiling tools:
- gprof: Function-level profiling
- perf: Linux performance counters
- Intel VTune: Detailed analysis
- Valgrind: Cachegrind (cache simulation), Callgrind
Performance analysis:
- Identify hot spots (where time spent)
- Bottleneck detection (memory, compute, I/O)
- Critical path analysis
Optimization techniques:
- Algorithm improvement (better complexity)
- Compiler optimizations (-O3, profile-guided)
- Data structure optimization (cache-friendly)
- Loop transformations (unrolling, interchange, fusion)
- Vectorization
- Parallelization
Power components:
- Dynamic: αCV²f (activity factor × switched capacitance × voltage² × frequency)
- Static: Leakage current × voltage
Power management:
- DVFS (Dynamic Voltage/Frequency Scaling)
- Clock gating (disable unused logic)
- Power gating (turn off power to idle blocks)
- Race to idle (finish quickly, then sleep)
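The dynamic power formula makes the DVFS payoff concrete (all values below are illustrative): because voltage enters squared and lowering frequency usually permits lower voltage, scaling both down gives roughly cubic power savings.

```c
/* Dynamic switching power: P = alpha * C * V^2 * f */
double dynamic_power(double alpha, double cap_farads,
                     double volts, double freq_hz) {
    return alpha * cap_farads * volts * volts * freq_hz;
}
```

For example, dropping from 1.0 V / 2 GHz to 0.8 V / 1 GHz cuts dynamic power to 0.32× the original, while only halving peak performance.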
Energy efficiency metrics:
- Performance per watt
- Energy per operation
- SPECpower (performance per watt across loads)
Thermal challenges:
- Heat density increases with scaling
- Performance limited by cooling capability
- Thermal throttling protects hardware
Cooling solutions:
- Air cooling (fans, heat sinks)
- Liquid cooling
- Immersion cooling
- Phase change
Design techniques:
- Thermal-aware placement
- Dynamic thermal management
- Temperature monitoring sensors
- Emergency shutdown triggers
Complete computer system on single chip.
Components:
- Processor core (usually 8/16/32-bit)
- Memory (RAM, ROM/Flash)
- Peripherals (timers, ADC, GPIO, communication interfaces)
- Clock generation
- Power management
Popular families:
- 8-bit: AVR (Arduino), PIC, 8051
- 32-bit: ARM Cortex-M, RISC-V (RV32)
- 32/64-bit application class: ARM Cortex-A/R, embedded x86
Characteristics:
- Real-time requirements
- Low power consumption
- Cost sensitivity
- Reliability requirements
Systems with timing constraints; correctness depends on timing as well as logic.
Hard real-time: Missing deadline = system failure (airbag, flight control)
Soft real-time: Missing deadline degrades quality but system continues (video streaming)
Real-time scheduling:
- Rate-monotonic: Fixed priority, shorter period = higher priority
- Earliest deadline first: Dynamic priority
- Worst-case execution time (WCET) analysis
RTOS features:
- Deterministic behavior
- Priority-based scheduling
- Interrupt latency guarantees
- Resource locking protocols
ARM's three-profile architecture:
Cortex-A (Application):
- High performance
- Virtual memory (MMU)
- Linux, Android capable
- Cortex-A76, A78, X-series
Cortex-R (Real-time):
- High reliability
- Error correction
- Fast interrupt response
- Automotive, industrial
Cortex-M (Microcontroller):
- Low power
- Deterministic
- Simple to program
- Cortex-M0, M3, M4, M7, M33
Techniques:
- Voltage scaling (lower voltage reduces power quadratically)
- Frequency scaling (linear power reduction)
- Clock gating (disable unused logic)
- Power gating (cut power to idle blocks)
- State retention (save state before power down)
- Multiple voltage domains
Sleep modes:
- Run: Normal operation
- Sleep: Core clock stopped, peripherals active
- Deep sleep: Most clocks stopped, wake on interrupt
- Standby: Minimal power, long wake-up
Energy harvesting: Power from ambient sources (solar, vibration, thermal)
Extremely powerful systems for scientific and engineering computation.
Top500 ranking based on LINPACK performance:
- Current leaders exceed 1 exaflop (10¹⁸ floating-point ops/second)
- Systems with millions of cores
- Massive parallel processing
Architecture:
- Many compute nodes
- High-speed interconnect
- Parallel file systems
- Accelerators (GPUs, specialized processors)
Groups of commodity computers working together.
Beowulf clusters: Linux-based, Ethernet interconnect
High-availability clusters: Failover for reliability
Load balancing clusters: Distribute work across nodes
Components:
- Compute nodes (standard servers)
- Management node
- Storage (shared or distributed)
- Interconnect (Ethernet, InfiniBand, Omni-Path)
Distributed computing across organizations.
Characteristics:
- Heterogeneous resources
- Multiple administrative domains
- Middleware for resource management
- Virtual organizations
Examples: SETI@home, LHC computing grid
Large-scale data centers providing on-demand resources.
Service models:
- IaaS (Infrastructure as a Service): Virtual machines
- PaaS (Platform as a Service): Development platforms
- SaaS (Software as a Service): Applications
Cloud characteristics:
- Virtualization
- Elasticity (scale up/down)
- Pay-per-use
- Multi-tenancy
Data center design:
- Modular architecture
- Power and cooling efficiency
- Network topology (fat tree, Clos)
- Failure resilience
Ensure only trusted software executes.
Process:
- Hardware boots from immutable ROM
- ROM verifies bootloader signature
- Bootloader verifies OS kernel
- OS verifies drivers and applications
Root of trust: Hardware-protected cryptographic keys
Trusted Execution Environment (TEE):
- Isolated secure world
- Protected memory
- Secure peripherals
Hardware security chip providing cryptographic functions.
Features:
- Secure key generation/storage
- Platform Configuration Registers (PCRs) for measurement
- Remote attestation
- Sealed storage (keys tied to system state)
Applications: Disk encryption, platform integrity, digital rights management
Extract secrets through physical effects.
Types:
- Timing attacks: Measure execution time differences
- Power analysis: Monitor power consumption
- Electromagnetic analysis: Capture EM emissions
- Cache attacks: Observe cache timing/access patterns
- Spectre/Meltdown: Speculative execution leaks
Countermeasures:
- Constant-time algorithms
- Power balancing
- Noise injection
- Cache partitioning/flushing
- Speculative execution restrictions
Malicious modifications to hardware.
Insertion points:
- Design stage (rogue employee)
- Fabrication (untrusted foundry)
- Supply chain (counterfeit components)
Effects:
- Information leakage
- Denial of service
- Backdoor access
- Reduced reliability
Detection methods:
- Side-channel analysis
- Logic testing
- Reverse engineering
- Runtime monitoring
Fault types:
- Permanent: Stuck-at faults, broken connections
- Transient: Soft errors (radiation-induced)
- Intermittent: Marginal timing, temperature-dependent
Fault sources:
- Manufacturing defects
- Wear-out (electromigration, TDDB)
- Environmental (radiation, temperature)
- Design errors
Hardware redundancy:
- Triple Modular Redundancy (TMR): Three modules, vote
- Duplex with comparison
- Standby sparing
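The TMR voter is often implemented bitwise: each output bit takes the value held by at least two of the three module outputs, so a fault confined to any one module is masked.

```c
/* Bitwise majority vote over three redundant module outputs. */
unsigned tmr_vote(unsigned a, unsigned b, unsigned c) {
    return (a & b) | (b & c) | (a & c);
}
```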
Information redundancy:
- Error-correcting codes (ECC)
- Parity
- Checksums
Time redundancy:
- Re-execution
- Alternate paths
- Retry on error detection
Software redundancy:
- N-version programming
- Recovery blocks
- Assertion checking
Error-Correcting Code memory detects and corrects errors.
Single-error correct, double-error detect (SECDED):
- Hamming code with extra parity
- Corrects single-bit errors
- Detects double-bit errors
Chipkill: Corrects failure of entire DRAM chip
Memory scrubbing: Periodically read and correct errors before accumulation
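The single-error-correcting core of SECDED is a Hamming code. A Hamming(7,4) sketch: 4 data bits get 3 parity bits; on receipt, the syndrome directly names the position (1-7) of a flipped bit. Adding one overall parity bit extends this to the (8,4) SECDED codes used in ECC DRAM (omitted here for brevity).

```c
/* Hamming(7,4) encode: positions 1,2,4 are parity; 3,5,6,7 are data.
 * Bit i of the codeword word is position i+1. */
unsigned hamming74_encode(unsigned data) {  /* data: 4 bits */
    unsigned d1 = data & 1, d2 = (data >> 1) & 1,
             d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    unsigned p1 = d1 ^ d2 ^ d4;  /* covers positions 1,3,5,7 */
    unsigned p2 = d1 ^ d3 ^ d4;  /* covers positions 2,3,6,7 */
    unsigned p4 = d2 ^ d3 ^ d4;  /* covers positions 4,5,6,7 */
    return p1 | (p2 << 1) | (d1 << 2) | (p4 << 3)
              | (d2 << 4) | (d3 << 5) | (d4 << 6);
}

/* Recompute parities; the syndrome is the 1-based position of a single
 * flipped bit (0 means no error). Returns the corrected 4 data bits. */
unsigned hamming74_correct(unsigned code) {
    unsigned s1 = ((code >> 0) ^ (code >> 2) ^ (code >> 4) ^ (code >> 6)) & 1;
    unsigned s2 = ((code >> 1) ^ (code >> 2) ^ (code >> 5) ^ (code >> 6)) & 1;
    unsigned s4 = ((code >> 3) ^ (code >> 4) ^ (code >> 5) ^ (code >> 6)) & 1;
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);
    if (syndrome)
        code ^= 1u << (syndrome - 1);  /* flip the erroneous bit back */
    return ((code >> 2) & 1) | (((code >> 4) & 1) << 1)
         | (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3);
}
```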
Metrics:
- MTTF (Mean Time To Failure)
- MTTR (Mean Time To Repair)
- MTBF (Mean Time Between Failures) = MTTF + MTTR
- Availability = MTTF / (MTTF + MTTR)
Reliability modeling:
- Series systems: All components must work
- Parallel systems: At least one works
- k-out-of-n: Any k of n work
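The metrics and models above reduce to short formulas (numbers in the comments are illustrative):

```c
/* Availability = MTTF / (MTTF + MTTR).
 * E.g., MTTF 990 h and MTTR 10 h -> 99% availability. */
double availability(double mttf, double mttr) {
    return mttf / (mttf + mttr);
}

/* Series system: every component must work.   R = R1 * R2        */
double r_series(double r1, double r2)   { return r1 * r2; }

/* Parallel system: fails only if both fail.   R = 1-(1-R1)(1-R2) */
double r_parallel(double r1, double r2) { return 1.0 - (1.0 - r1) * (1.0 - r2); }
```

Note how redundancy changes the picture: two 0.9-reliability components give 0.81 in series but 0.99 in parallel.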
Fault-tolerant design principles:
- No single point of failure
- Graceful degradation
- Fail-safe/fail-operational modes
- Fault containment
Qubit: Quantum bit, superposition of |0⟩ and |1⟩
- State: α|0⟩ + β|1⟩, where |α|² + |β|² = 1
- Measurement collapses to classical bit
Quantum gates: Unitary operations on qubits
- Pauli gates (X, Y, Z): X is the quantum NOT (bit flip); Z flips phase; Y combines both
- Hadamard (H): Creates superposition
- Phase (S, T): Relative phase shifts
- CNOT: Controlled-NOT, entangles qubits
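A single qubit can be simulated directly as its two amplitudes; a minimal C sketch (type and function names are illustrative) showing that the Hadamard gate creates an equal superposition and is its own inverse:

```c
#include <complex.h>
#include <math.h>

/* State alpha|0> + beta|1> with |alpha|^2 + |beta|^2 = 1. */
typedef struct { double complex a, b; } Qubit;

/* Hadamard: |0> -> (|0>+|1>)/sqrt(2), |1> -> (|0>-|1>)/sqrt(2). */
Qubit hadamard(Qubit q) {
    double s = 1.0 / sqrt(2.0);
    Qubit r = { s * (q.a + q.b), s * (q.a - q.b) };
    return r;
}

/* Probability of measuring |0>: |alpha|^2. */
double prob0(Qubit q) {
    return creal(q.a) * creal(q.a) + cimag(q.a) * cimag(q.a);
}
```

Starting from |0⟩, one Hadamard gives a 50/50 measurement outcome; a second Hadamard returns the state to |0⟩ (probability 1), something no classical coin flip can do.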
Sequence of quantum gates operating on qubits.
Circuit model:
- Qubits as wires
- Gates as operations
- Measurement at end
Algorithms:
- Shor's algorithm: Factor large numbers (breaks RSA)
- Grover's search: Quadratic speedup
- Quantum simulation: Model quantum systems
Qubits extremely fragile; error correction essential.
Challenges:
- No cloning theorem (can't copy qubits)
- Continuous errors (not just bit flips)
- Measurement destroys superposition
Approaches:
- Shor code: 9 qubits encode 1 logical qubit
- Surface codes: 2D lattice, topological protection
- Concatenated codes: Hierarchical encoding
Demonstrate quantum computer solving problem intractable for classical computers.
Google Sycamore (2019): 53-qubit processor, 200-second task estimated 10,000 years on classical supercomputer (contested)
Current status:
- Noisy Intermediate-Scale Quantum (NISQ) era
- Hundreds of qubits with errors
- Error correction not yet practical
- Continuing rapid progress
Google's custom ASICs for machine learning.
Architecture:
- Systolic array of multipliers
- Matrix multiply units
- Large on-chip memory
- Reduced-precision arithmetic (bfloat16, quantized int8)
Generations:
- TPUv1: Inference only
- TPUv2/v3: Training capable
- TPUv4: Improved performance
Dedicated hardware for neural network inference.
Common features:
- Many small processing elements
- Weight storage local to compute
- Hardware activation functions
- Dataflow optimization
Examples:
- Apple Neural Engine
- Qualcomm Hexagon DSP
- Huawei Ascend
- ARM Ethos
AI acceleration at the network edge.
Requirements:
- Low power consumption
- Real-time processing
- Privacy preservation
- Limited connectivity
Architectures:
- MicroNPUs for microcontrollers
- Ultra-low-power accelerators
- Always-on sensor processing
Applications:
- Voice activation
- Image recognition
- Predictive maintenance
- Autonomous vehicles
Basic concepts:
- Instruction mnemonics
- Addressing modes
- Directives and macros
- Calling conventions
- Example programs
Detailed analysis of significant designs:
- Intel Core microarchitecture
- AMD Zen
- ARM Cortex-A76
- RISC-V Rocket/BOOM
Review of essential mathematics:
- Boolean algebra
- Number theory (for cryptography)
- Linear algebra (for AI)
- Probability and statistics
Introduction to chip design:
- CMOS technology
- Layout and fabrication
- Design rules
- Timing closure
- Power distribution
Definitions of key terms:
- Architecture vs organization
- Pipeline hazards
- Cache coherence
- Virtual memory
- Parallel processing
- Security concepts
Classic papers and contemporary research:
- Von Neumann's EDVAC report
- Patterson's RISC papers
- Hennessy and Patterson textbooks
- ISCA/MICRO/HPCA proceedings
- Architecture conferences and journals