
@nkabrown
Last active February 26, 2026 19:51
Designing Data-Intensive Applications – Chapter 4

Encoding and Evolution

The need to evolve the system often means that you need to maintain backwards and forwards compatibility at the same time.

  • Backwards compatibility – newer code can read data that was written by older code
  • Forwards compatibility – older code can read data that was written by newer code (older code ignores additions made by newer code)

Formats for Encoding Data

In memory, data is kept in a wide variety of data structures optimized for efficient access and manipulation. When writing data to a file or sending it over the network, the data must be encoded as some kind of self-contained sequence of bytes.

data structure representation (e.g., objects, lists, arrays, hash tables)

sequence-of-bytes representation (e.g., JSON, Protocol Buffers, YAML)

Translating from the in-memory representation to a sequence-of-bytes is called encoding (serialization) and the reverse is called decoding (deserialization).
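A minimal sketch of this round trip using Python's built-in json module (the user record here is purely illustrative):

```python
import json

# In-memory representation: a Python dict (hash table)
user = {"name": "Alice", "age": 30}

# Encoding (serialization): in-memory structure -> self-contained bytes
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): bytes -> in-memory structure
decoded = json.loads(encoded.decode("utf-8"))

assert isinstance(encoded, bytes)
assert decoded == user
```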

Disadvantages of built-in programming language encoding libraries:

  • the encoding is often tied to one programming language, so it is not portable
  • decoding can often instantiate arbitrary classes, which is a security vulnerability
  • versioning data is an afterthought, so backwards and forwards compatibility is challenging
  • efficiency of the encoding and decoding operations is often poor

We are looking for an encoding format that is language-independent.

Textual Formats

JSON, XML, CSV, YAML

Textual formats lack precise type systems and so suffer from ambiguity; numbers are one example. Binary strings have to be encoded as text using Base64, so every binary string grows 33% larger. There are no built-in schemas; external schemas exist for most textual formats, but they don't reduce size or guarantee evolution, and schema support is weak (weaker still for CSV). Textual formats are schema-on-read: the schema is implicit in the data itself. They are popular as data interchange formats because they make it easy to move data between different systems and organizations.
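The 33% figure follows directly from Base64's design: every 3 bytes of binary input become 4 ASCII characters of output. A quick check in Python:

```python
import base64
import os

raw = os.urandom(3000)              # 3,000 bytes of arbitrary binary data
encoded = base64.b64encode(raw)     # every 3 input bytes -> 4 output chars

print(len(raw), len(encoded))       # 3000 4000
assert len(encoded) == len(raw) * 4 // 3
```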

Binary Formats

More compact and faster to parse than text. Binary formats such as Thrift, Protocol Buffers, and Avro have schemas.

Schema evolution - inevitably all schemas must change over time. How can you manage schema changes while maintaining forwards and backwards compatibility?
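A sketch of the usual answer in JSON terms (the field names and helper functions below are hypothetical): new fields must be optional or carry defaults, and readers must tolerate fields they do not recognize.

```python
import json

def encode_v2(user):
    # Newer writer adds a "nickname" field.
    return json.dumps({"name": user["name"], "nickname": user.get("nickname")})

def decode_v1(data):
    # Older reader: forwards compatibility means ignoring unknown fields.
    record = json.loads(data)
    return {"name": record["name"]}

def decode_v2(data):
    # Newer reader: backwards compatibility means defaulting missing fields.
    record = json.loads(data)
    return {"name": record["name"], "nickname": record.get("nickname")}

# Backwards compatibility: new code reads old data
assert decode_v2('{"name": "Alice"}') == {"name": "Alice", "nickname": None}

# Forwards compatibility: old code reads new data
assert decode_v1(encode_v2({"name": "Bob", "nickname": "bobby"})) == {"name": "Bob"}
```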

Notes

Why is the choice of encoding so important to the ability of your system to evolve?

Schema-on-Read notes

Schema-on-Read (JSON/YAML/XML)

  # Data contains its own structure
  users:
    - name: Alice
      age: 30
      email: alice@example.com
    - name: Bob
      age: 25
      email: bob@example.com

Characteristics:

  • Schema is implicit in the data itself
  • Field names written with every record
  • Reader interprets structure at parse time
  • Flexible but verbose and error-prone

Evolution challenges:

  • No guarantees about what fields exist
  • Type mismatches discovered at runtime
  • No clear contract between writer and reader
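These challenges show up concretely with Python's json module: nothing stops a writer from changing a field's type or omitting a field, and the reader only finds out at runtime (the record below is illustrative):

```python
import json

# A writer silently changed "age" from a number to a string.
record = json.loads('{"name": "Alice", "age": "30"}')

try:
    next_year = record["age"] + 1   # type mismatch surfaces only now
except TypeError as err:
    print("runtime type error:", err)

missing = record.get("email")       # no guarantee the field exists
print(missing)                      # None
```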

Schema-on-Write (Protobuf/Thrift/Avro)

  // Schema defined separately
  message User {
    required string name = 1;
    required int32 age = 2;
    optional string email = 3;
  }

Characteristics:

  • Schema defined separately from data
  • Field names not stored in data (use field numbers/tags)
  • Writer encodes according to schema, reader decodes according to schema
  • Compact and type-safe

Evolution benefits:

  • Clear rules for adding/removing fields
  • Type checking at encode/decode time
  • Explicit compatibility guarantees (forward/backward compatibility)
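The field numbers are what make this compact: the encoded bytes carry numeric tags, never names. A toy sketch of the Protocol Buffers wire format for the User message above (real varint encoding handles multi-byte values; this sketch assumes short strings and small non-negative ints):

```python
LENGTH_DELIMITED = 2   # wire type for strings, bytes, nested messages
VARINT = 0             # wire type for int32, int64, bool, enum

def encode_tag(field_number, wire_type):
    # Each field is prefixed with a tag: (field_number << 3) | wire_type.
    # This is why field names never appear in the encoded bytes.
    return bytes([(field_number << 3) | wire_type])

def encode_user(name, age):
    out = encode_tag(1, LENGTH_DELIMITED)             # field 1: name
    out += bytes([len(name)]) + name.encode("utf-8")
    out += encode_tag(2, VARINT)                      # field 2: age
    out += bytes([age])                               # single-byte varint
    return out

encoded = encode_user("Alice", 30)
print(encoded.hex())   # 0a05416c696365101e
```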

YAML – "There are many kinds of data structures, but they can all be adequately represented with three basic primitives: mappings (hashes/dictionaries), sequences (arrays/lists) and scalars (strings/numbers). YAML leverages these primitives and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.

YAML represents any native data structure using three node kinds: sequence - an ordered series of entries; mapping - an unordered association of unique keys to values; and scalar - any datum with opaque structure presentable as a series of Unicode characters.

Combined, these primitives generate directed graph structures."
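The aliasing mechanism is what lifts a YAML document from a tree to a directed graph: an anchored node can be referenced from several places. A hypothetical config sketch:

```yaml
defaults: &defaults    # anchor: give this node a name
  retries: 3
  timeout: 30

service_a: *defaults   # alias: both keys reference the same node,
service_b: *defaults   # so the document forms a directed graph
```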

JSON and JSON spec

Data Interchange Format – A data interchange format is designed to move data between different systems and has these characteristics:

Core Requirements:

  1. Serialization - Convert in-memory objects to bytes and back
  2. Language independence - Work across different programming languages
  3. Evolution support - Handle schema changes over time
  4. Documentation - Clear structure and types

Key Distinction:

  Not Interchange           Interchange Format
  ------------------------  -----------------------------
  In-memory structures      Serialized bytes
  Language-specific         Language-agnostic
  Single process            Crosses boundaries
  No schema                 Schema (explicit or implicit)

Choose your interchange format based on:

  • JSON/XML - Human readability, universal support, REST APIs
  • Protobuf/Thrift - Efficiency, strong typing, RPC/microservices
  • Avro - Schema evolution, data pipelines, streaming
  • Parquet - Analytics, data warehouses, not general interchange

The format becomes the contract between systems - get it right, and systems can evolve independently. Get it wrong (e.g., using Python pickle between services), and you create tight coupling and fragility.
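The pickle case is easy to demonstrate: the serialized bytes embed the writer's module path and class name, so any reader is coupled to having the identical class importable in the same place (a minimal sketch with a hypothetical Point class):

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

payload = pickle.dumps(Point(1, 2))

# The class name travels inside the bytes; a service written in another
# language, or a Python service without this exact class, cannot decode it.
assert b"Point" in payload
restored = pickle.loads(payload)
assert (restored.x, restored.y) == (1, 2)
```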

The interchange format is the boundary where systems meet - it must be stable, well-documented, and evolution-friendly to enable long-term system health.

The Interchange Spectrum

Formats exist on a spectrum:

  <--- More Specialized          General-Purpose Interchange          More Specialized --->

  In-Memory            Language-Specific          Cross-Language         Storage-Optimized
  Structures           Serialization              RPC/Messaging          Formats

  [Python dict/list]   [Python Pickle]            [JSON/XML]             [Parquet]
                       [Java Serialization]       [Protobuf/Thrift]      [ORC]
                       [.NET BinaryFormatter]     [Avro]

Left Side (Not Interchange)

  • In-memory only
  • Single language/process
  • No serialization

Middle (Primary Interchange)

  • JSON, XML - Text-based, universal
  • Protobuf, Thrift - Binary RPC
  • Avro - Binary with schema evolution

Right Side (Specialized Interchange)

  • Parquet, ORC - Columnar analytics
  • Used between data processing systems
  • But optimized for storage/analytics, not general messaging