
@nkabrown
Last active February 26, 2026 19:51
Designing Data-Intensive Applications – Chapter 4

Encoding and Evolution

The need to evolve the system often means that you need to maintain backwards and forwards compatibility at the same time.

  • Backwards compatibility – newer code can read data that was written by older code
  • Forwards compatibility – older code can read data that was written by newer code (older code ignores additions made by newer code)

Formats for Encoding Data

In memory, data is kept in a wide variety of data structures optimized for efficient access and manipulation. When writing data to a file or sending it over the network, the data must be encoded as some kind of self-contained sequence of bytes.

data structure representation (e.g., objects, lists, arrays, hash tables)

sequence-of-bytes representation (e.g., JSON, Protocol Buffers, YAML)

Translating from the in-memory representation to a sequence-of-bytes is called encoding (serialization) and the reverse is called decoding (deserialization).
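A minimal sketch of this round trip using Python's built-in json module (the user record here is purely illustrative):

```python
import json

# In-memory representation: a Python dict (hash table)
user = {"name": "Alice", "age": 30}

# Encoding (serialization): in-memory structure -> self-contained bytes
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): bytes -> in-memory structure
decoded = json.loads(encoded.decode("utf-8"))

assert isinstance(encoded, bytes)
assert decoded == user
```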

Disadvantages of built-in programming language encoding libraries:

  • the encoding is often tied to one programming language, so it is not portable
  • decoding can often instantiate arbitrary classes, which is a security vulnerability
  • versioning data is an afterthought, so backwards and forwards compatibility is challenging
  • efficiency of the encoding and decoding operations is often poor

We are looking for an encoding format that is language-independent.

Textual Formats

JSON, XML, CSV, YAML

Textual formats lack precise type systems and so suffer from ambiguity; numbers are one example. Binary strings have to be encoded as text using Base64, so every binary string grows 33% larger. There are no built-in schemas; external schemas exist for most textual formats, but they don't reduce size or guarantee evolution, and schema support is weak (weaker still for CSV). Textual formats are schema-on-read: the schema is implicit in the data itself. They are popular as data interchange formats because they make it easy to move data between different systems and organizations.
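The 33% figure follows directly from Base64's design: every 3 bytes of binary input become 4 ASCII characters of output. A quick check in Python:

```python
import base64
import os

raw = os.urandom(3000)              # 3,000 bytes of arbitrary binary data
encoded = base64.b64encode(raw)     # every 3 input bytes -> 4 output chars

print(len(raw), len(encoded))       # 3000 4000
assert len(encoded) == len(raw) * 4 // 3
```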

Binary Formats

More compact and faster to parse than text. Binary formats such as Thrift, Protocol Buffers, and Avro have schemas.

Schema evolution - inevitably all schemas must change over time. How can you manage schema changes while maintaining forwards and backwards compatibility?
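A sketch of the usual answer in JSON terms (the field names and helper functions below are hypothetical): new fields must be optional or carry defaults, and readers must tolerate fields they do not recognize.

```python
import json

def encode_v2(user):
    # Newer writer adds a "nickname" field.
    return json.dumps({"name": user["name"], "nickname": user.get("nickname")})

def decode_v1(data):
    # Older reader: forwards compatibility means ignoring unknown fields.
    record = json.loads(data)
    return {"name": record["name"]}

def decode_v2(data):
    # Newer reader: backwards compatibility means defaulting missing fields.
    record = json.loads(data)
    return {"name": record["name"], "nickname": record.get("nickname")}

# Backwards compatibility: new code reads old data
assert decode_v2('{"name": "Alice"}') == {"name": "Alice", "nickname": None}

# Forwards compatibility: old code reads new data
assert decode_v1(encode_v2({"name": "Bob", "nickname": "bobby"})) == {"name": "Bob"}
```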

Notes

Why is the choice of encoding so important to the ability of your system to evolve?

Schema-on-Read notes

Schema-on-Read (JSON/YAML/XML)

  # Data contains its own structure
  users:
    - name: Alice
      age: 30
      email: alice@example.com
    - name: Bob
      age: 25
      email: bob@example.com

Characteristics:

  • Schema is implicit in the data itself
  • Field names written with every record
  • Reader interprets structure at parse time
  • Flexible but verbose and error-prone

Evolution challenges:

  • No guarantees about what fields exist
  • Type mismatches discovered at runtime
  • No clear contract between writer and reader
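These challenges show up concretely with Python's json module: nothing stops a writer from changing a field's type or omitting a field, and the reader only finds out at runtime (the record below is illustrative):

```python
import json

# A writer silently changed "age" from a number to a string.
record = json.loads('{"name": "Alice", "age": "30"}')

try:
    next_year = record["age"] + 1   # type mismatch surfaces only now
except TypeError as err:
    print("runtime type error:", err)

missing = record.get("email")       # no guarantee the field exists
print(missing)                      # None
```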

Schema-on-Write (Protobuf/Thrift/Avro)

  // Schema defined separately
  message User {
    required string name = 1;
    required int32 age = 2;
    optional string email = 3;
  }

Characteristics:

  • Schema defined separately from data
  • Field names not stored in data (use field numbers/tags)
  • Writer encodes according to schema, reader decodes according to schema
  • Compact and type-safe

Evolution benefits:

  • Clear rules for adding/removing fields
  • Type checking at encode/decode time
  • Explicit compatibility guarantees (forward/backward compatibility)
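The field numbers are what make this compact: the encoded bytes carry numeric tags, never names. A toy sketch of the Protocol Buffers wire format for the User message above (real varint encoding handles multi-byte values; this sketch assumes short strings and small non-negative ints):

```python
LENGTH_DELIMITED = 2   # wire type for strings, bytes, nested messages
VARINT = 0             # wire type for int32, int64, bool, enum

def encode_tag(field_number, wire_type):
    # Each field is prefixed with a tag: (field_number << 3) | wire_type.
    # This is why field names never appear in the encoded bytes.
    return bytes([(field_number << 3) | wire_type])

def encode_user(name, age):
    out = encode_tag(1, LENGTH_DELIMITED)             # field 1: name
    out += bytes([len(name)]) + name.encode("utf-8")
    out += encode_tag(2, VARINT)                      # field 2: age
    out += bytes([age])                               # single-byte varint
    return out

encoded = encode_user("Alice", 30)
print(encoded.hex())   # 0a05416c696365101e
```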

YAML – "There are many kinds of data structures, but they can all be adequately represented with three basic primitives: mappings (hashes/dictionaries), sequences (arrays/lists) and scalars (strings/numbers). YAML leverages these primitives and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.

YAML represents any native data structure using three node kinds: sequence - an ordered series of entries; mapping - an unordered association of unique keys to values; and scalar - any datum with opaque structure presentable as a series of Unicode characters.

Combined, these primitives generate directed graph structures."
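The aliasing mechanism is what lifts a YAML document from a tree to a directed graph: an anchored node can be referenced from several places. A hypothetical config sketch:

```yaml
defaults: &defaults    # anchor: give this node a name
  retries: 3
  timeout: 30

service_a: *defaults   # alias: both keys reference the same node,
service_b: *defaults   # so the document forms a directed graph
```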

JSON and JSON spec

Data Interchange Format – A data interchange format is designed to move data between different systems and has these characteristics:

Core Requirements:

  1. Serialization - Convert in-memory objects to bytes and back
  2. Language independence - Work across different programming languages
  3. Evolution support - Handle schema changes over time
  4. Documentation - Clear structure and types

Key Distinction:

  Not Interchange           Interchange Format
  ------------------------  -----------------------------
  In-memory structures      Serialized bytes
  Language-specific         Language-agnostic
  Single process            Crosses boundaries
  No schema                 Schema (explicit or implicit)

Choose your interchange format based on:

  • JSON/XML - Human readability, universal support, REST APIs
  • Protobuf/Thrift - Efficiency, strong typing, RPC/microservices
  • Avro - Schema evolution, data pipelines, streaming
  • Parquet - Analytics, data warehouses, not general interchange

The format becomes the contract between systems - get it right, and systems can evolve independently. Get it wrong (e.g., using Python pickle between services), and you create tight coupling and fragility.
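The pickle case is easy to demonstrate: the serialized bytes embed the writer's module path and class name, so any reader is coupled to having the identical class importable in the same place (a minimal sketch with a hypothetical Point class):

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

payload = pickle.dumps(Point(1, 2))

# The class name travels inside the bytes; a service written in another
# language, or a Python service without this exact class, cannot decode it.
assert b"Point" in payload
restored = pickle.loads(payload)
assert (restored.x, restored.y) == (1, 2)
```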

The interchange format is the boundary where systems meet - it must be stable, well-documented, and evolution-friendly to enable long-term system health.

The Interchange Spectrum

Formats exist on a spectrum:

  <--- More Specialized          General-Purpose Interchange          More Specialized --->

  In-Memory            Language-Specific          Cross-Language         Storage-Optimized
  Structures           Serialization              RPC/Messaging          Formats

  [Python dict/list]   [Python Pickle]            [JSON/XML]             [Parquet]
                       [Java Serialization]       [Protobuf/Thrift]      [ORC]
                       [.NET BinaryFormatter]     [Avro]

Left Side (Not Interchange)

  • In-memory only
  • Single language/process
  • No serialization

Middle (Primary Interchange)

  • JSON, XML - Text-based, universal
  • Protobuf, Thrift - Binary RPC
  • Avro - Binary with schema evolution

Right Side (Specialized Interchange)

  • Parquet, ORC - Columnar analytics
  • Used between data processing systems
  • But optimized for storage/analytics, not general messaging