@cfahlgren1
Created January 23, 2026 22:42
hyparquet mental model + writer list/struct gaps

Hyparquet: mental model (reader-side) + writer gaps

Scope: This doc is based on the hyparquet npm package (v1.24.0) source and the hyparquet-writer repo. It focuses on how hyparquet reads Parquet (schema, pages, Dremel) and why writer-side LIST/STRUCT support is still partial.


1) Big picture: from bytes -> rows

Parquet file (bytes)
  |
  |--[footer]-- thrift metadata + 4-byte metadata length + "PAR1" magic (file order)
  |
  v
metadata tree + row groups
  |
  |-- plan byte ranges (row groups + columns + indexes)
  |
  v
row group(s) -> column chunks -> pages
  |
  |-- decode pages (levels + values)
  |-- dictionary decode + convert logical types
  |
  v
assemble nested (LIST/MAP/STRUCT)
  |
  v
rows (object or array format)

Key idea: Parquet is columnar. Hyparquet reads column pages, then re-assembles nested values using Dremel definition/repetition levels.


2) Schema model

Hyparquet builds a schema tree from the Parquet schema elements. Each node stores:

  • element (SchemaElement)
  • children
  • path (array of names from root)

From this tree it derives:

  • getSchemaPath(schema, path) => path from root to a given logical field
  • getPhysicalColumns(tree) => all leaf (physical) columns (dot-joined paths)
  • list/map shape detection (isListLike, isMapLike)

ASCII picture for nested schema:

root
└─ person (STRUCT)
   ├─ name (STRING)
   └─ phones (LIST)
      └─ list (REPEATED)
         └─ element (STRING)

physical columns:
- person.name
- person.phones.list.element

Physical columns are the leaf nodes; nested columns are reconstructed later.
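
To make the tree and its leaf columns concrete, here is a minimal sketch that builds a tree from Parquet's flattened, depth-first schema element list and collects the leaf paths. The function names buildSchemaTree and leafColumns and the count field are hypothetical, chosen for illustration; this is not hyparquet's code from src/schema.js, only the same idea in miniature.

```javascript
// Hypothetical helpers for illustration; not hyparquet's actual implementation.

/**
 * Build a tree from a flattened, depth-first list of schema elements.
 * Group elements carry num_children; leaf elements do not.
 */
function buildSchemaTree(elements, index = 0, path = []) {
  const element = elements[index]
  const children = []
  let count = 1 // number of elements consumed by this subtree
  for (let i = 0; i < (element.num_children ?? 0); i++) {
    const childElement = elements[index + count]
    const child = buildSchemaTree(elements, index + count, [...path, childElement.name])
    count += child.count
    children.push(child)
  }
  return { element, children, path, count }
}

/** Collect the dot-joined paths of all leaf (physical) columns. */
function leafColumns(node) {
  if (node.children.length === 0) return [node.path.join('.')]
  return node.children.flatMap(leafColumns)
}

// The person/phones schema from the picture above, flattened depth-first
const schemaElements = [
  { name: 'root', num_children: 1 },
  { name: 'person', num_children: 2 },
  { name: 'name', type: 'BYTE_ARRAY', converted_type: 'UTF8' },
  { name: 'phones', num_children: 1, converted_type: 'LIST' },
  { name: 'list', num_children: 1, repetition_type: 'REPEATED' },
  { name: 'element', type: 'BYTE_ARRAY', converted_type: 'UTF8' },
]
const schemaTree = buildSchemaTree(schemaElements)
const physicalColumns = leafColumns(schemaTree)
console.log(physicalColumns)
```

The root's path is left empty in this sketch so that leaf paths come out as person.name rather than root.person.name, matching the physical column names listed above.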


3) Metadata + planning

Hyparquet reads metadata from the footer, validates PAR1, and parses the thrift metadata. It then builds a plan of byte ranges to fetch based on:

  • row range
  • selected columns
  • optional filters
  • optional offset indexes

Important: physical columns are used for planning, but the API works in terms of top-level logical columns.
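
The footer step can be sketched as follows. A Parquet file ends with the thrift-encoded metadata, then a 4-byte little-endian metadata length, then the 4-byte magic "PAR1". The function name locateFooterMetadata is hypothetical, and the sketch assumes the whole file is already in memory as an ArrayBuffer, whereas hyparquet fetches only the tail of the file. It locates the metadata but does not parse the thrift.

```javascript
/**
 * Locate the thrift metadata inside a complete Parquet file buffer.
 * Layout at the end of the file: [metadata][4-byte LE length]["PAR1"].
 * Hypothetical helper for illustration only.
 */
function locateFooterMetadata(arrayBuffer) {
  const end = arrayBuffer.byteLength
  if (end < 12) throw new Error('file too short to be parquet')
  const view = new DataView(arrayBuffer)
  // "PAR1" is bytes 0x50 0x41 0x52 0x31; as a little-endian uint32 that is 0x31524150
  if (view.getUint32(end - 4, true) !== 0x31524150) {
    throw new Error('invalid parquet file: missing PAR1 magic')
  }
  const metadataLength = view.getUint32(end - 8, true)
  const metadataOffset = end - 8 - metadataLength
  // The file also starts with a 4-byte magic, so metadata cannot begin before byte 4
  if (metadataOffset < 4) throw new Error('metadata length exceeds file size')
  return { metadataOffset, metadataLength }
}

// A fake 20-byte "file": magic, 5 bytes of data, 3 bytes of metadata, length, magic
const fakeFile = new Uint8Array(20)
fakeFile.set([0x50, 0x41, 0x52, 0x31], 0)  // leading magic
fakeFile.set([3, 0, 0, 0], 12)             // metadata length = 3, little-endian
fakeFile.set([0x50, 0x41, 0x52, 0x31], 16) // trailing magic
const footerInfo = locateFooterMetadata(fakeFile.buffer)
console.log(footerInfo)
```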


4) Row groups + column chunks

For each selected row group:

  • build a columnDecoder for each chunk (schema path + types)
  • optionally use offset indexes to fetch only necessary pages
  • read column chunk pages into decoded arrays

The reader returns column chunks, not rows. Row assembly happens later.
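
As a rough illustration of how selected top-level columns turn into byte ranges, the sketch below walks one row group and computes the range of each matching column chunk. The field names (path_in_schema, dictionary_page_offset, data_page_offset, total_compressed_size) come from Parquet's ColumnMetaData struct. The function name planRowGroup and the use of plain numbers for offsets are assumptions made for brevity, and offset indexes and filters are ignored.

```javascript
/**
 * Compute the byte range of each selected column chunk in a row group.
 * `columns` is an optional list of top-level (logical) column names.
 * Hypothetical helper; offsets are plain numbers here for simplicity.
 */
function planRowGroup(rowGroup, columns) {
  const ranges = []
  for (const { meta_data } of rowGroup.columns) {
    // Selection is by top-level name, but each physical leaf gets its own range
    const topLevel = meta_data.path_in_schema[0]
    if (columns && !columns.includes(topLevel)) continue
    // A chunk starts at its dictionary page when one exists, else at its first data page
    const startByte = meta_data.dictionary_page_offset || meta_data.data_page_offset
    ranges.push({
      column: meta_data.path_in_schema.join('.'),
      startByte,
      endByte: startByte + meta_data.total_compressed_size,
    })
  }
  return ranges
}

const exampleRowGroup = {
  columns: [
    { meta_data: { path_in_schema: ['person', 'name'], dictionary_page_offset: 4, data_page_offset: 40, total_compressed_size: 100 } },
    { meta_data: { path_in_schema: ['person', 'phones', 'list', 'element'], data_page_offset: 104, total_compressed_size: 50 } },
    { meta_data: { path_in_schema: ['id'], data_page_offset: 154, total_compressed_size: 20 } },
  ],
}
const plannedRanges = planRowGroup(exampleRowGroup, ['person'])
console.log(plannedRanges)
```

Selecting the logical column person pulls in both of its physical leaves and skips id, which is the logical-versus-physical distinction in practice.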


5) Page decoding + Dremel levels

Each data page includes:

  • definition levels (optional / null tracking)
  • repetition levels (LIST/MAP nesting)
  • encoded values (plain, dictionary, delta, split stream, etc.)

Hyparquet logic:

readDataPage ->
  read repetition levels
  read definition levels
  read encoded values (nValues = num_values - numNulls)
  return { defLevels, repLevels, values }

This is exactly the Dremel representation that lets you reconstruct nested lists.
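
Two small calculations sit behind the pseudocode above: the maximum levels for a leaf column, and how many values are physically stored in a page. The sketch below assumes every element on the path has an explicit repetition_type. The function names maxLevels and countStoredValues are hypothetical.

```javascript
/**
 * Max definition and repetition level for a leaf column.
 * `schemaPath` is the list of schema elements from root to leaf.
 * Hypothetical helper for illustration.
 */
function maxLevels(schemaPath) {
  let maxDef = 0
  let maxRep = 0
  for (const element of schemaPath.slice(1)) { // the root never counts
    if (element.repetition_type !== 'REQUIRED') maxDef++ // OPTIONAL or REPEATED
    if (element.repetition_type === 'REPEATED') maxRep++
  }
  return { maxDef, maxRep }
}

/** A value is stored only when its definition level reaches the maximum. */
function countStoredValues(defLevels, maxDef) {
  let numNulls = 0
  for (const def of defLevels) {
    if (def < maxDef) numNulls++
  }
  return defLevels.length - numNulls // num_values - numNulls
}

// Path for an optional list of required strings: phones.list.element
const phonesPath = [
  { name: 'root', repetition_type: 'REQUIRED' },
  { name: 'phones', repetition_type: 'OPTIONAL', converted_type: 'LIST' },
  { name: 'list', repetition_type: 'REPEATED' },
  { name: 'element', repetition_type: 'REQUIRED' },
]
const levels = maxLevels(phonesPath)
const storedValues = countStoredValues([2, 2, 0, 1, 2], levels.maxDef)
console.log(levels, storedValues)
```

With five level entries and two of them below the maximum definition level, only three values need to be decoded from the page.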


6) Reconstructing nested data

6.1 LIST reconstruction

assembleLists() takes:

  • flat values
  • def + rep levels
  • schema path

and walks the Dremel levels to rebuild nested arrays.

Intuition:

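  • def level counts how many optional or repeated fields along the path are actually present, so it distinguishes a null list, an empty list, and a present element
  • rep level says at which nesting depth a value continues: 0 starts a new row, a higher level appends to the list already open at that depth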

ASCII mini example (optional list of required ints):

rows: [ [1,2], null, [], [3] ]

def:  [2,2,0,1,2]   (0=null list, 1=empty list, 2=max, value present)
rep:  [0,1,0,0,0]
vals: [1,2,3]

=> [[1,2], null, [], [3]]
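
A minimal sketch of the walk, restricted to one shape: a top-level optional list of required elements. It uses definition level 0 for a null list, 1 for an empty list, and 2 for a present element. The function name assembleSimpleList is hypothetical. hyparquet's assembleLists() handles arbitrary nesting depth, which this sketch does not attempt.

```javascript
/**
 * Rebuild rows for a top-level OPTIONAL list of REQUIRED elements.
 * def 0 = list is null, def 1 = list is empty, def 2 = element present.
 * rep 0 = first entry of a new row, rep 1 = another element in the same list.
 * Hypothetical single-level sketch, not a general Dremel assembler.
 */
function assembleSimpleList(defLevels, repLevels, values) {
  const rows = []
  let valueIndex = 0 // values are only stored for def === 2 entries
  for (let i = 0; i < defLevels.length; i++) {
    const def = defLevels[i]
    if (repLevels[i] === 0) {
      // Start a new row: null list, or a list that may receive elements
      rows.push(def === 0 ? null : [])
    }
    if (def === 2) {
      rows[rows.length - 1].push(values[valueIndex++])
    }
  }
  return rows
}

const assembledRows = assembleSimpleList([2, 2, 0, 1, 2], [0, 1, 0, 0, 0], [1, 2, 3])
console.log(JSON.stringify(assembledRows)) // [[1,2],null,[],[3]]
```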

6.2 STRUCT reconstruction

assembleNested() gathers physical leaf columns for a struct and inverts them into row objects. It handles:

  • STRUCT (group nodes)
  • LIST (list nodes)
  • MAP (map nodes)
  • VARIANT logical type

Key idea: nested reconstruction is post-processing of leaf columns.
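
The post-processing step can be pictured as zipping columns into row objects. The sketch below assumes the leaf columns have already been decoded and any lists already assembled, so each column is a plain array keyed by its dot-joined logical path. It ignores null structs, which a real implementation would detect from definition levels. The function name zipStructColumns is hypothetical.

```javascript
/**
 * Zip already-assembled columns into nested row objects.
 * `columns` maps dot-joined paths to arrays of length `numRows`.
 * Hypothetical helper; does not handle null (optional) structs.
 */
function zipStructColumns(columns, numRows) {
  const rows = []
  for (let r = 0; r < numRows; r++) {
    const row = {}
    for (const [path, data] of Object.entries(columns)) {
      const parts = path.split('.')
      let target = row
      // Walk or create the intermediate struct objects
      for (const part of parts.slice(0, -1)) {
        if (target[part] === undefined) target[part] = {}
        target = target[part]
      }
      target[parts[parts.length - 1]] = data[r]
    }
    rows.push(row)
  }
  return rows
}

const structRows = zipStructColumns({
  'person.name': ['Ada', 'Bob'],
  'person.phones': [['555-0100'], []],
}, 2)
console.log(JSON.stringify(structRows))
```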


7) Why writer-side LIST/STRUCT is still partial

The hyparquet reader supports nested data (LIST/MAP/STRUCT), but hyparquet-writer does not yet write all of it.

Current writer limitations (summary):

  • STRUCT columns are blocked by the writer path: it assumes a 1:1 column->leaf mapping and throws on multi-child schema nodes.
  • LIST encoding is partial: list support only triggers for the canonical LIST layout (converted_type=LIST with repeated child).
  • No general Dremel encoder: writer has list-specific encoding, but not the generalized def/rep encoder needed for STRUCT, LIST-of-STRUCT, STRUCT-of-LIST, etc.
  • Legacy list layouts (REPEATED leaf) are rejected by writer’s page encoder.

So: reader can assemble nested structures, but writer can’t yet generate the Dremel levels and physical leaf columns needed to round-trip them.


8) What’s needed to fully support LIST + STRUCT in writer

At a high level, hyparquet-writer must:

  1. Accept leaf columns with schema paths (or accept row-objects and flatten them).
  2. Generate definition + repetition levels for all nested structures (generalized Dremel encoder).
  3. Write correct page metadata (num_rows, num_values, null counts) for repeated fields.
  4. Validate schema vs data shape and support LIST + STRUCT layout variants.

Once those exist, the reader’s existing assembly code will work end-to-end.
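
To show what steps 2 and 3 involve in the simplest case, here is a sketch of the encoding direction for a single shape: a top-level optional list of required elements, with definition level 0 for a null list, 1 for an empty list, and 2 for a present element. The function name encodeSimpleList is hypothetical and this is not hyparquet-writer code. A generalized encoder would derive the levels from the schema path instead of hard-coding them.

```javascript
/**
 * Encode rows of an OPTIONAL list of REQUIRED elements into Dremel levels.
 * Hypothetical single-level sketch, not a general Dremel encoder.
 */
function encodeSimpleList(rows) {
  const defLevels = []
  const repLevels = []
  const values = []
  for (const row of rows) {
    if (row === null || row === undefined) {
      defLevels.push(0) // list itself is null
      repLevels.push(0)
    } else if (row.length === 0) {
      defLevels.push(1) // list is present but empty
      repLevels.push(0)
    } else {
      row.forEach((value, i) => {
        defLevels.push(2) // element is present
        repLevels.push(i === 0 ? 0 : 1) // first element starts the row
        values.push(value)
      })
    }
  }
  return {
    defLevels,
    repLevels,
    values,
    num_rows: rows.length,
    num_values: defLevels.length, // counts level entries, including null and empty lists
  }
}

const encoded = encodeSimpleList([[1, 2], null, [], [3]])
console.log(encoded)
```

Note that num_values (5) differs from both num_rows (4) and the number of stored values (3), which is why page metadata for repeated fields needs separate care.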


9) Mental model summary (one-page)

[File bytes]
  -> footer -> metadata
  -> schema tree
  -> plan byte ranges
  -> row group -> column chunk -> pages
  -> decode (levels + values)
  -> convert + dictionary decode
  -> assembleNested (LIST/MAP/STRUCT)
  -> rows

Writer gap:
  encodeNested (values -> levels + flat values) is missing

10) References (source files)

Hyparquet reader core:

  • src/read.js
  • src/rowgroup.js
  • src/plan.js
  • src/column.js
  • src/datapage.js
  • src/assemble.js
  • src/schema.js
  • src/metadata.js

Writer gaps:

  • hyparquet-writer src/parquet-writer.js
  • hyparquet-writer src/dremel.js
  • hyparquet-writer src/column.js
  • hyparquet-writer src/datapage.js