Hyparquet: mental model (reader-side) + writer gaps

Scope: This doc is based on the hyparquet npm package (v1.24.0) source and the hyparquet-writer repo. It focuses on how hyparquet reads Parquet (schema, pages, Dremel) and why writer-side LIST/STRUCT support is still partial.

1) Big picture: from bytes -> rows

Parquet file (bytes)
  |
  |--[footer]-- thrift metadata + 4-byte metadata length + PAR1 (located from the file end)
  |
  v
metadata tree + row groups
  |
  |-- plan byte ranges (row groups + columns + indexes)
  |
  v
row group(s) -> column chunks -> pages
  |
  |-- decode pages (levels + values)
  |-- dictionary decode + convert logical types
  |
  v
assemble nested (LIST/MAP/STRUCT)
  |
  v
rows (object or array format)

Key idea: Parquet is columnar. Hyparquet reads column pages, then re-assembles nested values using Dremel definition/repetition levels.

2) Schema model

Hyparquet builds a schema tree from the Parquet schema elements. Each node stores:

element (SchemaElement)
children
path (array of names from root)

From this tree it derives:

getSchemaPath(schema, path) => path from root to a given logical field
getPhysicalColumns(tree) => all leaf (physical) columns (dot-joined paths)
list/map shape detection (isListLike, isMapLike)

ASCII picture for nested schema:

root
└─ person (STRUCT)
   ├─ name (STRING)
   └─ phones (LIST)
      └─ list (REPEATED)
         └─ element (STRING)

physical columns:
- person.name
- person.phones.list.element

Physical columns are the leaf nodes; nested columns are reconstructed later.
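
The tree and its leaf list can be sketched in a few lines. This is an illustration, not hyparquet's code: buildTree and leafPaths are hypothetical names, and the only assumption about the input is that Parquet stores schema elements as a flat depth-first list where num_children says how many of the following elements belong to a node.

```javascript
// Hypothetical helper, not hyparquet's implementation: rebuild a tree from the
// flat, depth-first list of schema elements stored in Parquet metadata.
function buildTree(elements, index = 0, parentPath = []) {
  const element = elements[index]
  // The root's own name is not part of column paths.
  const path = index === 0 ? [] : [...parentPath, element.name]
  const children = []
  let next = index + 1
  for (let i = 0; i < (element.num_children ?? 0); i++) {
    const child = buildTree(elements, next, path)
    children.push(child.node)
    next = child.next
  }
  return { node: { element, children, path }, next }
}

// Leaves (nodes without children) are the physical columns.
function leafPaths(node) {
  if (node.children.length === 0) return [node.path.join('.')]
  return node.children.flatMap(leafPaths)
}

const elements = [
  { name: 'root', num_children: 1 },
  { name: 'person', num_children: 2 },
  { name: 'name' },
  { name: 'phones', num_children: 1 },
  { name: 'list', num_children: 1 },
  { name: 'element' },
]
const tree = buildTree(elements).node
const columns = leafPaths(tree)
// columns: ['person.name', 'person.phones.list.element']
```

The output matches the two physical columns listed above for the person schema.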

3) Metadata + planning

Hyparquet reads metadata from the footer, validates PAR1, and parses the thrift metadata. It then builds a plan of byte ranges to fetch based on:

row range
selected columns
optional filters
optional offset indexes

Important: physical columns are used for planning, but the API works in terms of top-level logical columns.
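
The footer step can be illustrated without any library. A minimal sketch, assuming the standard Parquet tail layout (thrift metadata, then a 4-byte little-endian length, then the PAR1 magic); locateMetadata is a hypothetical name, and the thrift parsing itself is left out.

```javascript
// Sketch only: locate the metadata block from the last 8 bytes of a file.
// A Parquet file ends with: <thrift metadata> <4-byte LE length> "PAR1".
function locateMetadata(bytes) {
  if (bytes.length < 8) throw new Error('file too short')
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const magic = String.fromCharCode(...bytes.subarray(bytes.length - 4))
  if (magic !== 'PAR1') throw new Error('missing PAR1 magic')
  const metadataLength = view.getUint32(bytes.length - 8, true)
  const metadataStart = bytes.length - 8 - metadataLength
  if (metadataStart < 4) throw new Error('metadata length out of range')
  return { metadataStart, metadataLength }
}

// Fake file: leading magic, 3 bytes standing in for thrift metadata, footer.
const fake = new Uint8Array([
  0x50, 0x41, 0x52, 0x31, // "PAR1" header
  1, 2, 3,                // placeholder metadata bytes
  3, 0, 0, 0,             // metadata length = 3 (little endian)
  0x50, 0x41, 0x52, 0x31, // "PAR1" footer
])
const footer = locateMetadata(fake)
// footer: { metadataStart: 4, metadataLength: 3 }
```

Because the length sits at a fixed distance from the end, a reader only needs a small suffix of the file to find the metadata and start planning.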

4) Row groups + column chunks

For each selected row group:

build a columnDecoder for each chunk (schema path + types)
optionally use offset indexes to fetch only necessary pages
read column chunk pages into decoded arrays

The reader returns column chunks, not rows. Row assembly happens later.
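
To make the offset-index step concrete, here is a rough sketch of picking only the pages that overlap a requested row range. pagesForRows is a hypothetical helper, not a hyparquet function. The field names follow the Parquet PageLocation struct (offset, compressed_page_size, first_row_index), and plain numbers are assumed for offsets even though the format defines them as 64-bit integers.

```javascript
// Hypothetical helper: pick the pages of one column chunk that overlap the
// requested rows [rowStart, rowEnd). `pageLocations` mirrors the Parquet
// OffsetIndex fields (offset, compressed_page_size, first_row_index).
function pagesForRows(pageLocations, chunkRows, rowStart, rowEnd) {
  const ranges = []
  for (let i = 0; i < pageLocations.length; i++) {
    const page = pageLocations[i]
    const firstRow = page.first_row_index
    // A page ends where the next one starts, or at the end of the chunk.
    const lastRow = i + 1 < pageLocations.length
      ? pageLocations[i + 1].first_row_index
      : chunkRows
    if (firstRow < rowEnd && lastRow > rowStart) {
      ranges.push({
        startByte: page.offset,
        endByte: page.offset + page.compressed_page_size,
      })
    }
  }
  return ranges
}

const pageLocations = [
  { offset: 100, compressed_page_size: 50, first_row_index: 0 },
  { offset: 150, compressed_page_size: 60, first_row_index: 1000 },
  { offset: 210, compressed_page_size: 40, first_row_index: 2000 },
]
const ranges = pagesForRows(pageLocations, 3000, 1500, 2100)
// ranges: bytes [150, 210) and [210, 250); the first page is skipped
```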

5) Page decoding + Dremel levels

Each data page includes:

definition levels (null tracking for optional fields, plus empty vs. missing lists)
repetition levels (LIST/MAP nesting)
encoded values (plain, dictionary, delta, byte stream split, etc.)

Hyparquet logic:

readDataPage ->
  read repetition levels
  read definition levels
  read encoded values (nValues = num_values - numNulls)
  return { defLevels, repLevels, values }

This is exactly the Dremel representation that lets you reconstruct nested lists.
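
How the maximum levels fall out of the schema, and why fewer values than levels are stored, can be sketched as follows. The helper names are hypothetical, and the repetition_type strings ('OPTIONAL', 'REPEATED', 'REQUIRED') are an assumption about how the schema elements are represented.

```javascript
// Sketch: derive max definition/repetition levels for a leaf from the
// repetition_type of each schema element on its path (root excluded).
function maxLevels(pathElements) {
  let maxDef = 0
  let maxRep = 0
  for (const { repetition_type } of pathElements) {
    if (repetition_type === 'OPTIONAL') maxDef++
    if (repetition_type === 'REPEATED') { maxDef++; maxRep++ }
  }
  return { maxDef, maxRep }
}

// Only entries at the maximum definition level have a stored value;
// everything below it marks a null or an empty list somewhere on the path.
function countValues(defLevels, maxDef) {
  return defLevels.filter(level => level === maxDef).length
}

const phonesPath = [
  { name: 'person', repetition_type: 'OPTIONAL' },
  { name: 'phones', repetition_type: 'OPTIONAL' },
  { name: 'list', repetition_type: 'REPEATED' },
  { name: 'element', repetition_type: 'REQUIRED' },
]
const levels = maxLevels(phonesPath)
// levels: { maxDef: 3, maxRep: 1 }
const stored = countValues([3, 3, 1, 2, 3], levels.maxDef)
// stored: 3 values for 5 level entries
```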

6) Reconstructing nested data

6.1 LIST reconstruction

assembleLists() takes:

flat values
def + rep levels
schema path

and walks the Dremel levels to rebuild nested arrays.

Intuition:

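def level says how many optional or repeated ancestors on the path are actually present, i.e. how "deep" a value is defined
rep level says where a value attaches: 0 starts a new row, a higher level continues a list at that nesting depth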

ASCII mini example (optional list of required ints):

rows: [ [1,2], null, [], [3] ]

def:  [2,2,0,1,2]   (0=null list, 1=empty list, 2=max, value present)
rep:  [0,1,0,0,0]
vals: [1,2,3]

=> [[1,2], null, [], [3]]
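
A minimal sketch of that walk, limited to one nesting level (an optional list of required values). It assumes def 0 = null list, def 1 = empty list, def 2 = element present, and rep 0 = new row. assembleSimpleList is a hypothetical name; the real assembleLists() handles arbitrary depth.

```javascript
// Sketch for one nesting level only: an optional list of required values.
// Assumed level meaning: def 0 = null list, def 1 = empty list,
// def 2 = element present; rep 0 = new row, rep 1 = same list continues.
function assembleSimpleList(defLevels, repLevels, values) {
  const rows = []
  let valueIndex = 0
  for (let i = 0; i < defLevels.length; i++) {
    const def = defLevels[i]
    if (repLevels[i] === 0) {
      // Start a new row: null, or a fresh (possibly empty) array.
      rows.push(def === 0 ? null : [])
    }
    if (def === 2) {
      // Only fully defined entries consume a value from the flat array.
      rows[rows.length - 1].push(values[valueIndex++])
    }
  }
  return rows
}

const rows = assembleSimpleList([2, 2, 0, 1, 2], [0, 1, 0, 0, 0], [1, 2, 3])
// rows: [[1, 2], null, [], [3]]
```

Note that null and [] are only distinguishable because they carry different definition levels.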

6.2 STRUCT reconstruction

assembleNested() gathers physical leaf columns for a struct and inverts them into row objects. It handles:

STRUCT (group nodes)
LIST (list nodes)
MAP (map nodes)
VARIANT logical type

Key idea: nested reconstruction is post-processing of leaf columns.
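
The "invert columns into row objects" step looks roughly like this for a flat struct. This is a simplified sketch with a hypothetical assembleStruct helper: it assumes the child columns are already decoded to one entry per row, and it ignores null structs, which would need the definition levels.

```javascript
// Hypothetical sketch: zip already-decoded child columns into row objects.
function assembleStruct(columns) {
  const names = Object.keys(columns)
  const rowCount = names.length ? columns[names[0]].length : 0
  const rows = []
  for (let i = 0; i < rowCount; i++) {
    const row = {}
    for (const name of names) row[name] = columns[name][i]
    rows.push(row)
  }
  return rows
}

const people = assembleStruct({
  name: ['Ada', 'Grace'],
  phones: [['555-0100'], []],
})
// people: [{ name: 'Ada', phones: ['555-0100'] }, { name: 'Grace', phones: [] }]
```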

7) Why writer-side LIST/STRUCT is still partial

The hyparquet reader supports nested data (LIST/MAP/STRUCT), but hyparquet-writer lags behind.

Current writer limitations (summary):

STRUCT columns are blocked by the writer path: it assumes a 1:1 column->leaf mapping and throws on multi-child schema nodes.
LIST encoding is partial: list support only triggers for the canonical LIST layout (converted_type=LIST with repeated child).
No general Dremel encoder: writer has list-specific encoding, but not the generalized def/rep encoder needed for STRUCT, LIST-of-STRUCT, STRUCT-of-LIST, etc.
Legacy list layouts (REPEATED leaf) are rejected by writer’s page encoder.

So: reader can assemble nested structures, but writer can’t yet generate the Dremel levels and physical leaf columns needed to round-trip them.

8) What’s needed to fully support LIST + STRUCT in writer

At a high level, hyparquet-writer must:

Accept leaf columns with schema paths (or accept row-objects and flatten them).
Generate definition + repetition levels for all nested structures (generalized Dremel encoder).
Write correct page metadata (num_rows, num_values, null counts) for repeated fields.
Validate schema vs data shape and support LIST + STRUCT layout variants.

Once those exist, the reader’s existing assembly code will work end-to-end.
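
To show the shape of the missing piece, here is the encoding direction for the simplest case only (an optional list of required values). encodeSimpleList is a hypothetical helper, not part of hyparquet-writer; a generalized encoder would have to walk the schema tree and handle arbitrary nesting.

```javascript
// Sketch of the writer-side direction for one case only: an optional list of
// required values. Hypothetical helper, not part of hyparquet-writer.
function encodeSimpleList(rows) {
  const defLevels = []
  const repLevels = []
  const values = []
  for (const row of rows) {
    if (row === null || row === undefined) {
      defLevels.push(0); repLevels.push(0)   // null list
    } else if (row.length === 0) {
      defLevels.push(1); repLevels.push(0)   // empty list
    } else {
      row.forEach((value, i) => {
        defLevels.push(2)                     // element present
        repLevels.push(i === 0 ? 0 : 1)       // 0 starts the row, 1 continues
        values.push(value)
      })
    }
  }
  // For repeated fields the level count differs from the row count.
  return { defLevels, repLevels, values, numRows: rows.length, numValues: defLevels.length }
}

const encoded = encodeSimpleList([[1, 2], null, [], [3]])
// encoded.defLevels: [2, 2, 0, 1, 2]
// encoded.repLevels: [0, 1, 0, 0, 0]
// encoded.values:    [1, 2, 3]
// encoded.numRows: 4, encoded.numValues: 5
```

The split between numRows and numValues is the reason page metadata needs care for repeated fields: four rows produce five level entries but only three stored values.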

9) Mental model summary (one-page)

[File bytes]
  -> footer -> metadata
  -> schema tree
  -> plan byte ranges
  -> row group -> column chunk -> pages
  -> decode (levels + values)
  -> convert + dictionary decode
  -> assembleNested (LIST/MAP/STRUCT)
  -> rows

Writer gap:
  encodeNested (values -> levels + flat values) is missing

10) References (source files)

Hyparquet reader core:

src/read.js
src/rowgroup.js
src/plan.js
src/column.js
src/datapage.js
src/assemble.js
src/schema.js
src/metadata.js

Writer gaps:

hyparquet-writer src/parquet-writer.js
hyparquet-writer src/dremel.js
hyparquet-writer src/column.js
hyparquet-writer src/datapage.js

cfahlgren1/hyparquet-mental-model.md
