Scope: This doc is based on the hyparquet npm package (v1.24.0) source and the hyparquet-writer repo. It focuses on how hyparquet reads Parquet (schema, pages, Dremel) and why writer-side LIST/STRUCT support is still partial.
Parquet file (bytes)
|
|--[footer]-- PAR1 + metadata length + metadata thrift
|
v
metadata tree + row groups
|
|-- plan byte ranges (row groups + columns + indexes)
|
v
row group(s) -> column chunks -> pages
|
|-- decode pages (levels + values)
|-- dictionary decode + convert logical types
|
v
assemble nested (LIST/MAP/STRUCT)
|
v
rows (object or array format)
Key idea: Parquet is columnar. Hyparquet reads column pages, then re-assembles nested values using Dremel definition/repetition levels.
Hyparquet builds a schema tree from the Parquet schema elements. Each node stores:
- element (SchemaElement)
- children
- path (array of names from root)
From this tree it derives:
- getSchemaPath(schema, path) => path from root to a given logical field
- getPhysicalColumns(tree) => all leaf (physical) columns (dot-joined paths)
- list/map shape detection (isListLike, isMapLike)
ASCII picture for nested schema:
root
└─ person (STRUCT)
├─ name (STRING)
└─ phones (LIST)
└─ list (REPEATED)
└─ element (STRING)
physical columns:
- person.name
- person.phones.list.element
Physical columns are the leaf nodes; nested columns are reconstructed later.
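As a rough illustration of the tree described above, the sketch below builds nodes with element, children, and path from a flat schema list and then collects the leaf paths. It relies on the Parquet convention that the schema is stored depth-first with num_children on each group. The names buildTree and leafPaths are hypothetical and chosen for this example; they are not hyparquet's functions.

```javascript
// Hypothetical helpers for illustration; not hyparquet's source.
// Parquet stores the schema as a flat, depth-first list in which each
// group element carries num_children.
function buildTree(elements, index = 0, parentPath = []) {
  const element = elements[index]
  // The root element's name is not part of column paths.
  const path = index === 0 ? [] : [...parentPath, element.name]
  const children = []
  let next = index + 1
  const count = element.num_children || 0
  for (let i = 0; i < count; i++) {
    const child = buildTree(elements, next, path)
    children.push(child.node)
    next = child.next
  }
  return { node: { element, children, path }, next }
}

// Collect dot-joined paths of leaf (physical) columns.
function leafPaths(node) {
  if (node.children.length === 0) return [node.path.join('.')]
  return node.children.flatMap(leafPaths)
}

// The person schema from the picture above, as a flat element list.
const personSchema = [
  { name: 'root', num_children: 1 },
  { name: 'person', num_children: 2 },
  { name: 'name', type: 'BYTE_ARRAY' },
  { name: 'phones', num_children: 1, converted_type: 'LIST' },
  { name: 'list', num_children: 1, repetition_type: 'REPEATED' },
  { name: 'element', type: 'BYTE_ARRAY' },
]
console.log(leafPaths(buildTree(personSchema).node))
// [ 'person.name', 'person.phones.list.element' ]
```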
Hyparquet reads metadata from the footer, validates PAR1, and parses the thrift metadata. It then builds a plan of byte ranges to fetch based on:
- row range
- selected columns
- optional filters
- optional offset indexes
Important: physical columns are used for planning, but the API works in terms of top-level logical columns.
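The footer step can be sketched as follows. A Parquet file ends with the thrift metadata, then a 4-byte little-endian metadata length, then the PAR1 magic, so the last 8 bytes are enough to locate the metadata. The function name locateFooter is an assumption made for this example and is not a hyparquet export; thrift parsing itself is left out.

```javascript
// Sketch only: locateFooter is a hypothetical name.
// File tail layout: <thrift metadata> <4-byte LE length> "PAR1"
function locateFooter(bytes) {
  if (bytes.length < 12) throw new Error('file too short to be Parquet')
  const magic = String.fromCharCode(...bytes.subarray(bytes.length - 4))
  if (magic !== 'PAR1') throw new Error('missing PAR1 magic')
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const metadataLength = view.getUint32(bytes.length - 8, true)
  const metadataStart = bytes.length - 8 - metadataLength
  // The file also begins with a 4-byte PAR1 magic, so metadata cannot
  // start before byte 4.
  if (metadataStart < 4) throw new Error('metadata length exceeds file size')
  return { metadataStart, metadataLength }
}
```

With remote files, this is why a reader can fetch only the tail first and then request the byte ranges it actually needs.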
For each selected row group:
- build a columnDecoder for each chunk (schema path + types)
- optionally use offset indexes to fetch only necessary pages
- read column chunk pages into decoded arrays
The reader returns column chunks, not rows. Row assembly happens later.
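To make the offset-index step concrete, here is a minimal sketch of choosing which pages overlap a requested row range. The entry fields (offset, compressed_page_size, first_row_index) follow the Parquet OffsetIndex page locations; the function pagesForRows and its return shape are assumptions for this example, not hyparquet's API.

```javascript
// Hypothetical helper; rowEnd is exclusive.
function pagesForRows(pageLocations, rowStart, rowEnd, numRows) {
  const ranges = []
  for (let i = 0; i < pageLocations.length; i++) {
    const page = pageLocations[i]
    // Number() because 64-bit thrift fields may be parsed as BigInt.
    const firstRow = Number(page.first_row_index)
    // A page ends where the next one starts, or at the end of the row group.
    const lastRow = i + 1 < pageLocations.length
      ? Number(pageLocations[i + 1].first_row_index)
      : numRows
    if (firstRow < rowEnd && lastRow > rowStart) {
      const startByte = Number(page.offset)
      ranges.push({ startByte, endByte: startByte + page.compressed_page_size })
    }
  }
  return ranges
}

const pageLocations = [
  { offset: 4, compressed_page_size: 50, first_row_index: 0 },
  { offset: 54, compressed_page_size: 60, first_row_index: 100 },
  { offset: 114, compressed_page_size: 40, first_row_index: 200 },
]
// Rows 150..249 touch only the second and third pages.
console.log(pagesForRows(pageLocations, 150, 250, 300))
```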
Each data page includes:
- definition levels (optional / null tracking)
- repetition levels (LIST/MAP nesting)
- encoded values (plain, dictionary, delta, byte stream split, etc.)
Hyparquet logic:
readDataPage ->
read repetition levels
read definition levels
read encoded values (nValues = num_values - numNulls)
return { defLevels, repLevels, values }
This is exactly the Dremel representation that lets you reconstruct nested lists.
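The nValues arithmetic can be illustrated with a small sketch: only entries whose definition level reaches the column's maximum carry an encoded value, so every other entry counts as a null for decoding purposes. The helper name countEncodedValues is hypothetical, and the exact comparison hyparquet uses is not shown here.

```javascript
// Hypothetical helper: how many encoded values follow the levels in a page.
function countEncodedValues(defLevels, maxDefinitionLevel) {
  let numNulls = 0
  for (const def of defLevels) {
    // Below the max level, some optional or repeated field on the path
    // is absent, so no value is stored for this entry.
    if (def < maxDefinitionLevel) numNulls++
  }
  return defLevels.length - numNulls
}

// Five level entries, three of them fully defined.
console.log(countEncodedValues([2, 2, 0, 1, 2], 2)) // 3
```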
assembleLists() takes:
- flat values
- def + rep levels
- schema path
and walks the Dremel levels to rebuild nested arrays.
Intuition:
- def level says how many optional or repeated fields along the path are actually present (how "deep" the value is defined)
- rep level says at which list depth this value continues an existing list; 0 means it starts a new row
ASCII mini example (optional list of required ints):
rows: [ [1,2], null, [], [3] ]
def: [2,2,0,1,2] (0=null list, 1=empty list, 2=max)
rep: [0,1,0,0,0]
vals: [1,2,3]
=> [[1,2], null, [], [3]]
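A minimal sketch of this walk for a single list level is shown below. It assumes an optional list of required elements, so definition level 0 means a null list, 1 an empty list, and 2 a present value, and the maximum repetition level is 1. The function assembleSimpleList is a hypothetical simplification; assembleLists in hyparquet handles arbitrary nesting depth.

```javascript
// Illustrative one-level version, not hyparquet's implementation.
function assembleSimpleList(defLevels, repLevels, values) {
  const rows = []
  let valueIndex = 0
  for (let i = 0; i < defLevels.length; i++) {
    const def = defLevels[i]
    if (repLevels[i] === 0) {
      // rep 0 starts a new row: null list if def is 0, otherwise a list.
      rows.push(def === 0 ? null : [])
    }
    if (def === 2) {
      // Fully defined entry: consume the next flat value.
      rows[rows.length - 1].push(values[valueIndex++])
    }
  }
  return rows
}

console.log(assembleSimpleList([2, 2, 0, 1, 2], [0, 1, 0, 0, 0], [1, 2, 3]))
// [ [ 1, 2 ], null, [], [ 3 ] ]
```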
assembleNested() gathers physical leaf columns for a struct and inverts them into row objects.
It handles:
- STRUCT (group nodes)
- LIST (list nodes)
- MAP (map nodes)
- VARIANT logical type
Key idea: nested reconstruction is post-processing of leaf columns.
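The post-processing idea can be sketched for the STRUCT case alone. The example assumes each column has already been assembled to one value per row and is keyed by its dot-joined logical path; zipStruct is a hypothetical name, and null structs, MAP, and VARIANT are deliberately ignored.

```javascript
// Hypothetical sketch: zip per-row leaf columns into nested row objects.
function zipStruct(columns, numRows) {
  const rows = []
  for (let r = 0; r < numRows; r++) {
    const row = {}
    for (const [path, data] of Object.entries(columns)) {
      const parts = path.split('.')
      let target = row
      // Walk or create intermediate struct objects.
      for (const part of parts.slice(0, -1)) {
        if (!(part in target)) target[part] = {}
        target = target[part]
      }
      target[parts[parts.length - 1]] = data[r]
    }
    rows.push(row)
  }
  return rows
}

const columns = {
  'person.name': ['Ada', 'Bo'],
  'person.phones': [['555-0100'], []],
}
console.log(JSON.stringify(zipStruct(columns, 2)))
// [{"person":{"name":"Ada","phones":["555-0100"]}},{"person":{"name":"Bo","phones":[]}}]
```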
The hyparquet reader supports nested data (LIST/MAP/STRUCT), but hyparquet-writer lags behind.
Current writer limitations (summary):
- STRUCT columns are blocked by the writer path: it assumes a 1:1 column->leaf mapping and throws on multi-child schema nodes.
- LIST encoding is partial: list support only triggers for the canonical LIST layout (converted_type=LIST with repeated child).
- No general Dremel encoder: writer has list-specific encoding, but not the generalized def/rep encoder needed for STRUCT, LIST-of-STRUCT, STRUCT-of-LIST, etc.
- Legacy list layouts (REPEATED leaf) are rejected by writer’s page encoder.
So: reader can assemble nested structures, but writer can’t yet generate the Dremel levels and physical leaf columns needed to round-trip them.
At a high level, hyparquet-writer must:
- Accept leaf columns with schema paths (or accept row-objects and flatten them).
- Generate definition + repetition levels for all nested structures (generalized Dremel encoder).
- Write correct page metadata (num_rows, num_values, null counts) for repeated fields.
- Validate schema vs data shape and support LIST + STRUCT layout variants.
Once those exist, the reader’s existing assembly code will work end-to-end.
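For a sense of what the encoding direction involves, here is a sketch limited to one list level: an optional list of required elements, with definition level 0 for a null list, 1 for an empty list, and 2 for a present value. The function encodeSimpleList and its return shape are hypothetical. A generalized encoder would have to track levels across every optional, repeated, and struct node on the path.

```javascript
// Hypothetical one-level encoder: rows -> levels + flat values.
function encodeSimpleList(rows) {
  const defLevels = []
  const repLevels = []
  const values = []
  for (const row of rows) {
    if (row === null) {
      defLevels.push(0) // null list
      repLevels.push(0)
    } else if (row.length === 0) {
      defLevels.push(1) // empty list
      repLevels.push(0)
    } else {
      row.forEach((value, i) => {
        defLevels.push(2)
        // First element starts a new row; the rest continue the list.
        repLevels.push(i === 0 ? 0 : 1)
        values.push(value)
      })
    }
  }
  // For repeated fields the level count differs from the row count.
  return { defLevels, repLevels, values, numRows: rows.length, numValues: defLevels.length }
}

console.log(encodeSimpleList([[1, 2], null, [], [3]]))
```

In this example numRows is 4 while numValues is 5, which is the distinction the page metadata has to preserve for repeated fields.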
[File bytes]
-> footer -> metadata
-> schema tree
-> plan byte ranges
-> row group -> column chunk -> pages
-> decode (levels + values)
-> convert + dictionary decode
-> assembleNested (LIST/MAP/STRUCT)
-> rows
Writer gap:
encodeNested (values -> levels + flat values) is missing
Hyparquet reader core:
- src/read.js
- src/rowgroup.js
- src/plan.js
- src/column.js
- src/datapage.js
- src/assemble.js
- src/schema.js
- src/metadata.js
Writer gaps:
- hyparquet-writer src/parquet-writer.js
- hyparquet-writer src/dremel.js
- hyparquet-writer src/column.js
- hyparquet-writer src/datapage.js