@ryan-williams
Last active December 5, 2025 02:06

hyparam/hyparquet#142 For discussion: Add suffixStart option to parquetMetadataAsync

Adds a suffixStart option to parquetMetadataAsync that allows the caller to specify a byte offset to start fetching from (instead of fetching the last initialFetchSize bytes).

It's analogous to fetching bytes [idx:] (idx to EOF) instead of [-n:] (current behavior: last n bytes).
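To make the slice analogy concrete, here is a minimal sketch using an in-memory `Uint8Array` in place of a remote file. The helper names `lastBytes` and `bytesFrom` are hypothetical and exist only for illustration; they are not part of hyparquet.

```typescript
// Hypothetical helpers illustrating the two ways to address a file suffix.

/** Last `n` bytes, like [-n:] (what initialFetchSize expresses today). */
function lastBytes(buf: Uint8Array, n: number): Uint8Array {
  return buf.slice(Math.max(0, buf.length - n))
}

/** Everything from `idx` to EOF, like [idx:] (what suffixStart would express). */
function bytesFrom(buf: Uint8Array, idx: number): Uint8Array {
  return buf.slice(idx)
}

// An append-only "file" before and after it grows by two bytes.
const before = Uint8Array.from([0, 1, 2, 3, 4, 5])
const after = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7])

// A fixed-size window slides as the file grows, so its start offset changes.
const windowBefore = lastBytes(before, 2) // bytes at offsets 4..5
const windowAfter = lastBytes(after, 2)   // bytes at offsets 6..7

// A fixed start offset keeps its left edge and simply picks up the new bytes.
const suffixBefore = bytesFrom(before, 4) // bytes at offsets 4..5
const suffixAfter = bytesFrom(after, 4)   // bytes at offsets 4..7
```

With a fixed start offset, the caller does not need to know how much the file grew in order to re-read a known region plus whatever was appended after it.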

Motivation: polling append-only Parquet files

I built a dashboard that reads from append-only Parquet files in S3:

  1. On load: fetch last 512KB of file.
    • We primarily want the footer, but that's usually much smaller, so we get (and cache) a few recent row groups as well.
  2. Every minute: fetch from [last row group's start offset] to EOF
    • The purpose is to re-fetch the last row group (which is expected to have grown by 1 row each minute) plus the new footer.
    • I already know the byte offset to fetch from (from the previous footer fetch, assuming row-group start offsets are immutable), but initialFetchSize can't express that directly.
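As a rough sketch of where that byte offset can come from: a row group's start is the smallest page offset among its column chunks. The interfaces below are simplified, hypothetical shapes that borrow the Parquet footer field names (`dictionary_page_offset`, `data_page_offset`); a real parsed footer may expose these as `bigint` and would need a `Number(...)` conversion. The function name `rowGroupStartByte` is also made up for this example.

```typescript
// Hypothetical, simplified footer shapes (field names follow the Parquet format).
interface ColumnMeta {
  dictionary_page_offset?: number
  data_page_offset: number
}
interface ColumnChunk {
  meta_data?: ColumnMeta
}
interface RowGroup {
  columns: ColumnChunk[]
}

/** Smallest page offset across a row group's columns, or undefined if unknown. */
function rowGroupStartByte(rowGroup: RowGroup): number | undefined {
  let start: number | undefined
  for (const { meta_data } of rowGroup.columns) {
    if (!meta_data) continue
    const dict = meta_data.dictionary_page_offset
    // Offset 0 holds the file's magic bytes, so a dictionary offset of 0
    // is treated here as "no dictionary page".
    const offset = dict !== undefined && dict > 0 ? dict : meta_data.data_page_offset
    if (start === undefined || offset < start) start = offset
  }
  return start
}

// Example: the second column has a dictionary page that precedes its data page.
const lastRowGroup: RowGroup = {
  columns: [
    { meta_data: { data_page_offset: 4096 } },
    { meta_data: { dictionary_page_offset: 5000, data_page_offset: 5200 } },
  ],
}
const lastRowGroupStart = rowGroupStartByte(lastRowGroup)
```

Caching this value after each footer fetch gives the next poll its start offset without any extra requests.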

Example usage

From runsascoded/awair parquetCache.ts:

// Initial fetch: use initialFetchSize (last N bytes)
this.metadata = await parquetMetadataAsync(asyncBuffer, {
  initialFetchSize: this.initialFetchSize
})

// Refresh: fetch from last RG start to EOF
const lastRgInfo = this.rowGroupInfos[this.rowGroupInfos.length - 1]
const fetchStart = lastRgInfo.startByte
this.metadata = await parquetMetadataAsync(asyncBuffer, {
  suffixStart: fetchStart  // "fetch from byte fetchStart to EOF"
})

Changes

  • Add suffixStart option to MetadataAsyncOptions
  • When provided, use it directly as the fetch start offset instead of calculating byteLength - initialFetchSize
  • Backwards compatible: existing behavior is unchanged when suffixStart is omitted
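The offset selection described in these bullets could look roughly like the sketch below. This is not the actual hyparquet implementation: the function name `resolveFetchStart`, the options interface, the 512KB default, and the clamping to the file bounds are all assumptions made for illustration.

```typescript
// Hypothetical option shape; only the two option names come from the PR.
interface FetchStartOptions {
  initialFetchSize?: number
  suffixStart?: number
}

// Assumed default for this sketch (matches the 512KB used in the dashboard above).
const DEFAULT_INITIAL_FETCH_SIZE = 512 * 1024

/** Byte offset at which the suffix fetch should begin. */
function resolveFetchStart(byteLength: number, options: FetchStartOptions = {}): number {
  const { initialFetchSize = DEFAULT_INITIAL_FETCH_SIZE, suffixStart } = options
  if (suffixStart !== undefined) {
    // New path: the caller names the start offset directly.
    // Clamping is a choice made in this sketch, not something the PR specifies.
    return Math.min(Math.max(0, suffixStart), byteLength)
  }
  // Existing path: last initialFetchSize bytes, never before the start of the file.
  return Math.max(0, byteLength - initialFetchSize)
}

const legacyStart = resolveFetchStart(1000, { initialFetchSize: 100 }) // 900
const explicitStart = resolveFetchStart(1000, { suffixStart: 250 })    // 250
```

In this sketch suffixStart takes precedence when both options are passed; whether that is the right behavior (versus rejecting the combination) is part of the API question raised below.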

Happy to discuss whether this is the right API shape, or whether it's too niche a use case to include here.

I also used runsascoded/gh-pnpm-dist to publish 6d0c51c (from b81f95d), which includes this change. My dashboard already uses that build, so I'm not blocked on upstreaming this.
