hyparam/hyparquet#142 For discussion: Add suffixStart option to parquetMetadataAsync
Adds a suffixStart option to parquetMetadataAsync that allows the caller to specify a byte offset to start fetching from (instead of fetching the last initialFetchSize bytes).
It's analogous to fetching bytes [idx:] (idx to EOF) instead of [-n:] (current behavior: last n bytes).
I built a dashboard that reads from append-only Parquet files in S3:
- On load: fetch the last 512KB of the file.
  - We primarily want the footer, but that's usually much smaller, so we get (and cache) a few recent row groups as well.
- Every minute: fetch from [last row group's start offset] to EOF.
  - The purpose is to re-fetch the last row group (which is expected to have grown by 1 row each minute) plus the new footer.
  - I know the byte offset to fetch from (from my previous footer fetch; I also expect row-group start offsets to be immutable), but initialFetchSize doesn't let me express that directly.
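For readers unfamiliar with where that offset comes from, here is a minimal sketch of deriving a row group's start byte from footer metadata. The interfaces are hypothetical and only modeled on the Parquet footer fields data_page_offset and dictionary_page_offset; they are not hyparquet's actual types, and a real footer may expose these 64-bit offsets as bigint, which would need converting. This is not necessarily how the dashboard computes startByte.

```typescript
// Hypothetical minimal shapes, modeled on Parquet footer fields.
// Offsets are plain numbers here for simplicity (an assumption).
interface ColumnMeta {
  data_page_offset: number
  dictionary_page_offset?: number
}
interface RowGroupMeta {
  columns: { meta_data?: ColumnMeta }[]
}

// Start byte of a row group: the smallest page offset among its column chunks.
function rowGroupStart(rowGroup: RowGroupMeta): number | undefined {
  let start: number | undefined
  for (const { meta_data } of rowGroup.columns) {
    if (!meta_data) continue
    const dict = meta_data.dictionary_page_offset
    // A dictionary page, when present, precedes the chunk's data pages.
    // Ignore a missing or zero dictionary offset and fall back to the data page.
    const offset =
      dict !== undefined && dict > 0 && dict < meta_data.data_page_offset
        ? dict
        : meta_data.data_page_offset
    if (start === undefined || offset < start) start = offset
  }
  return start
}

// Start byte of the final row group, or undefined for a file with none.
function lastRowGroupStart(rowGroups: RowGroupMeta[]): number | undefined {
  const last = rowGroups[rowGroups.length - 1]
  return last ? rowGroupStart(last) : undefined
}
```

Because the file is append-only, the value returned for the last row group stays valid as the refresh start until a new row group is written after it.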
From runsascoded/awair parquetCache.ts:
// Initial fetch: use initialFetchSize (last N bytes)
this.metadata = await parquetMetadataAsync(asyncBuffer, {
  initialFetchSize: this.initialFetchSize
})
// Refresh: fetch from last RG start to EOF
const lastRgInfo = this.rowGroupInfos[this.rowGroupInfos.length - 1]
const fetchStart = lastRgInfo.startByte
this.metadata = await parquetMetadataAsync(asyncBuffer, {
  suffixStart: fetchStart // "fetch from byte fetchStart to EOF"
})

- Add suffixStart option to MetadataAsyncOptions
- When provided, use it directly instead of calculating byteLength - initialFetchSize
- Backwards compatible: existing behavior unchanged
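
The behavior in the list above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code in this PR: SuffixOptions, resolveFetchStart, and the default size constant are hypothetical names, and the bounds check and clamping to zero are additions for the sake of the example.

```typescript
// Hypothetical option shape using the option names from this PR.
interface SuffixOptions {
  initialFetchSize?: number
  suffixStart?: number
}

// Assumed default for illustration only; the library's real default may differ.
const ASSUMED_INITIAL_FETCH_SIZE = 512 * 1024

// Returns the byte offset at which the suffix fetch [start, byteLength) begins.
function resolveFetchStart(byteLength: number, options: SuffixOptions = {}): number {
  const { initialFetchSize = ASSUMED_INITIAL_FETCH_SIZE, suffixStart } = options
  if (suffixStart !== undefined) {
    // Explicit start: analogous to [idx:]. The validation is an added assumption.
    if (suffixStart < 0 || suffixStart > byteLength) {
      throw new RangeError(`suffixStart ${suffixStart} is outside [0, ${byteLength}]`)
    }
    return suffixStart
  }
  // Existing behavior: analogous to [-n:], clamped for files smaller than n.
  return Math.max(0, byteLength - initialFetchSize)
}
```

With a 1000-byte file, `resolveFetchStart(1000, { initialFetchSize: 100 })` gives 900 and `resolveFetchStart(1000, { suffixStart: 250 })` gives 250, so callers that never pass suffixStart see no change.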
Happy to discuss whether this is the right API shape, or whether it's too niche of a use-case to bother including here.
I also used runsascoded/gh-pnpm-dist to publish 6d0c51c (from b81f95d), including this change, which I then use in my dashboard, so I'm not blocked on upstreaming this.