@dhrp
Created March 12, 2026 21:29
s3fs _find improvement explanation

_find prefix + withdirs + maxdepth — what changed and why

Background

s3fs.find() (and glob(), which calls find() internally) accepts a prefix parameter that filters results server-side:

# Only return objects whose key starts with "2024-"
s3.find("my-bucket/logs", prefix="2024-")

Before this fix, combining prefix with withdirs=True or maxdepth raised a hard error:

ValueError: Can not specify 'prefix' option alongside 'withdirs'/'maxdepth' options.

This broke glob() whenever fsspec passed a prefix hint into it, because glob() always calls find() with withdirs=True.


The three calling modes

_find now has three distinct code paths depending on what arguments are combined.

1. No maxdepth (with or without prefix)

s3.find("my-bucket/logs", prefix="2024-")
s3.find("my-bucket/logs", prefix="2024-", withdirs=True)

Uses a flat listing: ListObjectsV2 with delimiter="", which returns every object under the path in one (paginated) response. S3 applies the prefix server-side, so only matching keys come back. Directory entries are then synthesised in Python from the returned key paths.

This was already correct before the fix — the only change is that withdirs=True no longer raises.
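The directory-synthesis step can be sketched in pure Python. This is an illustrative helper, not the real s3fs internal (the name `synthesize_dirs` is made up): given the flat list of keys S3 returned, every implied parent path becomes a directory entry.

```python
# Hypothetical sketch of mode 1's directory synthesis: a flat listing
# returns only object keys, so the "directories" are derived client-side.

def synthesize_dirs(keys):
    """Derive the set of implied parent directories from object keys."""
    dirs = set()
    for key in keys:
        parts = key.split("/")[:-1]              # drop the file name
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))        # every ancestor is a dir
    return sorted(dirs)

# A flat ListObjectsV2(Prefix="logs/2024-", Delimiter="") might return:
keys = [
    "logs/2024-01/app/error.log",
    "logs/2024-01/db/slow.log",
    "logs/2024-02/app/error.log",
]
print(synthesize_dirs(keys))
```

With withdirs=True these synthesised entries are merged into the result; with withdirs=False they are only used for bookkeeping.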

2. maxdepth without prefix

s3.find("my-bucket/logs", maxdepth=2)
s3.find("my-bucket/logs", maxdepth=2, withdirs=True)

Delegates entirely to the fsspec base-class tree walk: ListObjectsV2 with delimiter="/" per level, recursing up to maxdepth levels deep. Efficient because S3 only returns one "directory" at a time instead of all keys.

Unchanged by this fix.

3. maxdepth with prefix (new)

s3.find("my-bucket/logs", prefix="2024-", maxdepth=2)
s3.find("my-bucket/logs", prefix="2024-", maxdepth=2, withdirs=True)

Previously: ValueError.

Now: delimiter-based first-level listing with server-side prefix filter, then normal recursive descent into matching subdirs.
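The three-way dispatch can be summarised as a small decision function. This is a sketch of the routing logic only; the function name and return labels are invented for clarity and do not mirror s3fs's actual code.

```python
# Illustrative routing for the three _find calling modes.

def choose_find_strategy(prefix=None, maxdepth=None):
    if maxdepth is None:
        return "flat-listing"            # mode 1: delimiter="" + server-side prefix
    if not prefix:
        return "base-class-walk"         # mode 2: delegate to fsspec tree walk
    return "prefixed-first-level"        # mode 3: delimiter listing + recursion

print(choose_find_strategy(prefix="2024-"))
print(choose_find_strategy(maxdepth=2))
print(choose_find_strategy(prefix="2024-", maxdepth=2))
```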


Why the naive approach (flat list + trim) was rejected

An earlier version of this fix used the flat listing path for maxdepth + prefix too, then trimmed results by depth in Python:

bucket/
  logs/
    2024-01/
      day-01/  (1,000 objects)
      day-02/  (1,000 objects)
      ...
      day-31/  (1,000 objects)
    2024-02/
      ...

With find("my-bucket/logs", prefix="2024-", maxdepth=1) you want only the two top-level directory entries (2024-01/, 2024-02/, …). But the flat listing downloads all 31,000+ objects from S3 and throws them away after the depth check. For large buckets this is catastrophically wasteful.
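The waste is easy to quantify with a toy model of the rejected approach (the `depth` helper is hypothetical, matching the layout above: 31 day-directories of 1,000 objects each). Every key crosses the wire before the client-side depth check discards it.

```python
# Sketch of the rejected flat-list-then-trim approach: depth filtering
# happens only after every key has already been downloaded.

def depth(key, root="logs"):
    """Depth of a key relative to the listing root."""
    rel = key[len(root) + 1:]
    return rel.count("/") + 1

keys = [f"logs/2024-01/day-{d:02d}/obj-{i}"
        for d in range(1, 32) for i in range(1000)]

fetched = len(keys)                              # all keys cross the wire
kept = sum(1 for k in keys if depth(k) <= 1)     # maxdepth=1 keeps none
print(fetched, kept)                             # 31000 fetched, 0 kept
```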


How the efficient approach works

Given this bucket layout:

my-bucket/
  logs/
    2024-01/
      app/
        error.log
        access.log
      db/
        slow.log
    2024-02/
      app/
        error.log
    2023-12/           ← does NOT match prefix "2024-"
      app/
        error.log

find("my-bucket/logs", prefix="2024-", maxdepth=1)

Step 1 — one ListObjectsV2(Prefix="logs/2024-", Delimiter="/") call:

Contents (files at depth 1 matching prefix):  (none in this example)
CommonPrefixes (dirs at depth 1 matching prefix):
  logs/2024-01/
  logs/2024-02/

maxdepth=1 → stop here, do not recurse.

API calls: 1. Result (withdirs=False): []. Result (withdirs=True): ["my-bucket/logs/2024-01", "my-bucket/logs/2024-02"].

2023-12/ was never fetched.
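Step 1 can be modelled in memory. The sketch below mimics ListObjectsV2's Prefix + Delimiter semantics (Contents for keys with no delimiter past the prefix, CommonPrefixes for the rest); a real call would go through boto3, and `list_objects` here is a stand-in, not an s3fs function.

```python
# Minimal in-memory model of ListObjectsV2 with Prefix + Delimiter.

def list_objects(keys, prefix, delimiter="/"):
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue                              # server-side prefix filter
        rest = key[len(prefix):]
        if delimiter in rest:
            # rolled up into a CommonPrefix, deeper keys never transferred
            common_prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

keys = [
    "logs/2024-01/app/error.log", "logs/2024-01/app/access.log",
    "logs/2024-01/db/slow.log", "logs/2024-02/app/error.log",
    "logs/2023-12/app/error.log",
]
print(list_objects(keys, "logs/2024-"))
# ([], ['logs/2024-01/', 'logs/2024-02/'])
```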


find("my-bucket/logs", prefix="2024-", maxdepth=2)

Step 1 — same first-level call as above, yields dirs 2024-01/, 2024-02/.

Step 2 — recurse into each with maxdepth=1 (no prefix, we're already inside a matching subdir):

  • ListObjectsV2(Prefix="logs/2024-01/", Delimiter="/") → app/, db/
  • ListObjectsV2(Prefix="logs/2024-02/", Delimiter="/") → app/


maxdepth exhausted, stop.

API calls: 3 (1 + 2). Result (withdirs=False): [] (all files are at depth 3). Result (withdirs=True): the five directory entries (2024-01/, 2024-02/, 2024-01/app/, 2024-01/db/, 2024-02/app/).
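The whole maxdepth=2 walk can be sketched end to end: one prefixed first-level call, then one plain delimiter call per matching subdir, mirroring the tally of 3 API calls. All names here (`list_objects`, `find`) are illustrative stand-ins, not the real s3fs internals.

```python
# Hypothetical sketch of the prefixed walk with a call counter.

def list_objects(keys, prefix, delimiter="/"):
    contents, dirs = [], set()
    for k in keys:
        if k.startswith(prefix):
            rest = k[len(prefix):]
            if delimiter in rest:
                dirs.add(prefix + rest.split(delimiter)[0] + delimiter)
            else:
                contents.append(k)
    return contents, sorted(dirs)

def find(keys, root, prefix, maxdepth):
    calls, files, dirs = 0, [], []
    # first call carries the server-side prefix; recursion does not
    frontier = [(root + "/" + prefix, maxdepth)]
    while frontier:
        pfx, depth_left = frontier.pop()
        contents, subdirs = list_objects(keys, pfx)
        calls += 1
        files += contents
        dirs += [d.rstrip("/") for d in subdirs]
        if depth_left > 1:
            frontier += [(d, depth_left - 1) for d in subdirs]
    return files, dirs, calls

keys = [
    "logs/2024-01/app/error.log", "logs/2024-01/app/access.log",
    "logs/2024-01/db/slow.log", "logs/2024-02/app/error.log",
    "logs/2023-12/app/error.log",
]
files, dirs, calls = find(keys, "logs", "2024-", maxdepth=2)
print(calls, files, sorted(dirs))
```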


find("my-bucket/logs", prefix="2024-", maxdepth=3)

Step 1 — same first-level call.

Step 2 — recurse into 2024-01/ and 2024-02/ with maxdepth=2:

  • Each of those calls super()._find() (base-class tree walk with delimiter="/") which descends two more levels and returns the actual log files.

API calls: ~5. Result: all four log files under 2024-*/.

2023-12/ is never touched at any depth.


Comparison: many subdirs, deep nesting

Imagine a data lake with this layout:

my-bucket/
  events/
    region=us-east-1/
      year=2024/
        month=01/ … month=12/   (each with ~10k objects)
    region=eu-west-1/
      year=2024/
        …
    region=ap-southeast-1/
      year=2023/                 ← does NOT match prefix "region=us-"
        …
| Call | Old behaviour | New behaviour |
| --- | --- | --- |
| find("my-bucket/events", prefix="region=us-", maxdepth=1) | ValueError | 1 API call, returns region=us-east-1/ only |
| find("my-bucket/events", prefix="region=us-", maxdepth=2) | ValueError | 2 API calls |
| find("my-bucket/events", prefix="region=us-", maxdepth=3) | ValueError | ~14 calls (1 + 1 + 12 months) |
| find("my-bucket/events", prefix="region=us-") (no maxdepth) | worked | still flat listing; downloads all keys under region=us-*, which may be large but correct |

The region=ap-southeast-1/ subtree is never paged through regardless of depth.


glob() connection

glob("my-bucket/events/region=us-*/**") is the call that originally broke. Internally, fsspec decomposes the glob pattern into a literal prefix (region=us-) and a wildcard suffix, then calls find(..., prefix="region=us-", withdirs=True). That combination hit the old guard immediately. With the fix it routes through the efficient path above.
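The prefix extraction itself is simple to illustrate: take everything before the first glob metacharacter. This is a rough sketch of the idea only; fsspec's real decomposition lives in its glob implementation and `literal_prefix` is a made-up name.

```python
import re

def literal_prefix(pattern):
    """Everything before the first glob metacharacter (*, ?, [)."""
    m = re.search(r"[*?\[]", pattern)
    return pattern if m is None else pattern[:m.start()]

print(literal_prefix("region=us-*/**"))   # region=us-
```

The wildcard suffix is then matched client-side against the keys that the prefixed find() returns.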
