@dhrp
Created March 12, 2026 21:29
s3fs _find improvement explanation

_find prefix + withdirs + maxdepth — what changed and why

Background

s3fs.find() (and glob(), which calls find() internally) accepts a prefix parameter that filters results server-side:

# Only return objects whose key starts with "2024-"
s3.find("my-bucket/logs", prefix="2024-")

Before this fix, combining prefix with withdirs=True or maxdepth raised a hard error:

ValueError: Can not specify 'prefix' option alongside 'withdirs'/'maxdepth' options.

This broke glob() whenever fsspec passed a prefix hint into it, because glob() always calls find() with withdirs=True.


The three calling modes

_find now has three distinct code paths depending on what arguments are combined.

1. No maxdepth (with or without prefix)

s3.find("my-bucket/logs", prefix="2024-")
s3.find("my-bucket/logs", prefix="2024-", withdirs=True)

Uses a flat listing: ListObjectsV2 with delimiter="", which returns every object under the path in one (paginated) response. S3 applies the prefix server-side, so only matching keys come back. Directory entries are then synthesised in Python from the returned key paths.

This was already correct before the fix — the only change is that withdirs=True no longer raises.
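The directory-synthesis step can be sketched in pure Python. This is an illustrative helper, not the real s3fs internal (the name `synthesize_dirs` is made up): given the flat list of keys S3 returned, every implied parent path becomes a directory entry.

```python
# Hypothetical sketch of mode 1's directory synthesis: a flat listing
# returns only object keys, so the "directories" are derived client-side.

def synthesize_dirs(keys):
    """Derive the set of implied parent directories from object keys."""
    dirs = set()
    for key in keys:
        parts = key.split("/")[:-1]              # drop the file name
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))        # every ancestor is a dir
    return sorted(dirs)

# A flat ListObjectsV2(Prefix="logs/2024-", Delimiter="") might return:
keys = [
    "logs/2024-01/app/error.log",
    "logs/2024-01/db/slow.log",
    "logs/2024-02/app/error.log",
]
print(synthesize_dirs(keys))
```

With withdirs=True these synthesised entries are merged into the result; with withdirs=False they are only used for bookkeeping.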

2. maxdepth without prefix

s3.find("my-bucket/logs", maxdepth=2)
s3.find("my-bucket/logs", maxdepth=2, withdirs=True)

Delegates entirely to the fsspec base-class tree walk: ListObjectsV2 with delimiter="/" per level, recursing up to maxdepth levels deep. Efficient because S3 only returns one "directory" at a time instead of all keys.

Unchanged by this fix.

3. maxdepth with prefix (new)

s3.find("my-bucket/logs", prefix="2024-", maxdepth=2)
s3.find("my-bucket/logs", prefix="2024-", maxdepth=2, withdirs=True)

Previously: ValueError.

Now: delimiter-based first-level listing with server-side prefix filter, then normal recursive descent into matching subdirs.
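The three-way dispatch can be summarised as a small decision function. This is a sketch of the routing logic only; the function name and return labels are invented for clarity and do not mirror s3fs's actual code.

```python
# Illustrative routing for the three _find calling modes.

def choose_find_strategy(prefix=None, maxdepth=None):
    if maxdepth is None:
        return "flat-listing"            # mode 1: delimiter="" + server-side prefix
    if not prefix:
        return "base-class-walk"         # mode 2: delegate to fsspec tree walk
    return "prefixed-first-level"        # mode 3: delimiter listing + recursion

print(choose_find_strategy(prefix="2024-"))
print(choose_find_strategy(maxdepth=2))
print(choose_find_strategy(prefix="2024-", maxdepth=2))
```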


Why the naive approach (flat list + trim) was rejected

An earlier version of this fix used the flat listing path for maxdepth + prefix too, then trimmed results by depth in Python:

bucket/
  logs/
    2024-01/
      day-01/  (1,000 objects)
      day-02/  (1,000 objects)
      ...
      day-31/  (1,000 objects)
    2024-02/
      ...

With find("my-bucket/logs", prefix="2024-", maxdepth=1) you want only the two top-level directory entries (2024-01/, 2024-02/, …). But the flat listing downloads all 31,000+ objects from S3 and throws them away after the depth check. For large buckets this is catastrophically wasteful.
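The waste is easy to quantify with a toy model of the rejected approach (the `depth` helper is hypothetical, matching the layout above: 31 day-directories of 1,000 objects each). Every key crosses the wire before the client-side depth check discards it.

```python
# Sketch of the rejected flat-list-then-trim approach: depth filtering
# happens only after every key has already been downloaded.

def depth(key, root="logs"):
    """Depth of a key relative to the listing root."""
    rel = key[len(root) + 1:]
    return rel.count("/") + 1

keys = [f"logs/2024-01/day-{d:02d}/obj-{i}"
        for d in range(1, 32) for i in range(1000)]

fetched = len(keys)                              # all keys cross the wire
kept = sum(1 for k in keys if depth(k) <= 1)     # maxdepth=1 keeps none
print(fetched, kept)                             # 31000 fetched, 0 kept
```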


How the efficient approach works

Given this bucket layout:

my-bucket/
  logs/
    2024-01/
      app/
        error.log
        access.log
      db/
        slow.log
    2024-02/
      app/
        error.log
    2023-12/           ← does NOT match prefix "2024-"
      app/
        error.log

find("my-bucket/logs", prefix="2024-", maxdepth=1)

Step 1 — one ListObjectsV2(Prefix="logs/2024-", Delimiter="/") call:

Contents (files at depth 1 matching prefix):  (none in this example)
CommonPrefixes (dirs at depth 1 matching prefix):
  logs/2024-01/
  logs/2024-02/

maxdepth=1 → stop here, do not recurse.

API calls: 1. Result (withdirs=False): []. Result (withdirs=True): ["my-bucket/logs/2024-01", "my-bucket/logs/2024-02"].

2023-12/ was never fetched.
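Step 1 can be modelled in memory. The sketch below mimics ListObjectsV2's Prefix + Delimiter semantics (Contents for keys with no delimiter past the prefix, CommonPrefixes for the rest); a real call would go through boto3, and `list_objects` here is a stand-in, not an s3fs function.

```python
# Minimal in-memory model of ListObjectsV2 with Prefix + Delimiter.

def list_objects(keys, prefix, delimiter="/"):
    contents, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue                              # server-side prefix filter
        rest = key[len(prefix):]
        if delimiter in rest:
            # rolled up into a CommonPrefix, deeper keys never transferred
            common_prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)

keys = [
    "logs/2024-01/app/error.log", "logs/2024-01/app/access.log",
    "logs/2024-01/db/slow.log", "logs/2024-02/app/error.log",
    "logs/2023-12/app/error.log",
]
print(list_objects(keys, "logs/2024-"))
# ([], ['logs/2024-01/', 'logs/2024-02/'])
```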


find("my-bucket/logs", prefix="2024-", maxdepth=2)

Step 1 — same first-level call as above, yields dirs 2024-01/, 2024-02/.

Step 2 — recurse into each with maxdepth=1 (no prefix, we're already inside a matching subdir):

  • ListObjectsV2(Prefix="logs/2024-01/", Delimiter="/") → app/, db/
  • ListObjectsV2(Prefix="logs/2024-02/", Delimiter="/") → app/


maxdepth exhausted, stop.

API calls: 3 (1 + 2). Result (withdirs=False): [] (all files are at depth 3). Result (withdirs=True): the five directory entries (2024-01/, 2024-02/, 2024-01/app/, 2024-01/db/, 2024-02/app/).
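The whole maxdepth=2 walk can be sketched end to end: one prefixed first-level call, then one plain delimiter call per matching subdir, mirroring the tally of 3 API calls. All names here (`list_objects`, `find`) are illustrative stand-ins, not the real s3fs internals.

```python
# Hypothetical sketch of the prefixed walk with a call counter.

def list_objects(keys, prefix, delimiter="/"):
    contents, dirs = [], set()
    for k in keys:
        if k.startswith(prefix):
            rest = k[len(prefix):]
            if delimiter in rest:
                dirs.add(prefix + rest.split(delimiter)[0] + delimiter)
            else:
                contents.append(k)
    return contents, sorted(dirs)

def find(keys, root, prefix, maxdepth):
    calls, files, dirs = 0, [], []
    # first call carries the server-side prefix; recursion does not
    frontier = [(root + "/" + prefix, maxdepth)]
    while frontier:
        pfx, depth_left = frontier.pop()
        contents, subdirs = list_objects(keys, pfx)
        calls += 1
        files += contents
        dirs += [d.rstrip("/") for d in subdirs]
        if depth_left > 1:
            frontier += [(d, depth_left - 1) for d in subdirs]
    return files, dirs, calls

keys = [
    "logs/2024-01/app/error.log", "logs/2024-01/app/access.log",
    "logs/2024-01/db/slow.log", "logs/2024-02/app/error.log",
    "logs/2023-12/app/error.log",
]
files, dirs, calls = find(keys, "logs", "2024-", maxdepth=2)
print(calls, files, sorted(dirs))
```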


find("my-bucket/logs", prefix="2024-", maxdepth=3)

Step 1 — same first-level call.

Step 2 — recurse into 2024-01/ and 2024-02/ with maxdepth=2:

  • Each of those calls super()._find() (base-class tree walk with delimiter="/") which descends two more levels and returns the actual log files.

API calls: ~5. Result: all four log files under 2024-*/.

2023-12/ is never touched at any depth.


Comparison: many subdirs, deep nesting

Imagine a data lake with this layout:

my-bucket/
  events/
    region=us-east-1/
      year=2024/
        month=01/ … month=12/   (each with ~10k objects)
    region=eu-west-1/
      year=2024/
        …
    region=ap-southeast-1/
      year=2023/                 ← does NOT match prefix "region=us-"
        …
| Call | Old behaviour | New behaviour |
| --- | --- | --- |
| find("my-bucket/events", prefix="region=us-", maxdepth=1) | ValueError | 1 API call, returns region=us-east-1/ only |
| find("my-bucket/events", prefix="region=us-", maxdepth=2) | ValueError | 2 API calls |
| find("my-bucket/events", prefix="region=us-", maxdepth=3) | ValueError | ~14 calls (1 + 1 + 12 months) |
| find("my-bucket/events", prefix="region=us-") (no maxdepth) | worked | still flat listing; downloads all keys under region=us-*, which may be large but correct |

The region=ap-southeast-1/ subtree is never paged through regardless of depth.


glob() connection

glob("my-bucket/events/region=us-*/**") is the call that originally broke. Internally, fsspec decomposes the glob pattern into a literal prefix (region=us-) and a wildcard suffix, then calls find(..., prefix="region=us-", withdirs=True). That combination hit the old guard immediately. With the fix it routes through the efficient path above.
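The prefix extraction itself is simple to illustrate: take everything before the first glob metacharacter. This is a rough sketch of the idea only; fsspec's real decomposition lives in its glob implementation and `literal_prefix` is a made-up name.

```python
import re

def literal_prefix(pattern):
    """Everything before the first glob metacharacter (*, ?, [)."""
    m = re.search(r"[*?\[]", pattern)
    return pattern if m is None else pattern[:m.start()]

print(literal_prefix("region=us-*/**"))   # region=us-
```

The wildcard suffix is then matched client-side against the keys that the prefixed find() returns.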
