s3fs.find() (and glob(), which calls find() internally) accepts a prefix
parameter that filters results server-side:

    # Only return objects under logs/ whose name starts with "2024-"
    s3.find("my-bucket/logs", prefix="2024-")

Before this fix, combining prefix with withdirs=True or maxdepth raised a
hard error:

    ValueError: Can not specify 'prefix' option alongside 'withdirs'/'maxdepth' options.

This broke glob() whenever a prefix hint was passed into it, because glob()
always calls find() with withdirs=True.

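
To make the failure mode concrete, here is a minimal sketch of a guard that would produce the error above. The function name `old_find_guard` is hypothetical and the condition is inferred from the error message, not copied from the s3fs source.

```python
def old_find_guard(prefix="", withdirs=False, maxdepth=None):
    """Hypothetical reconstruction of the pre-fix argument check."""
    if prefix and (withdirs or maxdepth):
        raise ValueError(
            "Can not specify 'prefix' option alongside "
            "'withdirs'/'maxdepth' options."
        )


# glob() always passes withdirs=True, so any prefix hint tripped the guard.
try:
    old_find_guard(prefix="2024-", withdirs=True)
except ValueError as exc:
    print(exc)
```

A prefix on its own passes the check, which is why plain find(..., prefix=...) worked before the fix.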
_find now has three distinct code paths depending on what arguments are
combined.

Path 1: prefix without maxdepth

    s3.find("my-bucket/logs", prefix="2024-")
    s3.find("my-bucket/logs", prefix="2024-", withdirs=True)

Uses a flat listing: ListObjectsV2 with delimiter="", which returns
every object under the path in one (paginated) listing. S3 applies the prefix
server-side, so only matching keys come back. Synthesised directory entries are
built in Python from the returned key paths.

This path was already correct before the fix; the only change is that
withdirs=True no longer raises.

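
As an illustration of how directory entries can be derived from a flat key listing, here is a minimal sketch. The helper `synthesize_dirs` is a hypothetical name and is not taken from s3fs; it only shows the idea of walking each key's parents up to the listing root.

```python
def synthesize_dirs(keys, root):
    """Return the directory paths implied by `keys`, strictly below `root`."""
    root = root.rstrip("/")
    dirs = set()
    for key in keys:
        parent = key.rsplit("/", 1)[0]
        # Walk upward from the key's parent until the listing root is reached.
        while len(parent) > len(root) and parent.startswith(root + "/"):
            dirs.add(parent)
            parent = parent.rsplit("/", 1)[0]
    return sorted(dirs)


keys = [
    "my-bucket/logs/2024-01/app/error.log",
    "my-bucket/logs/2024-01/db/slow.log",
    "my-bucket/logs/2024-02/app/error.log",
]
print(synthesize_dirs(keys, "my-bucket/logs"))
# ['my-bucket/logs/2024-01', 'my-bucket/logs/2024-01/app',
#  'my-bucket/logs/2024-01/db', 'my-bucket/logs/2024-02',
#  'my-bucket/logs/2024-02/app']
```

No extra S3 requests are needed for this step, since everything is computed from keys that the flat listing already returned.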

Path 2: maxdepth without prefix

    s3.find("my-bucket/logs", maxdepth=2)
    s3.find("my-bucket/logs", maxdepth=2, withdirs=True)

Delegates entirely to the fsspec base-class tree walk: ListObjectsV2 with
delimiter="/" per level, recursing up to maxdepth levels deep. This is efficient
because each request returns a single "directory" level instead of every key below it.
Unchanged by this fix.


Path 3: prefix with maxdepth

    s3.find("my-bucket/logs", prefix="2024-", maxdepth=2)
    s3.find("my-bucket/logs", prefix="2024-", maxdepth=2, withdirs=True)

Previously: ValueError.
Now: a delimiter-based first-level listing with a server-side prefix filter, followed by normal recursive descent into the matching subdirectories.

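
The routing between the three paths can be summarised with a small sketch. The function `choose_find_path` and its return labels are hypothetical; treating a call with neither prefix nor maxdepth as a flat listing is an assumption, since that case is not discussed here.

```python
def choose_find_path(prefix="", maxdepth=None):
    """Name the listing strategy for a given argument combination (sketch)."""
    if maxdepth is None:
        # Prefix only (or, by assumption, neither argument): flat listing.
        return "flat-listing"
    if not prefix:
        # maxdepth only: fsspec base-class tree walk, one level per request.
        return "base-class-tree-walk"
    # Both: prefixed first-level listing, then recursive descent.
    return "prefixed-first-level-then-walk"


print(choose_find_path(prefix="2024-"))              # flat-listing
print(choose_find_path(maxdepth=2))                  # base-class-tree-walk
print(choose_find_path(prefix="2024-", maxdepth=2))  # prefixed-first-level-then-walk
```

withdirs is deliberately absent from the sketch: per the descriptions above it changes what is returned, not which path is taken.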

An earlier version of this fix used the flat listing path for maxdepth + prefix too, then trimmed results by depth in Python. Consider this layout:

    my-bucket/
      logs/
        2024-01/
          day-01/   (1,000 objects)
          day-02/   (1,000 objects)
          ...
          day-31/   (1,000 objects)
        2024-02/
          ...

With find("my-bucket/logs", prefix="2024-", maxdepth=1) you want only the
top-level directory entries (2024-01/, 2024-02/, …). But the flat
listing fetches all 31,000+ keys beneath them from S3 and discards them after
the depth check. For large buckets this is catastrophically wasteful.

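
A rough way to quantify the waste is to count response pages, since ListObjectsV2 returns at most 1,000 keys per response. The sketch below builds a hypothetical in-memory key set (two months, 31 days each, 1,000 objects per day; the object names are made up) and compares the two strategies. It does not talk to S3.

```python
import math

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response

# Hypothetical layout: 2 months x 31 days x 1,000 objects per day.
keys = [
    f"logs/2024-{month:02d}/day-{day:02d}/obj-{i:04d}"
    for month in (1, 2)
    for day in range(1, 32)
    for i in range(1000)
]

list_prefix = "logs/2024-"

# Flat listing: every matching key comes back, then gets trimmed by depth.
flat_matches = [k for k in keys if k.startswith(list_prefix)]
flat_pages = math.ceil(len(flat_matches) / PAGE_SIZE)

# Delimiter listing: keys are rolled up at the first "/" after the prefix.
common_prefixes = sorted(
    {k[: k.index("/", len(list_prefix)) + 1] for k in flat_matches}
)
delimited_pages = math.ceil(len(common_prefixes) / PAGE_SIZE)

print(len(flat_matches), flat_pages)     # 62000 62
print(common_prefixes, delimited_pages)  # ['logs/2024-01/', 'logs/2024-02/'] 1
```

For maxdepth=1 the delimiter-based request returns the same two directory entries in a single page, while the flat listing needs at least 62.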

Given this bucket layout:

    my-bucket/
      logs/
        2024-01/
          app/
            error.log
            access.log
          db/
            slow.log
        2024-02/
          app/
            error.log
        2023-12/          ← does NOT match prefix "2024-"
          app/
            error.log

With maxdepth=1:

Step 1: one ListObjectsV2(Prefix="logs/2024-", Delimiter="/") call:

    Contents (files at depth 1 matching prefix): (none in this example)
    CommonPrefixes (dirs at depth 1 matching prefix):
      logs/2024-01/
      logs/2024-02/

maxdepth=1 → stop here, do not recurse.

API calls: 1. Result (withdirs=False): []. Result (withdirs=True):
["my-bucket/logs/2024-01", "my-bucket/logs/2024-02"].
2023-12/ was never fetched.

With maxdepth=2:

Step 1: same first-level call as above, which yields dirs 2024-01/ and 2024-02/.
Step 2: recurse into each with maxdepth=1 (no prefix, because we are already
inside a matching subdir):
- ListObjectsV2(Prefix="logs/2024-01/", Delimiter="/") → app/, db/
- ListObjectsV2(Prefix="logs/2024-02/", Delimiter="/") → app/

maxdepth exhausted, stop.

API calls: 3 (1 + 2). Result (withdirs=False): [] (all files are at depth
3). Result (withdirs=True): the five directory entries (2024-01, 2024-02,
2024-01/app, 2024-01/db, 2024-02/app).

With maxdepth=3:

Step 1: same first-level call.
Step 2: recurse into 2024-01/ and 2024-02/ with maxdepth=2:
- Each of those calls super()._find() (the base-class tree walk with delimiter="/"), which descends two more levels and returns the actual log files.

API calls: ~5. Result: all four log files under 2024-*/.
2023-12/ is never touched at any depth.

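
The walk-through above can be replayed against an in-memory stand-in for S3. This is a sketch only: `FakeS3` and `find_sketch` are hypothetical names, bucket names are omitted from the keys, and a single recursive function replaces the delegation to super()._find(). It mimics how Prefix plus Delimiter="/" rolls keys up into CommonPrefixes, and records every request.

```python
KEYS = [
    "logs/2024-01/app/error.log",
    "logs/2024-01/app/access.log",
    "logs/2024-01/db/slow.log",
    "logs/2024-02/app/error.log",
    "logs/2023-12/app/error.log",
]


class FakeS3:
    """In-memory stand-in for ListObjectsV2 with Delimiter="/"."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = []  # one entry per simulated API call

    def list_level(self, prefix):
        self.requests.append(prefix)
        files, dirs = [], set()
        for key in self.keys:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if "/" in rest:
                # Roll up to the first delimiter after the prefix.
                dirs.add(prefix + rest.split("/", 1)[0] + "/")
            else:
                files.append(key)
        return files, sorted(dirs)


def find_sketch(s3, path, prefix="", maxdepth=None, withdirs=False):
    out = []

    def walk(list_prefix, depth):
        files, dirs = s3.list_level(list_prefix)
        out.extend(files)
        for d in dirs:
            if withdirs:
                out.append(d.rstrip("/"))
            if maxdepth is None or depth < maxdepth:
                walk(d, depth + 1)  # no prefix below the first level

    # The prefix filter applies to the first-level request only.
    walk(path.rstrip("/") + "/" + prefix, 1)
    return sorted(out)


s3 = FakeS3(KEYS)
print(find_sketch(s3, "logs", prefix="2024-", maxdepth=1, withdirs=True))
# ['logs/2024-01', 'logs/2024-02']
print(len(s3.requests))  # 1

s3 = FakeS3(KEYS)
print(find_sketch(s3, "logs", prefix="2024-", maxdepth=2))  # []
print(len(s3.requests))  # 3

s3 = FakeS3(KEYS)
print(find_sketch(s3, "logs", prefix="2024-", maxdepth=3))
print(any(p.startswith("logs/2023") for p in s3.requests))  # False
```

In this simulation no request is ever issued for a prefix under logs/2023-12/, which is the property the new path is meant to guarantee.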

Imagine a data lake with this layout:

    my-bucket/
      events/
        region=us-east-1/
          year=2024/
            month=01/ … month=12/   (each with ~10k objects)
        region=eu-west-1/
          year=2024/
            …
        region=ap-southeast-1/
          year=2023/                ← does NOT match prefix "region=us-"
            …

| Call | Old behaviour | New behaviour |
|---|---|---|
| find("my-bucket/events", prefix="region=us-", maxdepth=1) | ValueError | 1 API call, returns region=us-east-1/ only |
| find("my-bucket/events", prefix="region=us-", maxdepth=2) | ValueError | 2 API calls |
| find("my-bucket/events", prefix="region=us-", maxdepth=3) | ValueError | ~14 calls (1 + 1 + 12 months) |
| find("my-bucket/events", prefix="region=us-") (no maxdepth) | worked | still a flat listing: returns all keys under region=us-*, which may be large but is correct |

The region=ap-southeast-1/ subtree is never paged through regardless of depth.

glob("my-bucket/events/region=us-*/**") is the call that originally broke.
Internally, fsspec decomposes the glob pattern into a literal prefix (region=us-)
and a wildcard suffix, then calls find(..., prefix="region=us-", withdirs=True).
That combination hit the old guard immediately. With the fix it routes through the
efficient path above.
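
The decomposition step can be illustrated with a short sketch. `split_glob` is a hypothetical helper written for this description, not fsspec's implementation; it splits a pattern at its first wildcard character and then at the last "/" before it.

```python
def split_glob(pattern):
    """Split a glob into (root, literal_prefix) at the first wildcard."""
    positions = [pattern.find(ch) for ch in "*?[" if ch in pattern]
    if not positions:
        return pattern, ""  # no wildcard: nothing to use as a prefix hint
    literal = pattern[: min(positions)]
    root, _, prefix = literal.rpartition("/")
    return root, prefix


print(split_glob("my-bucket/events/region=us-*/**"))
# ('my-bucket/events', 'region=us-')
```

The two parts map onto the path and prefix arguments of the find() call described above.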