GDAL Distributed Caching

The process-local cache problem

The goal of this document is to describe GDAL's current caching mechanisms, how they fail at scale, and how distributed caching could help. We'll use COG headers as the running example, since that is the case I know best, but the same applies to other formats like Zarr.

Caching Mechanisms

GDAL maintains two caching mechanisms: the "VSI cache", which caches recently accessed I/O through any VSI driver (e.g. a COG header fetched over HTTP), and the "block cache", which caches raster blocks. I am not familiar with the implementation details of either cache; the important thing for this discussion is that both caches are local to each process.
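For context, both caches are tuned per process through GDAL configuration options. Here is a minimal rasterio sketch; the option names are real GDAL config options, but the sizes and URL are illustrative only:

```python
import rasterio

# Each worker process gets its own copies of these caches; nothing is shared
# between processes or machines.
with rasterio.Env(
    GDAL_CACHEMAX=512,                      # block cache budget (interpreted as MB here)
    VSI_CACHE=True,                         # enable the VSI cache for /vsicurl, /vsis3, etc.
    VSI_CACHE_SIZE=64 * 1024 * 1024,        # VSI cache budget, in bytes
    CPL_VSIL_CURL_CACHE_SIZE=16 * 1024 * 1024,  # cache of downloaded /vsicurl regions, in bytes
    GDAL_INGESTED_BYTES_AT_OPEN=32768,      # bytes prefetched at open (enough for many COG headers)
):
    with rasterio.open("https://example.com/cog.tif") as src:  # hypothetical URL
        src.read(1)
```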

Failure Case

When cloud workloads get large enough, we must scale horizontally across multiple processes and potentially multiple machines. Here are a few real-world examples:

  1. eoAPI contains a kubernetes (k8s) deployment of titiler which adds containers as CPU usage increases, scaling out as traffic against the dynamic tile server grows.
  2. rioxarray may be used along with dask.distributed to process large rasters that don't fit within the memory of a single machine.

The details differ, but in both cases we are running many independent GDAL processes (via rasterio) across potentially multiple machines. The eoAPI case is the simpler one: k8s uses round-robin load balancing, which forwards each request to the next container (GDAL process) in turn. If there are 10 containers all serving content from the same COG, the same COG header must be fetched 10 times before it is fully cached (VSI cache) across all containers.

This begins to matter at large scale. The following diagram shows the efficiency of a process-local cache as the number of containers (GDAL processes) increases, where efficiency compares the number of requests it takes to fully cache the COG header (VSI cache) across all containers against the number of requests a perfectly distributed cache would need. We also assume GDAL_INGESTED_BYTES_AT_OPEN is set large enough to fetch the COG header in a single request.

[Figure: efficiency of a process-local cache vs. number of containers (GDAL processes)]

The figure above is specific to the titiler / round-robin use case. It will look different for other setups (e.g. dask.distributed), but the overall trend remains the same: process-local caches become less efficient as you scale horizontally.
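As a back-of-the-envelope sketch of where that trend comes from (not the exact model behind the figure above): with round-robin routing, every container must request the header once before it is warm everywhere, while a perfectly shared cache needs only one origin request.

```python
def origin_requests_to_warm(n_containers: int) -> int:
    """Round-robin: each container must fetch the header once before all caches are warm."""
    return n_containers

def efficiency(n_containers: int) -> float:
    """Requests a perfectly shared cache needs (1) relative to what process-local caches need (N)."""
    return 1 / origin_requests_to_warm(n_containers)

for n in (1, 2, 10, 100):
    print(n, efficiency(n))  # 1.0, 0.5, 0.1, 0.01 -> efficiency decays as you scale out
```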

The larger point is that GDAL's process-local cache is fundamentally at odds with how cloud workloads scale. The CNG community is increasingly standardizing on the blob store (S3, GCS, etc.) as the appropriate datastore for raster data, using cloud native formats like COG and Zarr. Best practices for blob store access encourage sending fewer, larger requests, which in turn makes caching a design best practice for reducing duplicate requests to the blob store. Cloud architectures follow the "scale-out" model, adding processes as workloads grow. GDAL's existing process-local caches become less efficient, in some cases dramatically, as the number of processes increases.

Solutions

There are of course many solutions to this problem, many of which are already available:

  1. This is use-case driven, like most things in this document, but imagine running zonal stats on Landsat imagery for all building footprints in the world. Knowing that Landsat is stored against a semi-regular grid, you could partition your building footprints ahead of time and distribute them across GDAL processes in a way that optimizes caching against the Landsat imagery. This works well but is bespoke and often difficult to build and scale.
  2. Implement your own caching on top of rasterio, or however else you consume GDAL. This works well in theory but less well in practice, as rasterio doesn't expose the raw bytes from the blob store, which rules out some of the more advanced caching strategies. It also requires a cache-aside strategy (see the sketch after this list), which is less efficient from a networking perspective and usually suffers from inelastic scaling. rasteret is a great example of a cache-aside strategy (disk) on top of rasterio, although it does not work across multiple machines.
  3. Add caching between your GDAL processes and the blob store. The most straightforward approach is deploying a CDN like AWS CloudFront, but this can be extremely expensive. An alternate approach is building your own forward proxy; see a POC here.
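To make option 2 concrete, here is a minimal cache-aside sketch on top of rasterio. The cache here is an in-process dict purely for illustration; a real deployment would use local disk (as rasteret does) or a shared store like Redis, and whole-object fetching only makes sense for small assets:

```python
import requests
import rasterio
from rasterio.io import MemoryFile

_cache = {}  # process-local; swap for disk or a shared store (e.g. Redis) in practice

def read_cached(url: str):
    """Cache-aside read: check the cache, fall back to the blob store, then populate the cache."""
    data = _cache.get(url)
    if data is None:
        data = requests.get(url, timeout=30).content  # whole-object fetch from the blob store
        _cache[url] = data
    with MemoryFile(data) as memfile:
        with memfile.open() as src:
            return src.read()
```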

GDAL could implement a distributed cache to alleviate the process-local caching problem at scale. Groupcache is the reference example for this sort of distributed cache filling, and I've found this blog post helpful in understanding how it works. A distributed cache-filling library like groupcache would, for example, let the VSI caches of GDAL processes spread across multiple machines share data with each other: a COG header requested by one process would be made available to another without going back to the blob store. Lastly, distributed caching is most effective on low-cardinality data and is therefore more important for the VSI cache (one header per image) and less important for the block cache (many blocks per image). I don't think there are any groupcache implementations in C/C++, and I can't comment on the level of effort to implement this in GDAL as I'm not familiar with the implementation of the underlying caches.
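To illustrate the groupcache pattern, here is a toy, in-process Python sketch. Real groupcache is a Go library that uses consistent hashing, HTTP peers, and singleflight deduplication; the peer selection, class names, and fake origin below are illustrative assumptions only:

```python
import hashlib

class Peer:
    """One GDAL-process-like node with its own local cache."""

    def __init__(self, name, peers, load_from_origin):
        self.name = name
        self.peers = peers                        # shared list of all peers (simulates the cluster)
        self.cache = {}                           # this peer's local cache
        self.load_from_origin = load_from_origin  # e.g. a ranged GET against the blob store

    def _owner(self, key):
        # Deterministic key -> peer mapping (groupcache uses consistent hashing).
        idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.peers)
        return self.peers[idx]

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        owner = self._owner(key)
        if owner is self:
            value = self.load_from_origin(key)    # only the owning peer hits the blob store
        else:
            value = owner.get(key)                # everyone else asks the owner (an RPC in reality)
        self.cache[key] = value
        return value

# Usage: a COG header requested by any peer is loaded from the origin exactly once.
fetch_count = {"n": 0}

def fake_origin(key):
    fetch_count["n"] += 1
    return ("header bytes for " + key).encode()

peers = []
for name in ("a", "b", "c"):
    peers.append(Peer(name, peers, fake_origin))

for p in peers:
    p.get("s3://bucket/scene.tif#header")

print(fetch_count["n"])  # 1: one origin request, regardless of how many peers asked
```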

An alternative solution is giving rasterio (and other client library) users more direct access to the underlying I/O, allowing them to implement their own caching strategy on top of GDAL. Pluggable I/O is very popular and is usually how libraries solve this sort of problem; many libraries don't implement caching at all and leave it entirely to the user, which is personally my preferred approach. I understand that rasterio recently released "custom openers" (see here) which provide exactly this, but it doesn't seem to work reliably and has issues at scale. For example, any Python objects used by the opener cannot be reused, which prevents the use of connection pooling against a cache. It also seems to be quite a bit slower than GDAL's native I/O, though I haven't benchmarked this explicitly. In general I worry about the feasibility of this approach given how much of GDAL's user base consumes the library through some other language wrapper.
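For illustration, this is roughly what the pluggable-I/O path looks like today, assuming rasterio's opener keyword (rasterio >= 1.4) accepts an fsspec filesystem as documented, and using fsspec's block-caching filesystem; the URL and cache path are hypothetical, and note the cache is still local to one machine's disk rather than distributed:

```python
import fsspec
import rasterio

# fsspec's "blockcache" filesystem caches byte ranges to local disk on top of HTTP range reads.
fs = fsspec.filesystem(
    "blockcache",
    target_protocol="https",
    cache_storage="/tmp/cog-cache",
)

url = "https://example.com/data/scene.tif"  # hypothetical COG
with rasterio.open(url, opener=fs) as src:  # GDAL's I/O is routed through the fsspec filesystem
    print(src.profile)
```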
