This document explains the key differences when writing files with s3:// vs s3a:// prefixes in Apache Flink, focusing on the Hadoop and Presto S3 connectors.
- `s3a://`: Only registered by Hadoop S3A (via `S3AFileSystemFactory`)
- `s3://`: Registered by both Hadoop S3A and Presto S3 (conflict situation)
- `s3p://`: Only registered by Presto S3 (via `S3PFileSystemFactory`)
Using s3a://bucket/path:
- Always uses Hadoop S3A implementation
- No ambiguity, even if both plugins are loaded
- Uses Hadoop's native `S3AFileSystem` under the hood
Using s3://bucket/path (when both plugins loaded):
- Unpredictable behavior - depends on plugin loading order
- Last loaded filesystem factory wins and overwrites the previous registration
- Could be either Hadoop S3A or Presto S3 depending on classpath order
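The "last one wins" behavior above can be sketched with a plain map. This is an illustrative standalone model, not Flink's actual code; the factory names are hypothetical stand-ins for the real factory classes:

```java
import java.util.HashMap;
import java.util.Map;

public class SchemeRegistry {
    // Simulates Flink's FS_FACTORIES map: later registrations
    // silently overwrite earlier ones for the same scheme.
    static final Map<String, String> FS_FACTORIES = new HashMap<>();

    static void register(String scheme, String factoryName) {
        FS_FACTORIES.put(scheme, factoryName); // last one wins on conflict
    }

    public static void main(String[] args) {
        // Hadoop plugin loads first and claims both schemes...
        register("s3", "HadoopS3FileSystemFactory");
        register("s3a", "S3AFileSystemFactory");
        // ...then the Presto plugin loads and overwrites "s3".
        register("s3", "PrestoS3FileSystemFactory");
        register("s3p", "S3PFileSystemFactory");

        System.out.println(FS_FACTORIES.get("s3"));  // PrestoS3FileSystemFactory
        System.out.println(FS_FACTORIES.get("s3a")); // S3AFileSystemFactory
    }
}
```

Reversing the plugin load order would leave `s3` pointing at the Hadoop factory instead, which is exactly why the `s3://` scheme is unpredictable.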
Hadoop S3A (s3a:// or winning s3://):
```
// Configuration prefixes: {"s3.", "s3a.", "fs.s3a."}
// Maps to Hadoop config with "fs.s3a." prefix
s3.access-key → fs.s3a.access.key
s3.secret-key → fs.s3a.secret.key
s3.connection.maximum → fs.s3a.connection.maximum
s3a.buffer.dir → fs.s3a.buffer.dir
```

Presto S3 (s3p:// or winning s3://):
```
// Configuration prefixes: {"s3.", "presto.s3."}
// Maps to Hadoop config with "presto.s3." prefix
s3.access-key → presto.s3.access.key
s3.secret-key → presto.s3.secret.key
presto.s3.ssl → presto.s3.ssl
```

When both plugins are loaded and you use the ambiguous s3:// scheme:
- Generic `s3.*` configs apply to whichever implementation wins
- Implementation-specific configs may be ignored if the wrong implementation is selected
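The key-mirroring tables above follow a simple pattern: strip a recognized Flink prefix, normalize `-` to `.`, and re-prefix with the target Hadoop namespace. The sketch below re-implements that pattern for the Hadoop S3A case; it is an illustrative model, not Flink's actual translation code:

```java
import java.util.HashMap;
import java.util.Map;

public class S3ConfigMirror {
    // Prefixes the Hadoop S3A factory recognizes (per the list above).
    static final String[] FLINK_PREFIXES = {"s3.", "s3a.", "fs.s3a."};
    static final String HADOOP_PREFIX = "fs.s3a.";

    /** Mirror Flink-style keys onto the Hadoop config namespace. */
    static Map<String, String> mirror(Map<String, String> flinkConfig) {
        Map<String, String> hadoopConfig = new HashMap<>();
        for (Map.Entry<String, String> e : flinkConfig.entrySet()) {
            for (String prefix : FLINK_PREFIXES) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the Flink prefix, normalize '-' to '.', re-prefix.
                    String suffix = e.getKey().substring(prefix.length()).replace('-', '.');
                    hadoopConfig.put(HADOOP_PREFIX + suffix, e.getValue());
                    break;
                }
            }
        }
        return hadoopConfig;
    }

    public static void main(String[] args) {
        Map<String, String> flink = new HashMap<>();
        flink.put("s3.access-key", "AKIA_EXAMPLE"); // placeholder value
        flink.put("s3a.buffer.dir", "/tmp/s3a");
        // s3.access-key  -> fs.s3a.access.key
        // s3a.buffer.dir -> fs.s3a.buffer.dir
        System.out.println(mirror(flink));
    }
}
```

The Presto factory follows the same shape with `{"s3.", "presto.s3."}` as prefixes and `presto.s3.` as the target namespace.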
Both implementations use the same base architecture:
- Both extend `FlinkS3FileSystem` → `HadoopFileSystem`
- Both use `S3RecoverableWriter` for exactly-once semantics
- Both support multipart uploads with a 5 MB minimum part size
- Both support entropy injection for sharding
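Entropy injection replaces a configured marker in the path (set via Flink's `s3.entropy.key` and `s3.entropy.length` options) with random characters, so that checkpoint files spread across S3 key prefixes. The function below is an illustrative re-implementation of that idea, not Flink's actual code:

```java
import java.util.Random;

public class EntropyInjector {
    /**
     * Replace the entropy marker in the path with random characters.
     * Paths without the marker are returned unchanged.
     */
    static String inject(String path, String entropyKey, int length, Random rnd) {
        if (!path.contains(entropyKey)) {
            return path;
        }
        String alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        StringBuilder entropy = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            entropy.append(alphabet.charAt(rnd.nextInt(alphabet.length())));
        }
        return path.replace(entropyKey, entropy.toString());
    }

    public static void main(String[] args) {
        String p = inject("s3a://bucket/checkpoints/_entropy_/chk-42",
                          "_entropy_", 4, new Random());
        System.out.println(p); // e.g. s3a://bucket/checkpoints/k3x9/chk-42
    }
}
```

Because S3 historically partitioned request throughput by key prefix, scattering the prefix this way avoids hot-spotting a single partition under heavy checkpoint traffic.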
Presto S3 has custom deletion logic:
```java
// FlinkS3PrestoFileSystem.java
@Override
public boolean delete(Path path, boolean recursive) throws IOException {
    if (recursive) {
        deleteRecursively(path); // Custom workaround for Presto bug
    } else {
        deleteObject(path);
    }
    return true;
}
```

Hadoop S3A uses standard Hadoop deletion:
- Delegates directly to Hadoop's `S3AFileSystem.delete()`
- More mature deletion handling with better error recovery
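Since S3 has no true directories, a "recursive delete" boils down to listing every key under the prefix and deleting each object. The sketch below models that shape against an in-memory stand-in for an object store; it is illustrative only and not the Presto connector's actual implementation:

```java
import java.util.Map;
import java.util.TreeMap;

public class RecursiveDelete {
    /**
     * Delete every "object" whose key starts with the given prefix.
     * In a real connector this would be a paged LIST call followed by
     * (batched) DELETE calls against the S3 API.
     */
    static void deleteRecursively(Map<String, byte[]> store, String prefix) {
        store.keySet().removeIf(key -> key.startsWith(prefix));
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new TreeMap<>();
        store.put("checkpoints/chk-1/meta", new byte[0]);
        store.put("checkpoints/chk-1/data", new byte[0]);
        store.put("checkpoints/chk-2/meta", new byte[0]);
        deleteRecursively(store, "checkpoints/chk-1/");
        System.out.println(store.keySet()); // [checkpoints/chk-2/meta]
    }
}
```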
- `s3a://`: ✅ Supports the legacy `FileSystem` connector (streaming file sources/sinks)
- `s3p://`: ❌ Does NOT support the `FileSystem` connector (must use `FileSink`)
- No significant performance difference for basic file writing operations
- Both use the same multipart upload strategy
- Both support the same entropy injection patterns
- `s3a://`: Benefits from mature Hadoop S3A testing and bug fixes
- `s3p://`: May encounter Presto-specific edge cases (hence the custom deletion logic)
- `s3a://`: Better compatibility with existing Hadoop S3 configurations
- `s3p://`: Requires Presto-specific configuration knowledge
```yaml
# DON'T: Ambiguous scheme (unpredictable behavior)
state.checkpoints.dir: s3://my-bucket/checkpoints
state.savepoints.dir: s3://my-bucket/savepoints

# DO: Explicit scheme selection
state.checkpoints.dir: s3p://my-bucket/checkpoints  # Presto (recommended for checkpointing)
state.savepoints.dir: s3a://my-bucket/savepoints    # Hadoop (if using the FileSystem connector)

# Universal configs (apply to both)
s3.access-key: YOUR_ACCESS_KEY
s3.secret-key: YOUR_SECRET_KEY
s3.endpoint: https://s3.amazonaws.com

# Hadoop-specific (only affects s3a://)
s3a.connection.maximum: 100
s3a.multipart.size: 104857600

# Presto-specific (only affects s3p://)
presto.s3.ssl.enabled: true
presto.s3.connect-timeout: 5s
```

Hadoop S3A (flink-s3-fs-hadoop) registers TWO factories:
- `S3FileSystemFactory` for the `s3://` scheme
- `S3AFileSystemFactory` for the `s3a://` scheme
Presto S3 (flink-s3-fs-presto) registers TWO factories:
- `S3FileSystemFactory` for the `s3://` scheme
- `S3PFileSystemFactory` for the `s3p://` scheme
The FS_FACTORIES map is populated at initialization:
```java
for (FileSystemFactory factory : fileSystemFactories) {
    factory.configure(config);
    String scheme = factory.getScheme();
    FileSystemFactory fsf = ConnectionLimitingFactory.decorateIfLimited(factory, scheme, config);
    FS_FACTORIES.put(scheme, fsf); // Last one wins on conflict!
}
```

Key point: when multiple factories claim the same scheme, the last loaded factory wins and overwrites earlier entries.
| Configuration Key | Hadoop S3A | Presto S3 | Both s3:// |
|---|---|---|---|
| s3.access-key | YES | YES | Applied to both |
| s3.secret-key | YES | YES | Applied to both |
| s3.endpoint | YES | YES | Applied to both |
| s3a.connection.maximum | YES (via s3.connection.maximum) | NO | Hadoop only |
| presto.s3.ssl | NO | YES | Presto only |
| s3.path.style.access | YES (via s3.path-style-access) | YES | Applied to both |
Using s3a:// gives you:
- ✅ Predictable behavior (always Hadoop S3A)
- ✅ FileSystem connector support
- ✅ Mature Hadoop ecosystem integration
- ✅ Wider configuration compatibility
Using s3:// when both plugins loaded gives you:
- ⚠️ Unpredictable implementation selection
- ⚠️ Configuration may not apply correctly
- ⚠️ Potential runtime surprises
Recommendation: Always use explicit schemes (s3a:// or s3p://) when both plugins are present to avoid conflicts and ensure predictable behavior.