Here's a negative result around compressing and caching ASTC parameters to reduce transfer overhead.
Taking an Adreno 650 as our reference:
- GPU - 1.2 TFLOPs (1200 GFLOPs)
- CPU SIMD int8 - 18 GFLOPs
- CPU single core - up to 2.2 GFLOPs
- DRAM bandwidth - 44 GB/s
- DRAM -> GPU load bandwidth - 44 GB/s (Adreno/mobile GPUs are all integrated with direct DRAM access)
- Disk -> DRAM load bandwidth - 2 GB/s
This creates a set of "frontiers" determining which aspect (compute, memory, disk) of an otherwise optimal compute kernel is the bottleneck:
- A kernel is (GPU) compute bound if it reaches more than ~28 FLOPs per byte of memory read, and it reaches more than 600 FLOPs per byte of disk read or written
- A kernel is memory bound if it cannot achieve ~28 FLOPs per byte of memory read, and it reaches more than 22 bytes of memory load for every byte of disk read
- A kernel is disk bound if it cannot achieve 600 FLOPs per byte of disk read and it cannot achieve 22 bytes of memory load for every byte of disk read
Notice the scaling here:
- GPU O(1200 GFLOPs)
- (~30x)>>> DRAM O(40 GB/s) ~ SIMD int8 O(20 GFLOPs)
- (~20x)>>> Disk O(2 GB/s) ~ CPU single core O(2 GFLOPs).
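As a sanity check, here's a minimal Python sketch of the roofline arithmetic behind those frontiers (the classifier and its argument names are just illustrative):

```python
# Derive the bottleneck "frontiers" from the Adreno 650 numbers above.
GPU_FLOPS = 1200e9  # 1.2 TFLOPs
DRAM_BW   = 44e9    # bytes/s (shared by CPU and the integrated GPU)
DISK_BW   = 2e9     # bytes/s

flops_per_dram_byte      = GPU_FLOPS / DRAM_BW  # ~27 -> the "~28 FLOPs/byte" frontier
flops_per_disk_byte      = GPU_FLOPS / DISK_BW  # 600 FLOPs per byte of disk
dram_bytes_per_disk_byte = DRAM_BW / DISK_BW    # 22 bytes of DRAM per byte of disk

def bottleneck(kernel_flops_per_mem_byte: float, mem_bytes_per_disk_byte: float) -> str:
    """Classify a kernel by its measured intensities (illustrative only)."""
    if kernel_flops_per_mem_byte >= flops_per_dram_byte:
        return "compute bound"
    if mem_bytes_per_disk_byte >= dram_bytes_per_disk_byte:
        return "memory bound"
    return "disk bound"
```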
This inherently unbalanced hierarchy is what makes disk caching vs. a raw compute shader such a hard tradeoff, because for caching to be worth it, you need to either:
- Compress the cached representation heavily enough that one byte read from disk expands into ~22 bytes in memory (effectively a 1:22 compression rate), so the disk->memory transfer stops being the bottleneck, OR
- Have a shader with such high arithmetic intensity that it needs to perform 600 FLOPs for each byte read.
Before tackling #1, I want to disprove #2. A simple astc transcoder works on 16 bits per pixel (per thread) for non-BC1 textures; against the 600 FLOPs-per-disk-byte frontier, that is a budget of ~1.2 KFLOPs per thread. Our transcoder is very simple, and typically finishes in far fewer FLOPs than that.
The transcoder thread will read (on average) a single pixel (16 bits encoded in BC) and write a single pixel (16 bits encoded in astc). This gives us a budget of ~110 FLOPs before we're compute bound. As it stands, we're only slightly compute bound today (thanks largely to the complexity of astc block packing).
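Spelled out, using the frontiers above:

$(2_{\text{read}} + 2_{\text{write}})\ \text{bytes} \times \frac{1200\ \text{GFLOPs}}{44\ \text{GB/s}} \approx 4 \times 27 \approx 110$ FLOPs per thread (memory frontier)

$2\ \text{bytes} \times 600\ \text{FLOPs/byte} = 1200$ FLOPs per thread (disk frontier)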
Next, I will show why it's impossible to get the 1:22 compression (or even to get close to 1:2) needed to make disk caching worth it.
NOTE: this math is different for the CPU SIMD units, which only reach about 18 GFLOPs: their compute frontier sits at under 1 FLOP per byte of DRAM read (~9 FLOPs per byte of disk).
I want to create a file format for caching the parameters of predetermined astc images (where every block uses a uniform block mode, described below). This format should reduce the amount of file IO and memory transfer, so we want to compress the object while keeping the code reasonably memory-bound (i.e., not too much arithmetic intensity for a typical ARM/Adreno iGPU):
- Only supports 4x4 rgb_ldr and rgba_ldr (single plane) blocks with the full color endpoint range, but weights of up to 4 bits
- Is not a direct astc format, but contains all of the necessary parameters to pack into the equivalent .astc object
Parameters:
- 2 endpoints per block, 8 bits per component (full ldr range): 2 x 4 x 8 = 64 bits for rgba
- 16 weights per block, up to 4 bits each: 64 bits
Natively, this block can be packed into 128 bits (16 bytes).
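As a concrete illustration, here's a hypothetical packing consistent with that 128-bit figure (the field order is my assumption, not the actual .astc bit layout):

```python
def pack_block(ep0: list[int], ep1: list[int], weights: list[int]) -> bytes:
    """Pack one block's parameters into 128 bits.
    ep0/ep1: 4 rgba components each, 8 bits per component (64 bits total);
    weights: 16 values, 4 bits each (64 bits total)."""
    assert len(ep0) == len(ep1) == 4 and len(weights) == 16
    bits = 0
    for v in ep0 + ep1:            # 8 components x 8 bits = 64 bits of endpoints
        bits = (bits << 8) | (v & 0xFF)
    for w in weights:              # 16 weights x 4 bits = 64 bits of weights
        bits = (bits << 4) | (w & 0xF)
    return bits.to_bytes(16, "little")
```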
BC1 can be transcoded into a simpler format:
- 2 rgb endpoints at 5:6:5 (16 bits each): 32 bits
- 16 weights at 2 bits each: 32 bits
Natively, this block can be packed into 64 bits (8 bytes), same as, well, BC1.
In general, can we do better than this?
We could pack this as the final astc block with the first 17 bits of redundant information (the block mode, partition count, and CEM, which are all fixed here) removed. This takes us down to 111 bits per block, a marginal improvement, at the cost of significantly increased packing complexity.
We're not going to consider this option at the moment.
Let's break the components up into separate streams: the rgb endpoint components, the alpha endpoint components, and the weights.
We can think of a set of blocks encoded within a single shared subgroup (say 64 blocks, or 1024 pixels) as a walk over each of the components.
Let's start with the sequence of red color endpoint components across the subgroup. We expect some spatial locality here: adjacent blocks tend to carry similar endpoint values, so a typical walk over the component moves in small steps. Let's say we delta-encode that walk; the deltas then cluster around zero, and as long as the frequency of small deltas dominates, an entropy code over the deltas beats storing the raw 8-bit values.
Note that for RGBA, each endpoint component contributes 64 uint8s per subgroup, so a per-subgroup dictionary has a reasonable amount of data to amortize over.
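A minimal sketch of that walk view (the toy values are illustrative):

```python
from collections import Counter

def delta_walk(component: list[int]) -> list[int]:
    """Delta-encode one endpoint component's walk across a subgroup:
    keep the first absolute value, then consecutive differences."""
    return [component[0]] + [b - a for a, b in zip(component, component[1:])]

# With spatial locality, the delta histogram spikes near 0, which is
# exactly what makes an entropy code over the deltas pay off.
red_walk = [120, 121, 121, 119, 124, 124, 108, 110]
deltas = delta_walk(red_walk)
print(deltas)               # [120, 1, 0, -2, 5, 0, -16, 2]
print(Counter(deltas[1:]))  # small magnitudes dominate
```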
For simplicity, we can choose a huffman code with a per-subgroup (64 blocks) shared dictionary/tree (see the sketch after this list) for:
- The RGB components, since they share very similar structures
- A distinct alpha channel, since it's likely to be a single constant value for most subgroups
- A distinct weight channel, since this is the most variable aspect of this whole scheme
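Here's a sketch of the per-subgroup tree construction (a textbook heapq Huffman build; the three-stream split mirrors the list above):

```python
import heapq
from collections import Counter

def huffman_code(stream: list[int]) -> dict[int, str]:
    """Build a Huffman code for one subgroup's symbol stream."""
    freq = Counter(stream)
    if len(freq) == 1:  # degenerate stream, e.g. constant alpha -> {value: ''}
        return {next(iter(freq)): ""}
    # Heap entries: (count, tiebreaker, symbols-under-this-subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in freq}
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, syms0 = heapq.heappop(heap)
        n1, _, syms1 = heapq.heappop(heap)
        for s in syms0: codes[s] = "0" + codes[s]   # left branch
        for s in syms1: codes[s] = "1" + codes[s]   # right branch
        heapq.heappush(heap, (n0 + n1, tiebreak, syms0 + syms1))
        tiebreak += 1
    return codes

# One dictionary per stream, per 64-block subgroup:
#   rgb_codes   = huffman_code(rgb_deltas)      # r, g, b share one tree
#   alpha_codes = huffman_code(alpha_deltas)    # usually a single value
#   w_codes     = huffman_code(weight_stream)   # the most variable stream
```

The degenerate single-symbol case is what produces the empty alpha codebook ({0: ''}) in the report below.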
Since a group/chunk of blocks is no longer deterministically mapped to a fixed file offset, we also need to include a central directory (cd) lookup.
We'll go with the naive approach of just specifying an absolute 32-bit offset for each group at the front of the file.
The aim is to do this process in subgroups of 64 threads on an iGPU (ARM/Adreno). Each subgroup will compute its subgroup id, then find its group's start offset in the file (already mapped into memory), and then perform either the encoding (from astc/bc1 weights into the .fastc group object) or the decoding (from the .fastc group object into astc).
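A host-side Python mock of that per-subgroup flow (GROUP_SIZE, the 32-bit cd entries, and the decode step are assumptions standing in for the real shader):

```python
import struct

GROUP_SIZE = 64  # blocks per subgroup

def group_offset(mapped: bytes, group_id: int) -> int:
    """Naive central directory: one absolute 32-bit (little-endian)
    offset per group, stored at the front of the file."""
    return struct.unpack_from("<I", mapped, 4 * group_id)[0]

def process_group(mapped: bytes, group_id: int, num_groups: int) -> bytes:
    """Slice out one group's compressed bytes; the decode step (not shown)
    would undo the delta + huffman coding into GROUP_SIZE astc blocks."""
    start = group_offset(mapped, group_id)
    end = (group_offset(mapped, group_id + 1)
           if group_id + 1 < num_groups else len(mapped))
    return mapped[start:end]
```

Running a quick prototype of the whole scheme over a test image reports: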
Using device: cuda
Image processed into 129600 blocks.
Processing 100 groups of 64 blocks...
--- Compression Efficacy Report ---
Original Image Blocks: 6400
Original ASTC Size: 100.00 KB
-----------------------------------
Compressed Size Breakdown:
- Huffman Data: 67.72 KB
- Huffman Trees: 7.35 KB
- Huffman RGB Data: 18.67 KB
- Huffman RGB Trees: 4.65 KB
- Huffman W Data: 49.06 KB
- Huffman W Trees: 2.54 KB
- Absolute Values: 3.54 KB
- Central Directory: 0.39 KB
-----------------------------------
Total Compressed Size: 79.00 KB
Compression Ratio: 1.27:1
SAVING: 21.00%
RGB Codebook: [(0, '00'), (1, '011'), (-2, '1010'), (-1, '1111'), (2, '1011'), (3, '0100'), (4, '1001'), (16, '1101'), (-4, '11000'), (-3, '10001'), (6, '01011'), (-8, '110010'), (-6, '010100'), (5, '111011'), (7, '010101'), (10, '110011'), (-15, '1000000'), (-10, '1000001'), (-7, '1110011'), (-5, '1110100'), (8, '1000010'), (9, '1000011'), (11, '1110101'), (13, '1110000'), (-14, '11100010'), (-9, '11100100'), (14, '11100101'), (-11, '111000110'), (15, '111000111')]
RGB Freq: Counter({0: 70, 1: 44, -1: 37, 16: 30, 2: 26, -2: 25, 4: 24, 3: 19, -4: 13, -3: 12, 6: 11, 5: 8, 10: 7, -8: 7, 7: 5, -6: 5, 11: 4, -5: 4, -7: 4, 8: 3, -10: 3, -15: 3, 13: 3, 9: 3, -9: 2, -14: 2, 14: 2, 15: 1, -11: 1})
Alpha Codebook: {0: ''}
Alpha Freq: Counter({0: 126})
Weight Codebook: [(0, '0001'), (1, '0110'), (2, '0011'), (3, '0100'), (4, '0111'), (5, '1000'), (6, '0101'), (7, '1011'), (8, '1110'), (9, '1111'), (10, '1100'), (11, '1010'), (12, '1001'), (13, '1101'), (14, '0000'), (15, '0010')]
Weight Freq: Counter({9: 83, 8: 82, 13: 80, 10: 78, 7: 68, 11: 66, 12: 65, 5: 63, 4: 61, 1: 59, 6: 57, 3: 57, 2: 56, 15: 55, 0: 49, 14: 45})
Weight deltas: [2, -1, -1, -1, -3, -1, 0, -4, 0, 2, -2, -1, 0, 0, 0, 14, -2, -1, 0, -4, 2, 2, 0, -8, 2, 2, 0, -5, 0, -1, 0, 11, 0, -3, 2, -7, 3, 0, 3, -5, 0, 0, -2, -1, 1, 0, -2, 11, 2, -3, 0, -2, 0, 0, 0, -4, 0, 0, 0, -3, 0, 0, -3, 15, 0, 0, 0, 0, 0, 0, -5, 0, 0, -5, 0, 0, 0, 0, -5, 14, 0, 0, 0, -4, 0, 0, 0, -5, 0, 0, 0, -4, 0, 0, 0, 11, 0, 0, -2, 3, -3, 0, 0, -1, 0, 0, -4, -5, 2, 0, 0, 13, 0, 0, 0, -6, 0, 0, -6, -1, 0, 0, 0, 0, 3, 0, 0, 6, -1, -3, 0, 0, 0, -3, 2, -2, 0, 1, 4, -3, 0, 0, 0, 3, -3, 2, -3, 4, 0, 2, 0, -1, 0, 0, 0, -3, 0, 0, 0, 0, 0, -4, 0, 1, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, -4, -4, 0, 0, 6, -5, 0, 6, 0, -2, 0, 0, 0, 0, 0, 0, 0, -7, 0, 3, 7, -6, 0, 4, 3, -1, 0, 1, 0, -4, 0, -3, 1, 4, 0, 0, 3, -3, 3, 0, 0, -6, 6, 0, 0, -13, 1, 2, 0, 9, 0, -1, 1, -2, 0, 2, 1, -7, -2, 3, 2, -10, 1, 1, 3, 3, -1, -3, -3, 14, 0, -3, -4, 3, 4, 0, -5, -6, 4, 0, 0, -7, 2, 2, 2, -5, 1, 4, 3, -8, 0, 4, 4, -2, 0, 3, 4, -15, 3, 5, 0, -1, 2, 2, 3, -7, 0, 2, 4, 0, -4, 2, 3, -3, -1, 0, 1, -4, 1, 0, 0, -8, 2, 2, 2, 1, 2, 4, 0, -4, -4, 4, 4, -9, 0, 3, 7, -13, 3, 4, 3, -3, 0, 3, 0, -6, -2, 0, 3, -5, 5, 1, -5, 4, -2, -1, 6, -2, -3, -1, 12, 0, -2, -6, 1, -2, -4, 0, 6, -2, -6, 1, 3, 3, -7, 1, 3, 11, -11, -1, 9, 2, -13, -1, 9, -5, -2, 2, 4, -3, 2, 2, 0, 5, -1, 1, 1, -3, 0, -2, 0, -2, -1, -6, 0, 7, -3, -3, 6, 1, 4, -6, 2, 1, 4, -3, 2, 1, -4, 2, -1, 4, -14, 6, -7, 11, -4, 1, 1, 6, -2, 0, 0, -3, 3, -1, -1, -11, 12, 1, 1, -14, 1, 0, 0, 9, -1, -1, -3, 6, 0, 0, -1, 4, 0, 1, -1, -11, -2, -1, 0, 6, -2, -1, 0, 8, 0, 1, 0, 3, 0, 0, 0, -14, 0, 1, 1, 0, 0, 1, 1, 8, -1, -1, -1, 5, 0, -1, -1, -9, -1, -1, -2, 6, 0, 0, 0, 5, 1, 2, 0, 1, -1, -2, 0, -11, -1, 1, -1, 3, 0, 0, -1, 7, -1, 0, -1, 8, -1, -1, -2, -11, 1, 1, 0, 1, -1, 1, -1, 6, 0, -2, 0, 9, -3, -2, -2, -7, 0, 5, 4, -7, 0, 4, 4, -5, 0, 0, 5, -8, 0, 6, 2, -9, 0, 5, 0, -2, 0, 4, 0, -7, 5, 2, 4, -10, 0, 7, 0, -10, 3, 3, 0, -3, 0, 6, 3, -9, 6, 3, 0, -6, 6, 3, 0, -15, 0, 5, 4, -9, 0, 5, 5, -10, 5, 5, 5, -10, 5, 0, 5, -14, 9, 3, -4, -2, 0, -3, 0, 9, -7, -6, 3, 2, 2, 4, 1, -1, 0, 2, 0, 0, 2, -2, -5, 4, 1, -5, -6, 6, -5, -3, 2, 7, -1, -4, 2, 1, 0, -4, 0, -2, 5, -1, -1, -3, 11, 3, -4, -3, -1, -3, -4, 10, 5, -4, -8, 5, 5, -1, -5, 1, -1, 2, 0, 1, -1, -7, 1, 6, 4, -6, -7, 4, 11, -2, -10, 5, 0, -2, -2, 4, 0, 0, 3, -10, 6, 1, -1, -7, 9, 3, -2, -9, 8, 4, 1, 0, -1, 0, -1, 1, 0, 0, 0, -6, 1, 1, 4, -13, 1, 4, 6, -7, 0, 8, 0, -3, -4, 3, 4, -3, -4, -1, 4, 1, -4, -4, 3, 1, 0, 0, 0, 10, 0, 0, -5, 0, 0, 0, 0, -10, 0, 0, 0, 13, 0, 0, 0, -1, 0, -4, 0, 4, -4, 0, 0, -8, 0, 0, 0, 14, -2, -3, -2, 2, 0, -4, 0, 2, 5, -2, -7, 1, 6, -3, -6, 11, 0, 1, 1, -2, 1, 0, 0, 0, 2, -3, -5, 6, 2, -6, -9, 12, -3, 2, 2, -5, 5, -1, -11, 10, -9, -2, 6, 4, -3, 0, 2, 3, -6, -2, 5, -3, 6, -5, -7, 14, -2, -4, 1, 4, -3, -3, 2, 2, -7, -3, 4, -4, 0, 6, 7, -5, 1, 3, -3, 3, -1, 2, -1, -12, 0, 4, 4, 0, -4, 2, 0, -4, 5, 7, -2, -8, 0, 2, -4, 6, -1, 1, 2, -8, -2, 1, 5, -5, -1, 1, 2, 3, 4, 1, -4, 4, -4, 1, 5, -2, -1, -2, 1, -2, 3, -2, -3, -3, 2, 4, 0, 0, 2, 2, 2, -10, 2, 1, 2, -7, 1, 2, 0, -4, 0, 1, -2, 14, -3, -2, -2, 2, -2, -4, -2, 6, -2, -4, 0, 1, 1, 0, -2, 11, 0, -3, -3, 7, 0, 0, 0, -13, 3, 4, 0, -4, 0, 0, 2, -2, -3, 0, 0, 12, 1, -3, 0, -6, 0, 0, 0, -3, 4, 3, 0, 6, -3, -3, 0, -7, 3, 3, 3, -9, 9, 3, 0, -12, 5, 3, -9, 9, 3, 2, -2, -2, 2, 0, -3, 0, 1, -2, -4, -5, 3, 0, -2, 4, 0, 1, 4, -2, 0, 2, 5, -8, 1, 1, 1, -10, 1, 3, -1, 7, 1, 0, -1, 4, 1, 0, -2, -4, -1, -1, -1, -4, -2, 0, 1, 12, 0, 0, 0, -4, 0, 2, 0, -5, -1, 2, -1, -5, 0, -1, 0, 13, -2, 0, -1, 1, -1, -2, 0, -3, -1, 0, 0, -4, 0, 1, 1]
- The good - the color endpoints (256 possible values) are very compressible, usually seeing a ~50% size reduction from the combination of delta encoding and huffman coding
- In fact, if you forego huffman coding, a plain 4-bit code covering the top 15 highest-frequency deltas accounts for ~82% of your components; the rest take a 4-bit escape plus the raw 8-bit value (12 bits; see the sketch after this list). So the math works out to be
$4 \times 0.82 + 12 \times 0.18 \approx 5.5$ bits per component, for a 32% reduction in size.
- The bad - the weights (16 values) are generally not very compressible. As raw weights they're roughly uniformly distributed when plotted as a histogram; as delta encodings, beyond a spike at 0, they're still roughly uniform. As a result, the best you can generally do is encode them directly as packed 4-bit fields.
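Here's a sketch of that fixed frequency code (reserving the nibble 0b1111 as the escape is my assumption behind the 12-bit figure):

```python
def freq_encode(delta: int, top15: list[int]) -> str:
    """4-bit index into the 15-entry table, or the escape nibble 1111
    followed by the raw 8-bit delta (12 bits total)."""
    assert len(top15) == 15  # indices 0..14; 15 (0b1111) is the escape
    if delta in top15:
        return format(top15.index(delta), "04b")
    return "1111" + format(delta & 0xFF, "08b")
```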
So, with an optimal huffman encoder for the color endpoints, you can expect about a 22% size improvement (with ~3% of overhead outside the huffman coding itself). With the suboptimal frequency code, you can expect about a 15% improvement, essentially the same saving as the 111-bit repacking above, but with a lot more arithmetic intensity.
In the limit, this does not represent a big enough improvement in transfer size to justify offloading the intermediate astc parameters to a disk cache.
Because the actual decode + encode (BC6/7) / transcode (BC1) pipeline is already so fast, a ~25% reduction in the cached astc payload does not eliminate the IO stall, and the cached path will likely underperform (significantly) the disk-less, GPU-accelerated real-time transcoding.