Created
December 5, 2025 13:23
-
-
Save johngray1965/5553e030b9a61ae095802a4b5956d04d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # From 981ns to 82ns: A UUIDv7 Optimization Story | |
| I recently read an article on how the random nature of standard UUIDs can kill database performance due to index fragmentation. In the comments, someone mentioned UUIDv7 was designed to solve that very problem by being time-ordered and sortable. | |
| Intrigued, I did a Google search for "kotlin uuidv7" and found an open-source implementation. I was currently using the standard `UUID.randomUUID()` (v4) combined with the `fast-uuid` library for fast string conversion. I swapped in the new UUIDv7 library, ran my benchmarks, and was immediately disappointed. It was slower. | |
| But the library was open-source. And that’s where our story begins. What followed was a deep dive into the guts of UUID generation, peeling back layers of abstraction and overhead to chase every last nanosecond. This is the journey of how we made a UUIDv7 generator **~12x faster** than the native `java.util.UUID` on Android. | |
| ### The Baseline: Where We Started | |
| All benchmarks were run on a Pixel 10 Pro using Jetpack Microbenchmark, giving us stable, real-world numbers on the Android Runtime (ART). | |
| The initial situation was bleak. Our "new and improved" ID was slower than the problem it was meant to solve. Profiling was the only way forward. | |
| | Implementation | Time (Total) | Speedup | | |
| | :--- |:-------------|:------------------| | |
| | Initial UUIDv7 | ~330 ns | **Slower** | | |
| | `java.util.UUID` (v4) | 981 ns | 1x (Baseline) | | |
| ### The Optimization Journey | |
| #### Round 1: The Obvious Culprit (`SecureRandom`) | |
| The first profile immediately pointed to the entropy source. The library was using `java.security.SecureRandom`. While essential for cryptography, it's incredibly slow for generating simple IDs due to OS-level entropy gathering and thread synchronization. | |
| * **The Fix:** Swap `SecureRandom` for `java.util.concurrent.ThreadLocalRandom`. | |
| * **The Result:** Performance instantly jumped from ~330ns to **~95ns**. A huge win, proving that for non-security use cases, the choice of random number generator (RNG) is critical. | |
| #### Round 2: Chasing Allocations | |
| At 95ns, the profiler started showing "death by a thousand cuts." The code was creating intermediate `ByteArray` and `Pair` objects. Each allocation adds GC pressure and slows down execution in a tight loop. | |
| * **The Fix:** Refactor the generator to work directly on primitive `Long`s, eliminating the `ByteArray` and `Pair`. | |
| * **The Result:** We shaved off another ~40ns, bringing the generation time down to **~55ns**. | |
| #### Round 3: The `toString()` Bottleneck | |
| With generation optimized, the new bottleneck was converting the UUID to a `String`. A benchmark of the native `java.util.UUID.toString()` on Android revealed a shocking truth: it took **~675ns** on its own. | |
| Why? Unlike the desktop JVM which uses a fast native intrinsic, the AOSP implementation is often pure Java and creates **at least 7 temporary objects** (5 strings for the parts, a `StringBuilder`, and the final `String`) for every single call. | |
| * **The Fix:** Write a custom `fastUuidString` that uses a pre-allocated `CharArray` from a `ThreadLocal`. This avoids all intermediate allocations. | |
| * **The Result:** Our custom formatting logic clocked in at a mere **~48ns**. A **14x speedup** over the native method. | |
| #### Round 4: The Final Squeeze | |
| The last step was pure micro-optimization. The profiler showed that even the 32 `Long` to `Int` conversions inside the formatting function were adding up. | |
| * **The Fix:** Split the two 64-bit `Long`s (`msb`, `lsb`) into four 32-bit `Int`s *once* at the start of the function. All subsequent bit-shifting operates on the faster `Int` primitives. | |
| * **The Result:** This shaved off the final few nanoseconds, giving us our final, blazing-fast implementation. | |
| ### The Final Code | |
| Here is the culmination of that journey. It combines a single `AtomicLong` for monotonic state, zero-allocation generation logic, and the hyper-optimized `toString` implementation. | |
| kotlin @Suppress("MagicNumber") object UUIDv7 { private val HEX_DIGITS = charArrayOf('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f') | |
| private val state = AtomicLong(0L) | |
| private val threadLocalBuffer = | |
| object : ThreadLocal<CharArray>() { | |
| override fun initialValue() = CharArray(36) | |
| } | |
| fun randomUUIDString(): String = generate { msb, lsb -> fastUuidString(msb, lsb) } | |
| private inline fun <T> generate(block: (Long, Long) -> T): T { | |
| val provider = ThreadLocalRandom.current() | |
| val now = System.currentTimeMillis() | |
| var ts: Long | |
| var newState: Long | |
| while (true) { | |
| val current = state.get() | |
| val lastTs = current ushr 16 | |
| val lastSeq = current and 0xFFF | |
| ts = if (now >= lastTs) now else lastTs | |
| val sequenceOverflow = lastSeq == 0xFFFL | |
| val needsRandom = (ts != lastTs) || sequenceOverflow | |
| val nextSeq = | |
| if (needsRandom) { | |
| if (sequenceOverflow && ts == lastTs) { | |
| ts++ | |
| } | |
| provider.nextInt(4096).toLong() | |
| } else { | |
| lastSeq + 1 | |
| } | |
| newState = (ts shl 16) or 0x7000L or nextSeq | |
| if (state.compareAndSet(current, newState)) { | |
| break | |
| } | |
| } | |
| val msb = newState | |
| val randomLow = provider.nextLong() | |
| val lsb = (randomLow ushr 2) or Long.MIN_VALUE | |
| return block(msb, lsb) | |
| } | |
| private val state = AtomicLong(0L) | |
| private val threadLocalBuffer = | |
| object : ThreadLocal<CharArray>() { | |
| override fun initialValue() = CharArray(36) | |
| } | |
| fun randomUUIDString(): String = generate { msb, lsb -> fastUuidString(msb, lsb) } | |
| private inline fun <T> generate(block: (Long, Long) -> T): T { | |
| val provider = ThreadLocalRandom.current() | |
| val now = System.currentTimeMillis() | |
| var ts: Long | |
| var newState: Long | |
| while (true) { | |
| val current = state.get() | |
| val lastTs = current ushr 16 | |
| val lastSeq = current and 0xFFF | |
| ts = if (now >= lastTs) now else lastTs | |
| val sequenceOverflow = lastSeq == 0xFFFL | |
| val needsRandom = (ts != lastTs) || sequenceOverflow | |
| val nextSeq = | |
| if (needsRandom) { | |
| if (sequenceOverflow && ts == lastTs) { | |
| ts++ | |
| } | |
| provider.nextInt(4096).toLong() | |
| } else { | |
| lastSeq + 1 | |
| } | |
| newState = (ts shl 16) or 0x7000L or nextSeq | |
| if (state.compareAndSet(current, newState)) { | |
| break | |
| } | |
| } | |
| val msb = newState | |
| val randomLow = provider.nextLong() | |
| val lsb = (randomLow ushr 2) or Long.MIN_VALUE | |
| return block(msb, lsb) | |
| } | |
| fun fastUuidString(msb: Long, lsb: Long): String { | |
| val buf = threadLocalBuffer.get()!! | |
| val digits = HEX_DIGITS | |
| val msbHi = (msb ushr 32).toInt() | |
| val msbLo = msb.toInt() | |
| val lsbHi = (lsb ushr 32).toInt() | |
| val lsbLo = lsb.toInt() | |
| buf[0] = digits[(msbHi ushr 28) and 0xF] | |
| buf[1] = digits[(msbHi ushr 24) and 0xF] | |
| buf[2] = digits[(msbHi ushr 20) and 0xF] | |
| buf[3] = digits[(msbHi ushr 16) and 0xF] | |
| buf[4] = digits[(msbHi ushr 12) and 0xF] | |
| buf[5] = digits[(msbHi ushr 8) and 0xF] | |
| buf[6] = digits[(msbHi ushr 4) and 0xF] | |
| buf[7] = digits[msbHi and 0xF] | |
| buf[8] = '-' | |
| buf[9] = digits[(msbLo ushr 28) and 0xF] | |
| buf[10] = digits[(msbLo ushr 24) and 0xF] | |
| buf[11] = digits[(msbLo ushr 20) and 0xF] | |
| buf[12] = digits[(msbLo ushr 16) and 0xF] | |
| buf[13] = '-' | |
| buf[14] = digits[(msbLo ushr 12) and 0xF] | |
| buf[15] = digits[(msbLo ushr 8) and 0xF] | |
| buf[16] = digits[(msbLo ushr 4) and 0xF] | |
| buf[17] = digits[msbLo and 0xF] | |
| buf[18] = '-' | |
| buf[19] = digits[(lsbHi ushr 28) and 0xF] | |
| buf[20] = digits[(lsbHi ushr 24) and 0xF] | |
| buf[21] = digits[(lsbHi ushr 20) and 0xF] | |
| buf[22] = digits[(lsbHi ushr 16) and 0xF] | |
| buf[23] = '-' | |
| buf[24] = digits[(lsbHi ushr 12) and 0xF] | |
| buf[25] = digits[(lsbHi ushr 8) and 0xF] | |
| buf[26] = digits[(lsbHi ushr 4) and 0xF] | |
| buf[27] = digits[lsbHi and 0xF] | |
| buf[28] = digits[(lsbLo ushr 28) and 0xF] | |
| buf[29] = digits[(lsbLo ushr 24) and 0xF] | |
| buf[30] = digits[(lsbLo ushr 20) and 0xF] | |
| buf[31] = digits[(lsbLo ushr 16) and 0xF] | |
| buf[32] = digits[(lsbLo ushr 12) and 0xF] | |
| buf[33] = digits[(lsbLo ushr 8) and 0xF] | |
| buf[34] = digits[(lsbLo ushr 4) and 0xF] | |
| buf[35] = digits[lsbLo and 0xF] | |
| return String(buf) | |
| } | |
| } | |
| ### The Final Tally | |
| | Implementation | Time (Total) | Allocations | Speedup | | |
| | :--- |:-------------|:------------|:------------------| | |
| | **Legere UUIDv7** | **82.0 ns** | **1** | **~11.9x Faster** | | |
| | `java.util.UUID` (v4) | 981 ns | ~20 | 1x (Baseline) | | |
| From a disappointing start to a result that's an order of magnitude faster than the platform default, this journey was a powerful reminder: for performance-critical code, you have to measure, profile on your target platform, and never be afraid to challenge the "standard" way of doing things. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment