Three PRs have attempted to optimize Vitess streaming by reducing the overhead of MySQL-to-vtgate data transfer. PRs #15522 and #17135 showed no meaningful performance gains. PR #19620 achieves +5.6% QPS on OLTP-READONLY-OLAP and +35% QPS on TPCC-OLAP with statistical significance (p=0.000). This document analyzes why.
Embed raw MySQL packets directly into the protobuf `QueryResult.raw_packets` field, deferring parsing to the vtgate client via lazy `ParseResult()`. Added EWMA-based buffer size prediction for pre-allocation.
```
MySQL → readPacketAsProto() → QueryResult.raw_packets (protobuf bytes field)
      → gRPC (cached proto, skip re-marshal)
      → vtgate: mysql.ParseResult() lazily parses on access
```
Key design: Pack raw MySQL packets into the existing `QueryResult` protobuf message as a `repeated bytes raw_packets` field. The server builds the protobuf wire format directly using `protowire`, then caches the serialized bytes to skip re-marshaling at gRPC send time.
- Incomplete implementation: The server-side code explicitly disabled raw packets (`request.Options.RawMysqlPackets = false` with TODOs in multiple locations). Large packet handling was `panic("TODO: large packets")`. The PR was never in a runnable state — benchmarks were never generated.
- Same protobuf envelope: Raw packets were stuffed into the existing `QueryResult` message. Even though re-serialization was avoided via caching, the vtgate client still received a standard `StreamExecuteResponse` and had to detect and handle the `raw_packets` path. The protobuf framing overhead remained.
- Only targeted Execute, not StreamExecute: The `grpctabletconn` client changes only modified `Execute()`, `BeginExecute()`, `ReserveExecute()`, and `ReserveBeginExecute()`. The streaming RPCs (`StreamExecute` etc.) were untouched. Since the OLAP benchmarks use streaming, this PR wouldn't have moved the needle on those workloads even if completed.
- Added complexity without removing it: EWMA buffer prediction, `queryResultBuilder`, and the `vtcachedMessage` interface all added code paths but didn't eliminate the fundamental cost: protobuf framing. Each raw packet was still individually length-prefixed as a protobuf `bytes` field within `QueryResult`, adding per-packet overhead.
- Closed due to inactivity (auto-closed May 11, 2024 after 30 days stale). No reviewer engagement, no benchmarks.
PR #15522 was an exploratory WIP that never reached a testable state. Its core limitation was trying to optimize within the existing protobuf message structure rather than bypassing it entirely.
Read raw MySQL packets as `mem.Buffer` objects, prefix each with a protobuf tag, and send via gRPC's `mem.BufferSlice` zero-copy mechanism. The vtgate client parses raw packets from `ExecuteResponse.raw_packets`.
```
MySQL → readPacketAsMemBuffer() → mem.Buffer with proto tag prefix
      → gRPC mem.BufferSlice (zero-copy send)
      → vtgate: updateResFromRaw() parses raw_packets back to sqltypes.Result
```
Key design: Use gRPC's `mem.Buffer` API to avoid copying raw MySQL packets during serialization. Each packet is prefixed with 5 bytes (protobuf field tag + length) so gRPC can send the `BufferSlice` directly without marshaling.
- 40% allocation reduction, 0% throughput improvement: vmg's analysis (Nov 21 comment) was definitive — despite reducing allocation costs by 40% on the vttablet side, there was no QPS increase, no latency reduction, and no observable end-to-end benefit.
- The v2 codec had already captured the wins: vmg noted that PR #16790 (gRPC v2 codec) had already shipped weeks prior, and "most of the performance wins to be had with pooling were already covered" by it. The remaining allocations being eliminated were not memory ballast that triggered additional GC.
- Per-packet protobuf framing remained: Each raw MySQL packet was individually wrapped with a protobuf tag prefix. For a streaming query returning thousands of small rows (e.g., 100-byte rows), this means thousands of 5-byte prefixes plus individual `mem.Buffer` objects. The overhead of managing these many small buffers offset the savings from avoiding `sqltypes.Result` allocation.
- Parsing still happened on both sides: vttablet's `readPacketAsMemBuffer()` still needed to read and understand each packet to add the correct protobuf prefix. vtgate's `updateResFromRaw()` then parsed the raw packets into `sqltypes.Result`. The total parsing work across the system was roughly the same — it just moved from vttablet to vtgate.
- No batching: Each MySQL packet was sent as an individual `raw_packets` entry. A query with 10,000 rows meant 10,000+ entries in `raw_packets`. In contrast, PR #19620 batches packets into 256KB chunks, dramatically reducing per-row overhead.
- 16KB buffer pool too small: The `MemBufReader` used a 16KB default buffer pool. For typical streaming workloads that read many rows, this meant frequent pool interactions and potential fragmentation.
PR #17135 proved that reducing allocations on one side of the stack is not sufficient when the fundamental data flow architecture remains the same. The per-packet protobuf framing and lack of batching meant the optimization was swallowed by other costs.
Entirely new gRPC RPCs (`StreamExecuteRaw` etc.) that send raw MySQL wire protocol bytes as opaque 256KB chunks. vttablet does zero parsing — it reads raw bytes from MySQL's buffered reader directly into a pooled buffer and sends them over gRPC. vtgate parses the chunks once via `RawResultParser`.
```
MySQL → ReadHeaderInto/ReadDataInto into pooled 256KB buffer
      → gRPC: StreamExecuteRawResponse{raw: buf[:n]} (thin wrapper)
      → vtgate: RawResultParser.Feed(chunk) → sqltypes.Result
```
| Step | Old path | PR #17135 | PR #19620 |
|---|---|---|---|
| Read MySQL packet | `readEphemeralPacket()` → alloc per packet | `readPacketAsMemBuffer()` → alloc per packet | `ReadDataInto(buf[offset:])` → into pooled buffer |
| Parse fields | `readColumnDefinition()` → `querypb.Field` | Still parsed for proto framing | Skipped entirely |
| Parse rows | `parseRow()` → `[]sqltypes.Value` | Still parsed for proto framing | Skipped entirely |
| Serialize | vtproto marshal `StreamExecuteResponse` | `mem.Buffer` with proto tag prefix | `StreamExecuteRawResponse{raw: buf}` (trivial) |
PR #19620 eliminates ALL parsing and allocation on the vttablet side. The MySQL bytes flow from the kernel's TCP buffer → Go's buffered reader → the 256KB output buffer → gRPC. No `sqltypes.Result`, no `querypb.Field`, no per-row `[]sqltypes.Value`.
PR #17135 wrapped each MySQL packet individually in protobuf:
```
[proto tag (5B)][MySQL packet 1] [proto tag (5B)][MySQL packet 2] ...
```
For 10,000 rows of 100 bytes each: 10,000 proto tags = 50KB overhead, 10,000 mem.Buffer objects.
PR #19620 batches all packets into 256KB chunks:
```
[MySQL packet 1][MySQL packet 2]...[MySQL packet N]   // up to 256KB
```
For the same 10,000 rows: ~4 chunks, 4 gRPC messages, zero per-packet overhead.
The 256KB buffer is pooled via `sync.Pool`:

```go
var rawStreamBufPool = sync.Pool{New: func() any {
	b := make([]byte, rawStreamBufSize)
	return &b
}}
```

Across concurrent streaming queries, buffers are reused. Compare:

- Old path: `make([]byte, length)` per packet in `readOnePacket()`, plus `allocStreamResult()` from pool for result batching
- PR #17135: 16KB pool buffers, one per packet
- PR #19620: One 256KB buffer per query, pooled across queries
`streamQueryResultPackets` reads directly into the output buffer with minimal syscalls:

- Header read: `ReadHeaderInto(buf[bufOffset:])` — directly into the output buffer, no intermediate copy
- Payload read: a single `ReadDataInto(buf[bufOffset:bufOffset+packetLength])` in the fast path (payload fits in the remaining buffer)
- Context check only on error, not per-packet

The old path does `readHeaderFrom()` + `io.ReadFull()` per packet, plus `parseRow()` + `MakeTrusted()` + `appendRow()` + vtproto marshal. PR #19620 does `ReadHeaderInto()` + `ReadDataInto()` → done.
The `RawResultParser.Feed()` fast path processes the gRPC response bytes directly without copying:

```go
if len(p.buf) > 0 {
	p.buf = append(p.buf, chunk...) // Rare: leftover from split packet
	data = p.buf
} else {
	data = chunk // Common: process in-place, zero copy
}
```

This avoids a 256KB memcpy per gRPC message in the common case.
PRs #15522 and #17135 both added fields to existing proto messages (`QueryResult.raw_packets`, `ExecuteOptions.raw_mysql_packets`). This meant:
- Detection logic on both sides to check if raw packets were present
- Fallback codepaths for non-raw responses
- Both old and new fields in the same message
PR #19620 defines entirely new RPCs (`StreamExecuteRaw`, `BeginStreamExecuteRaw`, etc.) with dedicated request/response messages. No detection needed — if you call the raw RPC, you get raw bytes. Clean separation.
| Dimension | PR #15522 | PR #17135 | PR #19620 |
|---|---|---|---|
| Status | WIP, never runnable | Complete, benchmarked | Complete, benchmarked |
| Parsing on vttablet | Partial (field types needed) | Partial (proto tags needed) | None |
| Parsing on vtgate | Lazy, deferred | Full, from `raw_packets` | Full, from raw chunks |
| Batching | Per-packet in proto array | Per-packet in `mem.Buffer` | 256KB chunks |
| Buffer management | EWMA prediction | 16KB pool | 256KB `sync.Pool` |
| gRPC framing | Within existing messages | Within existing messages | Dedicated new RPCs |
| Per-packet overhead | Proto `bytes` field tag | Proto tag + `mem.Buffer` | Zero (packets concatenated) |
| OLTP-READONLY-OLAP QPS | N/A (never ran) | ~0% change | +5.63% |
| TPCC-OLAP QPS | N/A | ~0% change | +35.46% |
| Memory (OLAP) | N/A | ~0% change | -18.04% |
The previous PRs tried to optimize the encoding of results (smarter serialization, zero-copy buffers). PR #19620 eliminates the encoding entirely on the hot side (vttablet) and batches the data into bulk chunks. The performance win comes not from doing the same work faster, but from doing fundamentally less work on the vttablet side and amortizing the remaining work (vtgate parsing) over large chunks.