Three PRs have attempted to optimize Vitess streaming by reducing the overhead of MySQL-to-vtgate data transfer. PRs #15522 and #17135 showed no meaningful performance gains. PR #19620 achieves +5.6% QPS on OLTP-READONLY-OLAP and +35% QPS on TPCC-OLAP with statistical significance (p=0.000). This document analyzes why.
Embed raw MySQL packets directly into the protobuf `QueryResult.raw_packets` field, deferring parsing to the vtgate client via lazy `ParseResult()`. Added EWMA-based buffer size prediction for pre-allocation.
```
MySQL → readPacketAsProto() → QueryResult.raw_packets (protobuf bytes field)
      → gRPC (cached proto, skip re-marshal)
      → vtgate: mysql.ParseResult() lazily parses on access
```
Key design: Pack raw MySQL packets into the existing `QueryResult` protobuf message as a `repeated bytes raw_packets` field. The server builds the protobuf wire format directly using `protowire`, then caches the serialized bytes to skip re-marshaling at gRPC send time.
- Incomplete implementation: The server-side code explicitly disabled raw packets (`request.Options.RawMysqlPackets = false` with TODOs in multiple locations). Large packet handling was `panic("TODO: large packets")`. The PR was never in a runnable state — benchmarks were never generated.
- Same protobuf envelope: Raw packets were stuffed into the existing `QueryResult` message. Even though re-serialization was avoided via caching, the vtgate client still received a standard `StreamExecuteResponse` and had to detect and handle the `raw_packets` path. The protobuf framing overhead remained.
- Only targeted Execute, not StreamExecute: The `grpctabletconn` client changes only modified `Execute()`, `BeginExecute()`, `ReserveExecute()`, and `ReserveBeginExecute()`. The streaming RPCs (`StreamExecute` etc.) were untouched. Since the OLAP benchmarks use streaming, this PR wouldn't have moved the needle on those workloads even if completed.
- Added complexity without removing it: EWMA buffer prediction, `queryResultBuilder`, and the `vtcachedMessage` interface all added code paths but didn't eliminate the fundamental cost: protobuf framing. Each raw packet was still individually length-prefixed as a protobuf `bytes` field within `QueryResult`, adding per-packet overhead.
- Closed due to inactivity (auto-closed May 11, 2024 after 30 days stale). No reviewer engagement, no benchmarks.
PR #15522 was an exploratory WIP that never reached a testable state. Its core limitation was trying to optimize within the existing protobuf message structure rather than bypassing it entirely.
Read raw MySQL packets as `mem.Buffer` objects, prefix each with a protobuf tag, and send via gRPC's `mem.BufferSlice` zero-copy mechanism. The vtgate client parses raw packets from `ExecuteResponse.raw_packets`.
```
MySQL → readPacketAsMemBuffer() → mem.Buffer with proto tag prefix
      → gRPC mem.BufferSlice (zero-copy send)
      → vtgate: updateResFromRaw() parses raw_packets back to sqltypes.Result
```
Key design: Use gRPC's `mem.Buffer` API to avoid copying raw MySQL packets during serialization. Each packet is prefixed with 5 bytes (protobuf field tag + length) so gRPC can send the `BufferSlice` directly without marshaling.
- 40% allocation reduction, 0% throughput improvement: vmg's analysis (Nov 21 comment) was definitive — despite reducing allocation costs by 40% on the vttablet side, there was no QPS increase, no latency reduction, and no observable end-to-end benefit.
- The v2 codec had already captured the wins: vmg noted that PR #16790 (gRPC v2 codec) had already shipped weeks prior, and "most of the performance wins to be had with pooling were already covered" by it. The remaining allocations being eliminated were not memory ballast that triggered additional GC.
- Per-packet protobuf framing remained: Each raw MySQL packet was individually wrapped with a protobuf tag prefix. For a streaming query returning thousands of small rows (e.g., 100-byte rows), this means thousands of 5-byte prefixes plus individual `mem.Buffer` objects. The overhead of managing these many small buffers offset the savings from avoiding `sqltypes.Result` allocation.
- Parsing still happened on both sides: vttablet's `readPacketAsMemBuffer()` still needed to read and understand each packet to add the correct protobuf prefix. vtgate's `updateResFromRaw()` then parsed the raw packets into `sqltypes.Result`. The total parsing work across the system was roughly the same — it just moved from vttablet to vtgate.
- No batching: Each MySQL packet was sent as an individual `raw_packets` entry. A query with 10,000 rows meant 10,000+ entries in `raw_packets`. In contrast, PR #19620 batches packets into 256KB chunks, dramatically reducing per-row overhead.
- 16KB buffer pool too small: The `MemBufReader` used a 16KB default buffer pool. For typical streaming workloads that read many rows, this meant frequent pool interactions and potential fragmentation.
PR #17135 proved that reducing allocations on one side of the stack is not sufficient when the fundamental data flow architecture remains the same. The per-packet protobuf framing and lack of batching meant the optimization was swallowed by other costs.
Entirely new gRPC RPCs (`StreamExecuteRaw` etc.) that send raw MySQL wire protocol bytes as opaque 256KB chunks. vttablet does zero parsing — it reads raw bytes from MySQL's buffered reader directly into a pooled buffer and sends them over gRPC. vtgate parses the chunks once via `RawResultParser`.
```
MySQL → ReadHeaderInto/ReadDataInto into pooled 256KB buffer
      → gRPC: StreamExecuteRawResponse{raw: buf[:n]} (thin wrapper)
      → vtgate: RawResultParser.Feed(chunk) → sqltypes.Result
```
| Step | Old path | PR #17135 | PR #19620 |
|---|---|---|---|
| Read MySQL packet | `readEphemeralPacket()` → alloc per packet | `readPacketAsMemBuffer()` → alloc per packet | `ReadDataInto(buf[offset:])` → into pooled buffer |
| Parse fields | `readColumnDefinition()` → `querypb.Field` | Still parsed for proto framing | Skipped entirely |
| Parse rows | `parseRow()` → `[]sqltypes.Value` | Still parsed for proto framing | Skipped entirely |
| Serialize | vtproto marshal `StreamExecuteResponse` | `mem.Buffer` with proto tag prefix | `StreamExecuteRawResponse{raw: buf}` (trivial) |
PR #19620 eliminates ALL parsing and allocation on the vttablet side. The MySQL bytes flow from the kernel's TCP buffer → Go's buffered reader → the 256KB output buffer → gRPC. No `sqltypes.Result`, no `querypb.Field`, no per-row `[]sqltypes.Value`.
PR #17135 wrapped each MySQL packet individually in protobuf:
```
[proto tag (5B)][MySQL packet 1] [proto tag (5B)][MySQL packet 2] ...
```
For 10,000 rows of 100 bytes each: 10,000 proto tags = 50KB overhead, 10,000 mem.Buffer objects.
PR #19620 batches all packets into 256KB chunks:
```
[MySQL packet 1][MySQL packet 2]...[MySQL packet N]   // up to 256KB
```
For the same 10,000 rows: ~4 chunks, 4 gRPC messages, zero per-packet overhead.
The 256KB buffer is pooled via `sync.Pool`:

```go
var rawStreamBufPool = sync.Pool{New: func() any {
	b := make([]byte, rawStreamBufSize)
	return &b
}}
```

Across concurrent streaming queries, buffers are reused. Compare:

- Old path: `make([]byte, length)` per packet in `readOnePacket()`, plus `allocStreamResult()` from pool for result batching
- PR #17135: 16KB pool buffers, one per packet
- PR #19620: One 256KB buffer per query, pooled across queries
`streamQueryResultPackets` reads directly into the output buffer with minimal syscalls:

- Header read: `ReadHeaderInto(buf[bufOffset:])` — directly into the output buffer, no intermediate copy
- Payload read: a single `ReadDataInto(buf[bufOffset:bufOffset+packetLength])` in the fast path (payload fits in the remaining buffer)
- Context check only on error, not per-packet

The old path does `readHeaderFrom()` + `io.ReadFull()` per packet, plus `parseRow()` + `MakeTrusted()` + `appendRow()` + vtproto marshal. PR #19620 does `ReadHeaderInto()` + `ReadDataInto()` → done.
The `RawResultParser.Feed()` fast path processes the gRPC response bytes directly without copying:

```go
if len(p.buf) > 0 {
	p.buf = append(p.buf, chunk...) // Rare: leftover from split packet
	data = p.buf
} else {
	data = chunk // Common: process in-place, zero copy
}
```

This avoids a 256KB memcpy per gRPC message in the common case.
PRs #15522 and #17135 both added fields to existing proto messages (`QueryResult.raw_packets`, `ExecuteOptions.raw_mysql_packets`). This meant:
- Detection logic on both sides to check if raw packets were present
- Fallback codepaths for non-raw responses
- Both old and new fields in the same message
PR #19620 defines entirely new RPCs (`StreamExecuteRaw`, `BeginStreamExecuteRaw`, etc.) with dedicated request/response messages. No detection needed — if you call the raw RPC, you get raw bytes. Clean separation.
| Dimension | PR #15522 | PR #17135 | PR #19620 |
|---|---|---|---|
| Status | WIP, never runnable | Complete, benchmarked | Complete, benchmarked |
| Parsing on vttablet | Partial (field types needed) | Partial (proto tags needed) | None |
| Parsing on vtgate | Lazy, deferred | Full, from `raw_packets` | Full, from raw chunks |
| Batching | Per-packet in proto array | Per-packet in `mem.Buffer` | 256KB chunks |
| Buffer management | EWMA prediction | 16KB pool | 256KB `sync.Pool` |
| gRPC framing | Within existing messages | Within existing messages | Dedicated new RPCs |
| Per-packet overhead | Proto `bytes` field tag | Proto tag + `mem.Buffer` | Zero (packets concatenated) |
| OLTP-READONLY-OLAP QPS | N/A (never ran) | ~0% change | +5.63% |
| TPCC-OLAP QPS | N/A | ~0% change | +35.46% |
| Memory (OLAP) | N/A | ~0% change | -18.04% |
The previous PRs tried to optimize the encoding of results (smarter serialization, zero-copy buffers). PR #19620 eliminates the encoding entirely on the hot side (vttablet) and batches the data into bulk chunks. The performance win comes not from doing the same work faster, but from doing fundamentally less work on the vttablet side and amortizing the remaining work (vtgate parsing) over large chunks.