@efstathiosntonas
Created December 6, 2025 09:22
Claude Skill from blog: https://goperf.dev/01-common-patterns/

Go-Performance - Compiler Optimization

Pages: 1


Leveraging Compiler Optimization Flags - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/comp-flags/

Contents:

  • Leveraging Compiler Optimization Flags in Go
  • Why Compiler Flags Matter
  • Key Compiler and Linker Flags
    • -ldflags="-s -w" — Strip Debug Info
    • -gcflags — Control Compiler Optimizations
    • Cross-Compilation Flags
    • Build Tags
    • -ldflags="-X ..." — Inject Build-Time Variables
    • -extldflags='-static' — Build Fully Static Binaries
      • Example: Static Build with libcurl via CGO

When tuning Go applications for performance, most of the attention goes to runtime behavior—profiling hot paths, trimming allocations, improving concurrency. But there’s another layer that’s easy to miss: what the Go compiler does with your code before it ever runs. The build process includes several optimization passes, and understanding how to surface or influence them can give you clearer insights into what’s actually happening under the hood. It’s not about tweaking obscure flags to squeeze out extra instructions—it’s about knowing how the compiler treats your code so you’re not working against it.

While Go doesn’t expose the same granular set of compiler flags as C or Rust, it still provides useful ways to influence how your code is built—especially when targeting performance, binary size, or specific environments.

Go's compiler (specifically cmd/compile and cmd/link) performs several default optimizations: inlining, escape analysis, dead code elimination, and more. However, there are scenarios where you can squeeze more performance or control from your build using the right flags.

When you want to shrink binary size, especially in production or containers:

Why it matters: This can reduce binary size by up to 30-40%, depending on your codebase. It is useful in Docker images or when distributing binaries.

The -gcflags flag allows you to control how the compiler treats specific packages. For example, you can disable optimizations for debugging:

When to use: During debugging sessions with Delve or similar tools. Turning off inlining and optimizations makes stack traces and breakpoints more reliable.

Need to build for another OS or architecture?

Build tags allow conditional compilation. Use //go:build or // +build in your source code to control what gets compiled in.
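
For instance, assuming a file guarded by a debug tag (as in the Go example further below), the tagged code is compiled in by passing -tags at build time; the tag and file names here are illustrative:

go build -tags debug -o app main.go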

You can inject version numbers or metadata into your binary at build time:

This sets the version variable at link time without modifying your source code. It's useful for embedding release versions, commit hashes, or build dates.
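
A minimal sketch of the pattern, assuming a package-level string variable in main (the variable name and injected value are placeholders):

var version = "dev" // in package main; overridden at link time

go build -ldflags="-X main.version=1.2.3" -o app main.go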

The -extldflags '-static' option passes the -static flag to the external system linker, instructing it to produce a fully statically linked binary.

This is especially useful when you're using CGO and want to avoid runtime dynamic library dependencies:
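
A sketch of such a build invocation (the exact flags depend on your toolchain and C dependencies):

CGO_ENABLED=1 go build -ldflags="-s -w -extldflags '-static'" -o app main.go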

To go further and ensure your binary avoids relying on C library DNS resolution (such as glibc's getaddrinfo), you can use the netgo build tag. This forces Go to use its pure Go implementation of the DNS resolver.
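
Combined with a static build, the invocation might look like this (a sketch; adjust the flags to your environment):

go build -tags netgo -ldflags="-extldflags '-static'" -o app main.go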

[Content truncated]

Examples:

Example 1 (unknown):

go build -ldflags="-s -w" -o app main.go

Example 2 (unknown):

go build -gcflags="all=-N -l" -o app main.go

Example 3 (unknown):

GOOS=linux GOARCH=arm64 go build -o app main.go

Example 4 (go):

//go:build debug

package main

import "log"

func debugLog(msg string) {
    log.Println("[DEBUG]", msg)
}

Go-Performance - Concurrency

Pages: 5


Lazy Initialization - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/lazy-init/

Contents:

  • Lazy Initialization
  • Lazy Initialization for Performance in Go
    • Why Lazy Initialization Matters
    • Using sync.Once for Thread-Safe Initialization
    • Using sync.OnceValue and sync.OnceValues for Initialization with Output Values
    • Custom Lazy Initialization with Atomic Operations
    • Performance Considerations
  • Benchmarking Impact
  • When to Choose Lazy Initialization

In Go, some resources are expensive to initialize, or simply unnecessary unless certain code paths are triggered. That’s where lazy initialization becomes useful: it defers the construction of a value until the moment it’s actually needed. This pattern can improve performance, reduce startup overhead, and avoid unnecessary work—especially in high-concurrency applications.

Initializing heavy resources like database connections, caches, or large in-memory structures at startup can slow down application launch and consume memory before it’s actually needed. Lazy initialization defers this work until the first time the resource is used, keeping startup fast and memory usage lean.

It’s also a practical pattern when you have logic that might be triggered multiple times but should only run once—ensuring that expensive operations aren’t repeated and that initialization remains safe and idempotent across concurrent calls.

Go provides the sync.Once type to implement lazy initialization safely in concurrent environments:

In this example, the function expensiveInit() executes exactly once, no matter how many goroutines invoke getResource() concurrently. This ensures thread-safe initialization without additional synchronization overhead.

Since Go 1.21, if your initialization logic returns a value, you might prefer using sync.OnceValue (single value) or sync.OnceValues (multiple values) for simpler, more expressive code:

Here, sync.OnceValue provides a concise way to wrap one-time initialization logic and access the result without managing flags or mutexes manually. It simplifies lazy loading by directly returning the computed value on demand.

For cases where the initializer returns more than one value—such as a resource and an error—sync.OnceValues extends the same idea. It ensures the function runs exactly once and cleanly unpacks the results, keeping the code readable and thread-safe without boilerplate.

Choosing sync.OnceValue or sync.OnceValues helps you clearly express initialization logic with direct value returns, whereas sync.Once remains best suited for general scenarios requiring flexible initialization logic without immediate value returns.

Yes, it’s technically possible to replace sync.Once, sync.OnceValue, or sync.OnceFunc with custom logic using low-level atomic operations like atomic.CompareAndSwap or atomic.Load/Store. In rare, performance-critical paths, this can avoid the small overhead or allocations that come with the standard types.

Howev

[Content truncated]

Examples:

Example 1 (go):

var (
    resource *MyResource
    once     sync.Once
)

func getResource() *MyResource {
    once.Do(func() {
        resource = expensiveInit()
    })
    return resource
}

Example 2 (go):

var getResource = sync.OnceValue(func() *MyResource {
    return expensiveInit()
})

func processData() {
    res := getResource()
    // use res
}

Example 3 (go):

var getConfig = sync.OnceValues(func() (*Config, error) {
    return loadConfig("config.yml")
})

func processData() {
    config, err := getConfig()
    if err != nil {
        log.Fatal(err)
    }
    // use config
}

Goroutine Worker Pools - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/worker-pool/

Contents:

  • Goroutine Worker Pools in Go
  • Why Worker Pools Matter
  • Basic Worker Pool Implementation
    • Worker Count and CPU Cores
    • Why Too Many Workers Hurts Performance
  • Benchmarking Impact
  • When To Use Worker Pools

Go’s concurrency model makes it deceptively easy to spin up thousands of goroutines—but that ease can come at a cost. Each goroutine starts small, but under load, unbounded concurrency can cause memory usage to spike, context switches to pile up, and overall performance to become unpredictable.

A worker pool helps apply backpressure by limiting the number of active goroutines. Instead of spawning one per task, a fixed pool handles work in controlled parallelism—keeping memory usage predictable and avoiding overload. This makes it easier to maintain steady performance even as demand scales.

While launching a goroutine for every task is idiomatic and often effective, doing so at scale comes with trade-offs. Each goroutine requires stack space and introduces scheduling overhead. Performance can degrade sharply when the number of active goroutines grows, especially in systems handling unbounded input like HTTP requests, jobs from a queue, or tasks from a channel.

A worker pool maintains a fixed number of goroutines that pull tasks from a shared job queue. This creates a backpressure mechanism, ensuring the system never processes more work concurrently than it can handle. Worker pools are particularly valuable when the cost of each task is predictable, and the overall system throughput needs to be stable.

Here’s a minimal implementation of a worker pool:

In this example, five workers pull from the jobs channel and push results to the results channel. The worker pool limits concurrency to five tasks at a time, regardless of how many tasks are sent.

The optimal number of workers in a pool is closely tied to the number of CPU cores, which you can obtain in Go using runtime.NumCPU() or runtime.GOMAXPROCS(0). For CPU-bound tasks—where each worker consumes substantial CPU time—you generally want the number of workers to be equal to or slightly less than the number of logical CPU cores. This ensures maximum core utilization without excessive overhead.
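
A sketch of sizing the pool from the core count, reusing the worker, jobs, and results names from the example below:

workerCount := runtime.NumCPU() // CPU-bound: roughly one worker per logical core
for w := 0; w < workerCount; w++ {
    go worker(w, jobs, results)
}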

If your tasks are I/O-bound (e.g., network calls, disk I/O, database queries), the pool size can be larger than the number of cores. This is because workers will spend much of their time blocked, allowing others to run. In contrast, CPU-heavy workloads benefit from a smaller, tightly bounded pool that avoids contention and context switching.

Adding more workers can seem like a straightforward way to boost throughput, but the benefits taper off quickly past a certain point. Once you exceed the system’s optimal lev

[Content truncated]

Examples:

Example 1 (go):

func worker(id int, jobs <-chan int, results chan<- [32]byte) {
    for j := range jobs {
        results <- doWork(j)
    }
}

func doWork(n int) [32]byte {
    data := []byte(fmt.Sprintf("payload-%d", n))
    return sha256.Sum256(data)                  // (1)
}

func main() {
    jobs := make(chan int, 100)
    results := make(chan [32]byte, 100)

    for w := 1; w <= 5; w++ {
        go worker(w, jobs, results)
    }

    for j := 1; j <= 10; j++ {
        jobs <- j
    }
    close(jobs)

    for a := 1; a <= 10; a++ {
        <-results
    }
}

Example 2 (go):

package perf

import (
    // "log"
    "fmt"
    // "os"
    "runtime"
    "sync"
    "testing"
    "crypto/sha256"
)

const (
    numJobs     = 10000
    workerCount = 10
)

func doWork(n int) [32]byte {
    data := []byte(fmt.Sprintf("payload-%d", n))
    return sha256.Sum256(data)
}

func BenchmarkUnboundedGoroutines(b *testing.B) {
    for b.Loop() {
        var wg sync.WaitGroup
        wg.Add(numJobs)

        for j := 0; j < numJobs; j++ {
            go func(job int) {
                _ = doWork(job)
                wg.Done()
            }(j)
        }
        wg.Wait()
    }
}

func 
...

Efficient Context Management - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/context/

Contents:

  • Efficient Context Management
  • Why Context Matters
  • Practical Examples of Context Usage
    • HTTP Server Request Cancellation
    • Database Operations with Timeouts
    • Propagating Request IDs for Distributed Tracing
    • Concurrent Worker Management
    • Graceful Shutdown in CLI Tools
    • Streaming and Real-Time Data Pipelines
    • Middleware and Rate Limiting

Whether you're handling HTTP requests, coordinating worker goroutines, or querying external services, there's often a need to cancel in-flight operations or enforce execution deadlines. Go’s context package is designed for precisely that—it provides a consistent and thread-safe way to manage operation lifecycles, propagate metadata, and ensure resources are cleaned up promptly.

Go provides two base context constructors: context.Background() and context.TODO().

The context package in Go is designed to carry deadlines, cancellation signals, and other request-scoped values across API boundaries. It's especially useful in concurrent programs where operations need to be coordinated and canceled cleanly.

A typical context workflow begins at the entry point of a program or request—like an HTTP handler, main function, or RPC server. From there, a base context is created using context.Background() or context.TODO(). This context can then be extended using constructors like:
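
A sketch of those constructors (parent, requestIDKey, and the durations are placeholder values):

ctx1, cancel1 := context.WithCancel(parent)                                 // manual cancellation
ctx2, cancel2 := context.WithTimeout(parent, 2*time.Second)                 // cancel after a duration
ctx3, cancel3 := context.WithDeadline(parent, time.Now().Add(time.Minute))  // cancel at a fixed time
ctx4 := context.WithValue(parent, requestIDKey, "12345")                    // attach request-scoped data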

Each of these functions returns a new context that wraps its parent. Cancellation signals, deadlines, and values are automatically propagated down the call stack. When a context is canceled—either manually or by timeout—any goroutines or functions listening on <-ctx.Done() are immediately notified.

By passing context explicitly through function parameters, you avoid hidden dependencies and gain fine-grained control over the execution lifecycle of concurrent operations.

The following examples show how context.Context enables better control, observability, and resource management across a variety of real-world scenarios.

Contexts help gracefully handle cancellations when clients disconnect early. Every incoming HTTP request in Go carries a context that gets canceled if the client closes the connection. By checking <-ctx.Done(), you can exit early instead of doing unnecessary work:

In this example, the handler waits for either a simulated delay or cancellation. If the client closes the connection before the timeout, ctx.Done() is triggered, allowing the handler to clean up without writing a response.

Contexts provide a straightforward way to enforce timeouts on database queries. Many drivers support QueryContext or similar methods that respect cancellation:

In this case, the context is automatically canceled if the database does not respond within two seconds. The query is aborted, and the application doesn’t hang indefinitely. This helps manage resources and avoids cascading failures in

[Content truncated]

Examples:

Example 1 (go):

func handler(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()
    select {
    case <-time.After(5 * time.Second):
        fmt.Fprintln(w, "Response after delay")
    case <-ctx.Done():
        log.Println("Client disconnected")
    }
}

Example 2 (unknown):

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

rows, err := db.QueryContext(ctx, "SELECT * FROM users")
if err != nil {
    log.Fatal(err)
}
defer rows.Close()

Example 3 (go):

func main() {
    ctx := context.WithValue(context.Background(), "requestID", "12345")
    handleRequest(ctx)
}

func handleRequest(ctx context.Context) {
    log.Printf("Handling request with ID: %v", ctx.Value("requestID"))
}

Example 4 (unknown):

ctx, cancel := context.WithCancel(context.Background())

for i := 0; i < 10; i++ {
    go worker(ctx, i)
}

// Cancel workers after some condition or signal
cancel()

Immutable Data Sharing - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/immutable-data/

Contents:

  • Immutable Data Sharing
  • Why Immutable Data?
  • Practical Example: Shared Config
    • Step 1: Define the Config Struct
    • Step 2: Ensure Deep Immutability
    • Step 3: Atomic Swapping
    • Step 4: Using It in Handlers
  • Practical Example: Immutable Routing Table
    • Step 1: Define Route Structs
    • Step 2: Build Immutable Version

One common source of slowdown in high-performance Go programs is the way shared data is accessed under concurrency. The usual tools—mutexes and channels—work well, but they’re not free. Mutexes can become choke points if many goroutines try to grab the same lock. Channels, while elegant for coordination, can introduce blocking and make control flow harder to reason about. Both require careful use: it’s easy to introduce subtle bugs or unexpected performance issues if synchronization isn’t tight.

A powerful alternative is immutable data sharing. Instead of protecting data with locks, you design your system so that shared data is never mutated after it's created. This minimizes contention and simplifies reasoning about your program.

Immutability brings several advantages to concurrent programs:

Imagine you have a long-running service that periodically reloads its configuration from a disk or a remote source. Multiple goroutines read this configuration to make decisions.

Here's how immutable data helps:

Maps and slices in Go are reference types. Even if the Config struct isn't changed, someone could accidentally mutate a shared map. To prevent this, we make defensive copies:

Now, every config instance is self-contained and safe to share.

Use atomic.Value to store and safely update the current config.

Now all goroutines can safely call GetConfig() with no locks. When the config is reloaded, you just Store a new immutable copy.
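
A hypothetical reload path, reusing NewConfig and currentConfig from the examples below:

func ReloadConfig(logLevel string, timeout time.Duration, features map[string]bool) {
    currentConfig.Store(NewConfig(logLevel, timeout, features))
}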

Suppose you're building a lightweight reverse proxy or API gateway and must route incoming requests based on path or host. The routing table is read thousands of times per second and updated only occasionally (e.g., from a config file or service discovery).

To ensure immutability, we deep-copy the slice of routes when constructing a new routing table.
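
A sketch of what that construction could look like (the Route fields and holder variable are illustrative, not the article's exact code):

type Route struct {
    Prefix  string
    Backend string
}

type RoutingTable struct {
    routes []Route
}

var currentTable atomic.Pointer[RoutingTable]

// NewRoutingTable deep-copies the input slice so callers cannot mutate the shared table.
func NewRoutingTable(routes []Route) *RoutingTable {
    copied := make([]Route, len(routes))
    copy(copied, routes)
    return &RoutingTable{routes: copied}
}

func UpdateRoutes(routes []Route) {
    currentTable.Store(NewRoutingTable(routes))
}

func GetTable() *RoutingTable {
    return currentTable.Load()
}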

Now, your routing logic can scale safely under load with zero locking overhead.

As systems grow, routing tables can expand to hundreds or even thousands of entries. While immutability brings clear benefits—safe concurrent access, predictable behavior—it becomes costly if every update means copying the entire structure. At some point, rebuilding the whole table for each minor change doesn’t scale.

To keep immutability without paying for full reconstruction on every update, the design needs to evolve. There are several ways to do this—each preserving the core benefits while reducing overhead.

Imagine a multi-tenant system where each customer has their own set of routing rules. I

[Content truncated]

Examples:

Example 1 (unknown):

// config.go
type Config struct {
    LogLevel string
    Timeout  time.Duration
    Features map[string]bool // This needs attention!
}

Example 2 (go):

func NewConfig(logLevel string, timeout time.Duration, features map[string]bool) *Config {
    copiedFeatures := make(map[string]bool, len(features))
    for k, v := range features {
        copiedFeatures[k] = v
    }

    return &Config{
        LogLevel: logLevel,
        Timeout:  timeout,
        Features: copiedFeatures,
    }
}

Example 3 (go):

var currentConfig atomic.Pointer[Config]

func LoadInitialConfig() {
    cfg := NewConfig("info", 5*time.Second, map[string]bool{"beta": true})
    currentConfig.Store(cfg)
}

func GetConfig() *Config {
    return currentConfig.Load()
}

Example 4 (go):

func handler(w http.ResponseWriter, r *http.Request) {
    cfg := GetConfig()
    if cfg.Features["beta"] {
        // Enable beta path
    }
    // Use cfg.Timeout, cfg.LogLevel, etc.
}

Atomic Operations and Synchronization Primitives - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/atomic-ops/

Contents:

  • Atomic Operations and Synchronization Primitives
  • Understanding Atomic Operations
    • Memory Model and Comparison to C++
    • Common Atomic Operations
    • When to Use Atomic Operations in Real Life
      • High-throughput metrics and Counters
      • Fast, Lock-Free Flags
      • Once-Only Initialization
      • Lock-Free Queues or Freelist Structures
      • Reducing Lock Contention

In high-concurrency systems, performance isn't just about what you do—it's about what you avoid. Lock contention, cache line bouncing and memory fences quietly shape throughput long before you hit your scaling ceiling. Atomic operations are among the leanest tools Go offers to sidestep these pitfalls.

While Go provides the full suite of synchronization primitives, there's a class of problems where locks feel like overkill. Atomics offer clarity and speed for low-level coordination—counters, flags, and simple state machines, especially under pressure.

Atomic operations allow safe concurrent access to shared data without explicit locking mechanisms like mutexes. The sync/atomic package provides low-level atomic memory primitives ideal for counters, flags, or simple state transitions.

The key benefit of atomic operations is performance under contention. Locking introduces coordination overhead—when many goroutines contend for a mutex, performance can degrade due to context switching and lock queue management. Atomics avoid this by operating directly at the hardware level using CPU instructions like CAS (compare-and-swap). This makes them particularly useful for high-throughput counters and metrics, fast lock-free flags, once-only initialization, lock-free data structures, and reducing contention on hot locks.

Understanding memory models is crucial when reasoning about concurrency. In C++, developers have fine-grained control over atomic operations via memory orderings, which allows them to trade off between performance and consistency. By default, Go's atomic operations enforce sequential consistency, which means they behave like std::memory_order_seq_cst in C++. This is the strongest and safest memory ordering.

Go does not expose weaker memory models like relaxed, acquire, or release. This is an intentional simplification to promote safety and reduce the risk of subtle data races. All atomic operations in Go imply synchronization across goroutines, ensuring correct behavior without manual memory fencing.

This means you don’t have to reason about instruction reordering or memory visibility at a low level—but it also means you can’t fine-tune for performance in the way C++ or Rust developers might use relaxed atomics.

Low-level access to relaxed memory ordering in Go exists internally (e.g., in the runtime or through go:linkname), but it’s not safe or supported for use in application-level code.

Tracking request counts, dropped packets, or other lightweight stats:

This code allows multiple goroutines to safely increment a shared counter without using locks. atomic.AddInt64 ensures each addition i

[Content truncated]

Examples:

Example 1 (go):

var requests atomic.Int64

func handleRequest() {
    requests.Add(1)
}

Example 2 (go):

var shutdown atomic.Int32

func mainLoop() {
    for {
        if shutdown.Load() == 1 {
            break
        }
        // do work
    }
}

func stop() {
    shutdown.Store(1)
}

Example 3 (go):

import (
    "runtime"
    "sync/atomic"
    "unsafe"
)

var resource unsafe.Pointer
var initStatus int32 // 0: not started, 1: in progress, 2: completed

func getResource() *MyResource {
    if atomic.LoadInt32(&initStatus) == 2 {
        return (*MyResource)(atomic.LoadPointer(&resource))
    }

    if atomic.CompareAndSwapInt32(&initStatus, 0, 1) {
        newRes := expensiveInit() // initialization logic
        atomic.StorePointer(&resource, unsafe.Pointer(newRes))
        atomic.StoreInt32(&initStatus, 2)
        return newRes
    }

    for atomic.LoadInt32(&initStatus) != 2 {
        runtime.Gosched() // yield until the initializing goroutine finishes
    }
    return (*MyResource)(atomic.LoadPointer(&resource))
}

Example 4 (go):

type node struct {
    next *node
    val  any
}

var head atomic.Pointer[node]

func push(n *node) {
    for {
        old := head.Load()
        n.next = old
        if head.CompareAndSwap(old, n) {
            return
        }
    }
}

Go-Performance - Escape Analysis

Pages: 1


Stack Allocations and Escape Analysis - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/stack-alloc/

Contents:

  • Stack Allocations and Escape Analysis
  • What Is Escape Analysis?
    • Why does it matter?
    • Example: Stack vs Heap
  • How to View Escape Analysis Output
  • What Causes Variables to Escape?
    • Returning Pointers to Local Variables
    • Capturing Variables in Closures
    • Interface Conversions
    • Assignments to Global Variables or Struct Fields

When writing performance-critical Go applications, one of the subtle but significant optimizations you can make is encouraging values to be allocated on the stack rather than the heap. Stack allocations are cheaper, faster, and garbage-free—but Go doesn't always put your variables there automatically. That decision is made by the Go compiler during escape analysis.

In this article, we’ll explore what escape analysis is, how to read the compiler’s escape diagnostics, what causes values to escape, and how to structure your code to minimize unnecessary heap allocations. We'll also benchmark different scenarios to show the real-world impact.

Escape analysis is a static analysis performed by the Go compiler to determine whether a variable can be safely allocated on the stack or if it must be moved ("escape") to the heap.

The compiler decides where to place each variable based on how it's used. If a variable can be guaranteed to not outlive its declaring function, it can stay on the stack. If not, it escapes to the heap.

In allocate, x is returned as a pointer. Since the pointer escapes the function, the Go compiler places x on the heap. In noEscape, x is a plain value and doesn’t escape.

You can inspect escape analysis with the -gcflags compiler option:

Or for a specific file:

This will print lines like:

Look for messages like moved to heap to identify escape points.

Here are common scenarios that force heap allocation: returning pointers to local variables, capturing variables in closures, interface conversions, and assignments to global variables or struct fields.

When a value is stored in an interface, it may escape:
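
A minimal sketch of that case (the type and function names are hypothetical):

type point struct{ x, y int }

func toInterface() any {
    p := point{1, 2}
    return any(p) // boxing p into an interface typically moves it to the heap
}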

Go may allocate large structs or slices on the heap even if they don’t strictly escape.

Let’s run a benchmark to explore when heap allocations actually occur—and when they don’t, even if we return a pointer.

You might expect HeapAlloc to always allocate memory on the heap—but it doesn’t here. That’s because the compiler is smart: in this isolated benchmark, the pointer returned by HeapAlloc doesn’t escape the function in any meaningful way. The compiler can see it’s only used within the benchmark and short-lived, so it safely places it on the stack too.

As shown in BenchmarkHeapAllocEscape, assigning the pointer to a global variable causes a real heap escape. This introduces real overhead: a 40x slower call, a 24-byte allocation, and one garbage-collected object per call.

Not all escapes are worth preventing. Here’s when it makes sense to focus on stack allocation—and when it’s better to let values escape.

When It’s Fine to Let Values Escape

Examples:

Example 1 (go):

func allocate() *int {
    x := 42
    return &x // x escapes to the heap
}

func noEscape() int {
    x := 42
    return x // x stays on the stack
}

Example 2 (unknown):

go build -gcflags="-m" ./path/to/pkg

Example 3 (unknown):

go run -gcflags="-m" main.go

Example 4 (unknown):

main.go:10:6: moved to heap: x
main.go:14:6: can inline noEscape

Go-Performance - Garbage Collector

Pages: 1


Memory Efficiency and Go’s Garbage Collector - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/gc/

Contents:

  • Memory Efficiency: Mastering Go’s Garbage Collector
  • How Go's Garbage Collector Works
    • Non-generational
    • Concurrent
    • Tri-color Mark and Sweep
  • GC Tuning: GOGC
    • Memory Limiting with GOMEMLIMIT
    • GOMEMLIMIT=X and GOGC=off configuration
  • Practical Strategies for Reducing GC Pressure
    • Prefer Stack Allocation

Memory management in Go is automated—but it’s not invisible. Every allocation you make contributes to GC workload. The more frequently objects are created and discarded, the more work the runtime has to do reclaiming memory.

This becomes especially relevant in systems prioritizing low latency, predictable resource usage, or high throughput. Tuning your allocation patterns and leveraging newer features like weak references can help reduce pressure on the GC without adding complexity to your code.

We highly encourage you to read the official A Guide to the Go Garbage Collector! It provides a detailed description of many of Go's GC internals.

Go uses a non-generational, concurrent, tri-color mark-and-sweep garbage collector. Here's what that means in practice and how it's implemented.

Many modern GCs, like those in the JVM or .NET CLR, divide memory into generations (young and old) under the assumption that most objects die young. These collectors focus on the young generation, which leads to shorter collection cycles.

Go’s GC takes a different approach. It treats all objects equally—no generational segmentation—not because generational GC conflicts with short pause times or concurrent scanning, but because it hasn’t shown clear, consistent benefits in real-world Go programs with the designs tried so far. This choice avoids the complexity of promotion logic and specialized memory regions. While it can mean scanning more objects overall, this cost is mitigated by concurrent execution and efficient write barriers.

Go’s GC runs concurrently with your application, which means it does most of its work without stopping the world. Concurrency is implemented using multiple phases that interleave with normal program execution: a brief stop-the-world sweep termination, concurrent marking alongside running goroutines (with write barriers enabled), a short stop-the-world mark termination, and concurrent sweeping.

Even though Go’s garbage collector is mostly concurrent, it still requires brief Stop-The-World (STW) pauses at several points to maintain correctness. These pauses are kept extremely short—typically under 100 microseconds—even with large heaps and hundreds of goroutines.

STW is essential for ensuring that memory structures are not mutated while the GC analyzes them. In most applications, these pauses are imperceptible. However, even sub-millisecond pauses in latency-sensitive systems can be significant—so understanding and monitoring STW behavior becomes important when optimizing for tail latencies or jitter.
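
One low-effort way to observe these pauses is the runtime's GC trace, which prints one line per collection cycle including the stop-the-world phases (output format varies by Go version):

GODEBUG=gctrace=1 ./your-service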

Write barriers ensure correctness while the application mutates objects during concurrent marking. These barriers help t

[Content truncated]

Examples:

Example 1 (unknown):

GOGC=100  # Default: GC runs when heap grows 100% since last collection
GOGC=off  # Disables GC (use only in special cases like short-lived CLI tools)

Example 2 (unknown):

GOMEMLIMIT=400MiB

Example 3 (unknown):

import "runtime/debug"

debug.SetMemoryLimit(2 << 30) // 2 GiB

Example 4 (unknown):

GOGC=100 GOMEMLIMIT=4GiB ./your-service

Go-Performance Documentation Index

Categories

Compiler Optimization

File: compiler_optimization.md Pages: 1

Concurrency

File: concurrency.md Pages: 5

Escape Analysis

File: escape_analysis.md Pages: 1

Garbage Collector

File: garbage_collector.md Pages: 1

I/O Optimization

File: io_optimization.md Pages: 2

Memory Management

File: memory_management.md Pages: 6

Go-Performance - I/O Optimization

Pages: 2


Batching Operations - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/batching-ops/

Contents:

  • Batching Operations in Go
  • Why Batching Matters
  • What a generic Batcher may look like
  • Benchmarking Impact
  • When To Use Batching

Batching is one of those techniques that’s easy to overlook but incredibly useful when performance starts to matter. Instead of handling one operation at a time, you group them together—cutting down on the overhead of repeated calls, whether that’s hitting the network, writing to disk, or making a database commit. It’s a practical, low-complexity approach that can reduce latency and stretch your system’s throughput further than you’d expect.

Most systems don’t struggle because individual operations are too slow—they struggle because they do too many of them. Every call out to a database, API, or filesystem adds some fixed cost: a system call, a network round trip, maybe a lock or a context switch. When those costs add up across high-volume workloads, the impact is hard to ignore. Batching helps by collapsing those calls into fewer, more efficient units of work, which often leads to measurable gains in both performance and resource usage.

Consider a logging service writing to disk:

When invoked thousands of times per second, the file system is inundated with individual write system calls, significantly degrading performance. A better approach is to aggregate log entries and flush them in bulk:

With batching, each write operation handles multiple entries simultaneously, reducing syscall overhead and improving disk I/O efficiency.

While batching offers substantial performance advantages, it also introduces the risk of data loss. If an application crashes before a batch is flushed, the in-memory data can be lost. Systems dealing with critical or transactional data must incorporate safeguards such as periodic flushes, persistent storage buffers, or recovery mechanisms to mitigate this risk.

We can implement a generic batcher in a very straightforward manner:

This batcher implementation expects that you will never call Batcher.Add(...) from your flush() function. We have this limitation because Go mutexes are not recursive.

This batcher works with any data type, making it a flexible solution for aggregating logs, metrics, database writes, or other grouped operations. Internally, the buffer acts as a queue that accumulates items until a flush threshold is reached. The use of sync.Mutex ensures that Add() and flushNow() are safe for concurrent access, which is necessary in most real-world systems where multiple goroutines may write to the batcher.
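
A hypothetical usage sketch, wiring the batcher from the example below to the earlier log-flushing scenario (f is an assumed *os.File):

logBatcher := NewBatcher[string](100, func(items []string) {
    f.WriteString(strings.Join(items, "\n") + "\n")
})
logBatcher.Add("log entry")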

From a performance standpoint, it's true that a lock-free implementation—using atomic operations or conc

[Content truncated]

Examples:

Example 1 (unknown):

func logLine(line string) {
    f.WriteString(line + "\n")
}

Example 2 (go):

var batch []string

func logBatch(line string) {
    batch = append(batch, line)
    if len(batch) >= 100 {
        f.WriteString(strings.Join(batch, "\n") + "\n")
        batch = batch[:0]
    }
}

Example 3 (go):

type Batcher[T any] struct {
    mu     sync.Mutex
    buffer []T
    size   int
    flush  func([]T)
}

func NewBatcher[T any](size int, flush func([]T)) *Batcher[T] {
    return &Batcher[T]{
        buffer: make([]T, 0, size),
        size:   size,
        flush:  flush,
    }
}

func (b *Batcher[T]) Add(item T) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.buffer = append(b.buffer, item)
    if len(b.buffer) >= b.size {
        b.flushNow()
    }
}

func (b *Batcher[T]) flushNow() {
    if len(b.buffer) == 0 {
        return
    }
    b.flush(b.buffer)
    b.buffer = b.buffer[:0]
}

Example 4 (go):

package perf

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "os"
    "strings"
    "testing"
)

var lines = make([]string, 10000)

func init() {
    for i := range lines {
        lines[i] = fmt.Sprintf("log entry %d %s", i, strings.Repeat("x", 100))
    }
}

// --- 1. No I/O ---

func BenchmarkUnbatchedProcessing(b *testing.B) {
    for b.Loop() {
        for _, line := range lines {
            strings.ToUpper(line)
        }
    }
}

func BenchmarkBatchedProcessing(b *testing.B) {
    batchSize := 100
    for b.Loop() {
        for i := 0; i < len(lines); i += batchSize {
  
...

Efficient Buffering - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/buffered-io/

Contents:

  • Efficient Buffering in Go
  • Why Buffering Matters
    • With Buffering
    • Controlling Buffer Capacity
  • Benchmarking Impact
  • When To Buffer

Buffering is a core performance technique in systems programming. In Go, it's especially relevant when working with I/O—file access, network communication, and stream processing. Without buffering, many operations incur excessive system calls or synchronization overhead. Proper buffering reduces the frequency of such interactions, improves throughput, and smooths latency spikes.

Every time you read from or write to a file or socket, there’s a good chance you’re triggering a system call—and that’s not cheap. System calls move control from user space into kernel space, which means crossing a boundary that comes with overhead: entering kernel mode, possible context switches, interacting with I/O buffers, and sometimes queuing operations behind the scenes. Doing that once in a while is fine. Doing it thousands of times per second? That’s a problem. Buffering helps by batching small reads or writes into larger chunks, reducing how often you cross that boundary and making far better use of each syscall.

For example, writing to a file in a loop without buffering, like this:

This can easily result in 10,000 separate system calls, each carrying its own overhead and dragging down performance. On top of that, a flood of small writes tends to fragment disk operations, which puts extra pressure on I/O subsystems and wastes CPU cycles handling what could have been a single, efficient batch.

This version significantly reduces the number of system calls. The bufio.Writer accumulates writes in an internal memory buffer (typically 4KB or more). It only triggers a syscall when the buffer is full or explicitly flushed. As a result, you achieve faster I/O, reduced CPU usage, and improved performance.

bufio.Writer does not automatically flush when closed. If you forget to call Flush(), any unwritten data remaining in the buffer will be lost. Always call Flush() before closing or returning from a function, especially if the total written size is smaller than the buffer capacity.

By default, bufio.NewWriter() allocates a 4096-byte (4 KB) buffer. This size aligns with the common block size of file systems and the standard memory page size on most operating systems (such as Linux, BSD, and macOS). Reading or writing in 4 KB increments minimizes page faults, aligns with kernel read-ahead strategies, and maps efficiently onto underlying disk I/O operations.

While 4 KB is a practical general-purpose default, it might not be optimal for all workloads. For high-throughput scenari

[Content truncated]

Examples:

Example 1 (unknown):

f, _ := os.Create("output.txt")
for i := 0; i < 10000; i++ {
    f.Write([]byte("line\n"))
}

Example 2 (unknown):

f, _ := os.Create("output.txt")
buf := bufio.NewWriter(f)
for i := 0; i < 10000; i++ {
    buf.WriteString("line\n")
}
buf.Flush() // ensure all buffered data is written

Example 3 (unknown):

f, _ := os.Create("output.txt")
buf := bufio.NewWriterSize(f, 16*1024) // 16 KB buffer

Example 4 (unknown):

reader := bufio.NewReaderSize(f, 32*1024) // 32 KB buffer for input

Go-Performance - Memory Management

Pages: 6


Memory Preallocation - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/mem-prealloc/

Contents:

  • Memory Preallocation
  • Why Preallocation Matters
  • Practical Preallocation Examples
    • Slice Preallocation
    • Map Preallocation
  • Benchmarking Impact
  • When To Preallocate

Memory preallocation is a simple but effective way to improve performance in Go programs that work with slices or maps that grow over time. Instead of letting the runtime resize these structures as they fill up—often at unpredictable points—you allocate the space you need upfront. This avoids the cost of repeated allocations, internal copying, and extra GC pressure as intermediate objects are created and discarded.

In high-throughput or latency-sensitive systems, preallocating memory makes execution more predictable and helps avoid performance cliffs that show up under load. If the workload size is known or can be reasonably estimated, there’s no reason to let the allocator do the guessing.

Go’s slices and maps grow automatically as new elements are added, but that convenience comes with a cost. When capacity is exceeded, the runtime allocates a larger backing array or hash table and copies the existing data over. This reallocation adds memory pressure, burns CPU cycles, and can stall tight loops in high-throughput paths. In performance-critical code—especially where the size is known or can be estimated—frequent resizing is unnecessary overhead. Preallocating avoids these penalties by giving the runtime enough room to work without interruption.

Go uses a hybrid growth strategy for slices to balance speed and memory efficiency. Early on, capacities double with each expansion—2, 4, 8, 16—minimizing the number of allocations. But once a slice exceeds around 1024 elements, the growth rate slows to roughly 25%. So instead of jumping from 1024 to 2048, the next allocation might grow to about 1280.

This shift reduces memory waste on large slices but increases the frequency of allocations if the final size is known but not preallocated. In those cases, using make([]T, 0, expectedSize) is the more efficient choice—it avoids repeated resizing and cuts down on unnecessary copying.

Output illustrating typical growth:

Without preallocation, each append operation might trigger new allocations:

This pattern causes Go to allocate larger underlying arrays repeatedly as the slice grows, resulting in memory copying and GC pressure. We can avoid that by using make with a specified capacity:

If it is known that the slice will be fully populated, we can be even more efficient by avoiding bounds checks:
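
A sketch of that variant, allocating the full length up front and writing by index instead of appending:

result := make([]int, 10000)
for i := range result {
    result[i] = i
}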

Maps grow similarly. By default, Go doesn’t know how many elements you’ll add, so it resizes the underlying structure as needed.

Starting with Go 1.11, you can preallo

[Content truncated]

Examples:

Example 1 (unknown):

s := make([]int, 0)
for i := 0; i < 10_000; i++ {
    s = append(s, i)
    fmt.Printf("Len: %d, Cap: %d\n", len(s), cap(s))
}

Example 2 (unknown):

Len: 1, Cap: 1
Len: 2, Cap: 2
Len: 3, Cap: 4
Len: 5, Cap: 8
...
Len: 1024, Cap: 1024
Len: 1025, Cap: 1280

Example 3 (unknown):

// Inefficient
var result []int
for i := 0; i < 10000; i++ {
    result = append(result, i)
}

Example 4 (unknown):

// Efficient
result := make([]int, 0, 10000)
for i := 0; i < 10000; i++ {
    result = append(result, i)
}

Zero-Copy Techniques - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/zero-copy/

Contents:

  • Zero-Copy Techniques
  • Understanding Zero-Copy
  • Common Zero-Copy Techniques in Go
    • Using io.Reader and io.Writer Interfaces
    • Slicing for Efficient Data Access
    • Memory Mapping (mmap)
  • Benchmarking Impact
    • File I/O: Memory Mapping vs. Standard Read
  • When to Use Zero-Copy
    • Real-World Use Cases and Libraries

When writing performance-critical Go code, how memory is managed often has a bigger impact than it first appears. Zero-copy techniques are one of the more effective ways to tighten that control. Instead of moving bytes from buffer to buffer, these techniques work directly on existing memory—avoiding copies altogether. That means less pressure on the CPU, better cache behavior, and fewer GC-triggered pauses. For I/O-heavy systems—whether you’re streaming files, handling network traffic, or parsing large datasets—this can translate into much higher throughput and lower latency without adding complexity.

In the usual I/O path, data moves back and forth between user space and kernel space—first copied into a kernel buffer, then into your application’s buffer, or the other way around. It works, but it’s wasteful. Every copy burns CPU cycles and clogs up memory bandwidth. Zero-copy changes that. Instead of bouncing data between buffers, it lets applications work directly with what’s already in place—no detours, no extra copies. The result? Lower CPU load, better use of memory, and faster I/O, especially when throughput or latency actually matter.

Using interfaces like io.Reader and io.Writer gives you fine-grained control over how data flows. Instead of spinning up new buffers every time, you can reuse existing ones and keep memory usage steady. In practice, this avoids unnecessary garbage collection pressure and keeps your I/O paths clean and efficient—especially when you’re dealing with high-throughput or streaming workloads.

io.CopyBuffer reuses a provided buffer, avoiding repeated allocations and intermediate copies. An in-depth io.CopyBuffer explanation is available on SO.

Slicing large byte arrays or buffers instead of copying data into new slices is a powerful zero-copy strategy:

Slices in Go are inherently zero-copy since they reference the underlying array.

Using memory mapping enables direct access to file contents without explicit read operations:

This approach maps file contents directly into memory, entirely eliminating copying between kernel and user-space.

Here's a basic benchmark illustrating performance differences between explicit copying and zero-copy slicing:

In BenchmarkCopy, each iteration copies a 64KB buffer into a fresh slice—allocating memory and duplicating data every time. That cost adds up fast. BenchmarkSlice, on the other hand, just re-slices the same buffer—no allocation, no copying, just new view on the same data. The di

[Content truncated]

Examples:

Example 1 (go):

func StreamData(src io.Reader, dst io.Writer) error {
    buf := make([]byte, 4096) // Reusable buffer
    _, err := io.CopyBuffer(dst, src, buf)
    return err
}

Example 2 (unknown):

func process(buffer []byte) []byte {
    return buffer[128:256] // returns a slice reference without copying
}

Example 3 (go):

import "golang.org/x/exp/mmap"

func ReadFileZeroCopy(path string) ([]byte, error) {
    r, err := mmap.Open(path)
    if err != nil {
        return nil, err
    }
    defer r.Close()

    data := make([]byte, r.Len())
    _, err = r.ReadAt(data, 0)
    return data, err
}

Example 4 (go):

func BenchmarkCopy(b *testing.B) {
    data := make([]byte, 64*1024)
    for b.Loop() {
        buf := make([]byte, len(data))
        copy(buf, data)
    }
}

func BenchmarkSlice(b *testing.B) {
    data := make([]byte, 64*1024)
    for b.Loop() {
        _ = data[:]
    }
}

Object Pooling - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/object-pooling/

Contents:

  • Object Pooling
  • How Object Pooling Works
    • Using sync.Pool for Object Reuse
      • Without Object Pooling (Inefficient Memory Usage)
      • With Object Pooling (Optimized Memory Usage)
    • Pooling Byte Buffers for Efficient I/O
  • Benchmarking Impact
  • When Should You Use sync.Pool?

Object pooling helps reduce allocation churn in high-throughput Go programs by reusing objects instead of allocating fresh ones each time. This avoids repeated work for the allocator and eases pressure on the garbage collector, especially when dealing with short-lived or frequently reused structures.

Go’s sync.Pool provides a built-in way to implement pooling with minimal code. It’s particularly effective for objects that are expensive to allocate or that would otherwise contribute to frequent garbage collection cycles. While not a silver bullet, it’s a low-friction tool that can lead to noticeable gains in latency and CPU efficiency under sustained load.

Object pooling allows programs to reuse memory by recycling previously allocated objects instead of creating new ones on every use. Rather than hitting the heap each time, objects are retrieved from a shared pool and returned once they’re no longer needed. This reduces the number of allocations, cuts down on garbage collection workload, and leads to more predictable performance—especially in workloads with high object churn or tight latency requirements.

In the above example, every iteration creates a new Data instance, leading to unnecessary allocations and increased GC pressure.

Object pooling is especially effective when working with large byte slices that would otherwise lead to high allocation and garbage collection overhead.

Using sync.Pool for byte buffers significantly reduces memory pressure when dealing with high-frequency I/O operations.

To prove that object pooling actually reduces allocations and improves speed, we can use Go's built-in memory profiling tools (pprof) and compare memory allocations between the non-pooled and pooled versions. Simulating a full-scale application that actively uses memory for benchmarking is challenging, so we need a controlled test to evaluate direct heap allocations versus pooled allocations.

The benchmark results highlight the contrast in performance and memory usage between direct allocations and object pooling. In BenchmarkWithoutPooling, each iteration creates a new object on the heap, leading to higher execution time and increased memory consumption. This constant allocation pressure triggers more frequent garbage collection, which adds latency and reduces throughput. The presence of nonzero allocation counts per operation confirms that each iteration contributes to GC load, making this approach less efficient in high-throughput scenarios.

Avoid sy

[Content truncated]

Examples:

Example 1 (go):

package main

import (
    "fmt"
)

type Data struct {
    Value int
}

func createData() *Data {
    return &Data{Value: 42}
}

func main() {
    for i := 0; i < 1000000; i++ {
        obj := createData() // Allocating a new object every time
        _ = obj // Simulate usage
    }
    fmt.Println("Done")
}

Example 2 (go):

package main

import (
    "fmt"
    "sync"
)

type Data struct {
    Value int
}

var dataPool = sync.Pool{
    New: func() any {
        return &Data{}
    },
}

func main() {
    for i := 0; i < 1000000; i++ {
        obj := dataPool.Get().(*Data) // Retrieve from pool
        obj.Value = 42 // Use the object
        dataPool.Put(obj) // Return object to pool for reuse
    }
    fmt.Println("Done")
}

Example 3 (go):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

var bufferPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer)
    },
}

func main() {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    buf.WriteString("Hello, pooled world!")
    fmt.Println(buf.String())
    bufferPool.Put(buf) // Return buffer to pool for reuse
}

Example 4 (go):

package perf

import (
    "sync"
    "testing"
)

// Data is a struct with a large fixed-size array to simulate a memory-intensive object.
type Data struct {
    Values [1024]int
}

// BenchmarkWithoutPooling measures the performance of direct heap allocations.
func BenchmarkWithoutPooling(b *testing.B) {
    for b.Loop() {
        data := &Data{}      // Allocating a new object each time
        data.Values[0] = 42  // Simulating some memory activity
    }
}

// dataPool is a sync.Pool that reuses instances of Data to reduce memory allocations.
var dataPool = sync.Pool{
    New: func() any {
...

Avoiding Interface Boxing - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/interface-boxing/

Contents:

  • Avoiding Interface Boxing
  • What is Interface Boxing?
  • Why It Matters
  • Benchmarking Impact
    • Boxing Large Structs
    • Passing to a Function That Accepts an Interface
  • When Interface Boxing Is Acceptable
    • When abstraction is more important than performance
    • When values are small and boxing is allocation-free
    • When values are short-lived

Go’s interfaces make it easy to write flexible, decoupled code. But behind that convenience is a detail that can trip up performance: when a concrete value is assigned to an interface, Go wraps it in a hidden structure—a process called interface boxing.

In many cases, boxing is harmless. But in performance-sensitive code—like tight loops, hot paths, or high-throughput services—it can introduce hidden heap allocations, extra memory copying, and added pressure on the garbage collector. These effects often go unnoticed during development, only showing up later as latency spikes or memory bloat.

Interface boxing refers to the process of converting a concrete value to an interface type. In Go, an interface value is internally represented as two words: a pointer to the type information and a pointer to the underlying data.

When you assign a value to an interface variable, Go creates this two-part structure. If the value is a non-pointer type—like a struct or primitive—and is not already on the heap, Go may allocate a copy of it on the heap to satisfy the interface assignment. This behavior is especially relevant when working with large values or when storing items in a slice of interfaces, where each element gets individually boxed. These implicit allocations can add up and are a common source of hidden memory pressure in Go programs.

Here’s a simple example:

In this case, the integer 42 is boxed into an interface: Go stores the type information (int) and a copy of the value 42. This is inexpensive for small values like int, but for large structs, the cost becomes non-trivial.

Pay attention to this code! In this example, even though shapes is a slice of interfaces, each Square value is copied into an interface when appended to shapes. If Square were a large struct, this would introduce 1000 allocations and a large amount of memory copying.

To avoid that, you could pass pointers:

This way, only an 8-byte pointer is stored in the interface, reducing both allocation size and copying overhead.

In tight loops or high-throughput paths, such as unmarshalling JSON, rendering templates, or processing large collections, interface boxing can degrade performance by triggering unnecessary heap allocations and increasing GC pressure. This overhead is especially costly in systems with high concurrency or real-time responsiveness constraints.

Boxing can also make profiling and benchmarking misleading, since allocations attributed to innocuous-looking lines may actually stem from implicit conversions to interfaces.

For the benchmarking we will define

[Content truncated]

Examples:

Example 1 (unknown):

var i interface{}
i = 42

Example 2 (go):

type Shape interface {
    Area() float64
}

type Square struct {
    Size float64
}

func (s Square) Area() float64 { return s.Size * s.Size }

func main() {
    var shapes []Shape
    for i := 0; i < 1000; i++ {
        s := Square{Size: float64(i)}
        shapes = append(shapes, s) // boxing occurs here
    }
}

Example 3 (unknown):

shapes = append(shapes, &s) // avoids large struct copy

Example 4 (unknown):

type Worker interface {
    Work()
}

type LargeJob struct {
    payload [4096]byte
}

func (LargeJob) Work() {}

Struct Field Alignment - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/fields-alignment/

Contents:

  • Struct Field Alignment
  • Why Alignment Matters
  • Benchmarking Impact
  • Avoiding False Sharing in Concurrent Workloads
  • When To Align Structs

When optimizing Go programs for performance, struct layout and memory alignment often go unnoticed—yet they have a measurable impact on memory usage and cache efficiency. Go automatically aligns struct fields based on platform-specific rules, inserting padding to satisfy alignment constraints. Understanding and controlling memory alignment isn’t just a low-level detail—it can have a real impact on how your Go programs perform, especially in tight loops or high-throughput systems. Proper alignment can reduce the overall memory footprint, make better use of CPU caches, and eliminate subtle performance penalties that add up under load.

Modern CPUs are tuned for predictable memory access. When struct fields are misaligned or split across cache lines, the processor often has to do extra work to fetch the data. That can mean additional memory cycles, more cache misses, and slower performance overall. These costs are easy to overlook in everyday code but show up quickly in code that’s sensitive to throughput or latency. In Go, struct fields are aligned according to their type requirements, and the compiler inserts padding bytes to meet these constraints. If fields are arranged without care, unnecessary padding may inflate struct size significantly, affecting memory use and bandwidth.

Consider the following two structs:

On a 64-bit system, PoorlyAligned requires 24 bytes due to the padding between fields, whereas WellAligned fits into 16 bytes by ordering fields from largest to smallest alignment requirement.
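
You can verify those sizes with unsafe.Sizeof (a quick check, not part of the article's benchmark):

fmt.Println(unsafe.Sizeof(PoorlyAligned{})) // 24 on a 64-bit platform
fmt.Println(unsafe.Sizeof(WellAligned{}))   // 16 on a 64-bit platform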

We benchmarked both struct layouts by allocating 10 million instances of each and measuring allocation time and memory usage:

In a test with 10 million structs, the WellAligned version used 80MB less memory than its poorly aligned counterpart—and it also ran a bit faster. This isn’t just about saving RAM; it shows how struct layout directly affects allocation behavior and memory bandwidth. When you’re working with large volumes of data or performance-critical paths, reordering fields for better alignment can lead to measurable gains with minimal effort.

In addition to memory layout efficiency, struct alignment also plays a crucial role in concurrent systems. When multiple goroutines access different fields of the same struct that reside on the same CPU cache line, they may suffer from false sharing—where changes to one field cause invalidations in the other, even if logically unrelated.

On modern CPUs, a typical cache line is 64 bytes wide. When a stru

[Content truncated]

Examples:

Example 1 (go):

type PoorlyAligned struct {
    flag  bool  // 1 byte, followed by 7 bytes of padding so count starts on an 8-byte boundary
    count int64 // 8 bytes
    id    byte  // 1 byte, followed by 7 bytes of trailing padding (24 bytes total on 64-bit)
}

type WellAligned struct {
    count int64 // 8 bytes
    flag  bool  // 1 byte
    id    byte  // 1 byte, followed by 6 bytes of trailing padding (16 bytes total on 64-bit)
}
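
To verify the sizes quoted above, a small check with unsafe.Sizeof can be dropped next to these definitions (a sketch assuming the structs live in the same package; the sizes are for a typical 64-bit platform):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    fmt.Println(unsafe.Sizeof(PoorlyAligned{})) // 24 on 64-bit
    fmt.Println(unsafe.Sizeof(WellAligned{}))   // 16 on 64-bit
}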

Example 2 (go):

func BenchmarkPoorlyAligned(b *testing.B) {
    for b.Loop() {
        var items = make([]PoorlyAligned, 10_000_000)
        for j := range items {
            items[j].count = int64(j)
        }
    }
}

func BenchmarkWellAligned(b *testing.B) {
    for b.Loop() {
        var items = make([]WellAligned, 10_000_000)
        for j := range items {
            items[j].count = int64(j)
        }
    }
}
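
To reproduce the comparison, the benchmarks above can be run with memory statistics enabled (assuming they sit in a *_test.go file):

go test -bench='Aligned' -benchmem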

Example 3 (go):

type SharedCounterBad struct {
    a int64
    b int64
}

type SharedCounterGood struct {
    a int64
    _ [56]byte // Padding to prevent a and b from sharing a cache line
    b int64
}

Example 4 (go):

func BenchmarkFalseSharing(b *testing.B) {
    var c SharedCounterBad  // (1)
    var wg sync.WaitGroup

    for b.Loop() {
        wg.Add(2)
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.a++
            }
            wg.Done()
        }()
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.b++
            }
            wg.Done()
        }()
        wg.Wait()
    }
}
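
For comparison, a companion benchmark for the padded struct, mirroring the one above (like the original, it increments plain int64 fields from two goroutines purely to expose cache-line behavior, so it is not race-detector clean):

func BenchmarkNoFalseSharing(b *testing.B) {
    var c SharedCounterGood
    var wg sync.WaitGroup

    for b.Loop() {
        wg.Add(2)
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.a++
            }
            wg.Done()
        }()
        go func() {
            for i := 0; i < 1_000_000; i++ {
                c.b++
            }
            wg.Done()
        }()
        wg.Wait()
    }
}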

Common Go Patterns for Performance - Go Optimization Guide

URL: https://goperf.dev/01-common-patterns/

Contents:

  • Common Go Patterns for Performance¶
  • Memory Management & Efficiency¶
  • Concurrency and Synchronization¶
  • I/O Optimization and Throughput¶
  • Compiler-Level Optimization and Tuning¶

Optimizing Go applications requires understanding common patterns that help reduce latency, improve memory efficiency, and enhance concurrency. This guide organizes 15 key techniques into four practical categories.

These strategies help reduce memory churn, avoid excessive allocations, and improve cache behavior.

Object Pooling Reuse objects to reduce GC pressure and allocation overhead.

Memory Preallocation Allocate slices and maps with capacity upfront to avoid costly resizes.

Struct Field Alignment Optimize memory layout to minimize padding and improve locality.

Avoiding Interface Boxing Prevent hidden allocations by avoiding unnecessary interface conversions.

Zero-Copy Techniques Minimize data copying with slicing and buffer tricks.

Memory Efficiency and Go’s Garbage Collector Reduce GC overhead by minimizing heap usage and reusing memory.

Stack Allocations and Escape Analysis Use escape analysis to help values stay on the stack where possible.

Manage goroutines, shared resources, and coordination efficiently.

Goroutine Worker Pools Control concurrency with a fixed-size pool to limit resource usage.

Atomic Operations and Synchronization Primitives Use atomic operations or lightweight locks to manage shared state.

Lazy Initialization (sync.Once) Delay expensive setup logic until it's actually needed.

Immutable Data Sharing Share data safely between goroutines without locks by making it immutable.

Efficient Context Management Use context to propagate timeouts and cancel signals across goroutines.

Reduce system call overhead and increase data throughput for I/O-heavy workloads.

Efficient Buffering Use buffered readers/writers to minimize I/O calls.

Batching Operations Combine multiple small operations to reduce round trips and improve throughput.

Tap into Go’s compiler and linker to further optimize your application.

Leveraging Compiler Optimization Flags Use build flags like -gcflags and -ldflags for performance tuning.

Stack Allocations and Escape Analysis Analyze which values escape to the heap to help the compiler optimize memory placement.


| name | description |
| --- | --- |
| go-networking | Go networking performance patterns and best practices. Use when optimizing network I/O, building high-performance servers, managing connections, tuning TCP/HTTP/gRPC, or diagnosing networking issues in Go applications. |

Go-Networking Skill

Comprehensive assistance with go-networking development, generated from official documentation.

When to Use This Skill

This skill should be triggered when:

  • Working with go-networking
  • Asking about go-networking features or APIs
  • Implementing go-networking solutions
  • Debugging go-networking code
  • Learning go-networking best practices

Quick Reference

Common Patterns

Pattern 1: GOMAXPROCS, epoll/kqueue, and Scheduler-Level Tuning

Go applications operating at high concurrency levels frequently encounter performance ceilings that are not attributable to CPU saturation. These limitations often stem from runtime-level mechanics: how goroutines (G) are scheduled onto logical processors (P) via operating system threads (M), how blocking operations affect thread availability, and how the runtime interacts with kernel facilities like epoll or kqueue for I/O readiness. Unlike surface-level code optimization, resolving these issues requires awareness of the Go scheduler’s internal design, particularly how GOMAXPROCS governs execution parallelism and how thread contention, cache locality, and syscall latency emerge under load. Misconfigured runtime settings can lead to excessive context switching, stalled P’s, and degraded throughput despite available cores.

System-level tuning—through CPU affinity, thread pinning, and scheduler introspection—provides a critical path to improving latency and throughput in multicore environments. When paired with precise benchmarking and observability, these adjustments allow Go services to scale more predictably and take full advantage of modern hardware architectures.

Understanding GOMAXPROCS¶

In Go, GOMAXPROCS defines the maximum number of operating system threads (M’s) simultaneously executing user-level Go code (G’s). It defaults to the machine’s logical CPU count.
Under the hood, the scheduler exposes P’s (processors) equal to GOMAXPROCS. Each P hosts a run queue of G’s and binds to a single M to execute Go code. package main import ( "fmt" "runtime" ) func main() { // Show current value fmt.Printf("GOMAXPROCS = %d\n", runtime.GOMAXPROCS(0)) // Set to 4 and confirm prev := runtime.GOMAXPROCS(4) fmt.Printf("Changed from %d to %d\n", prev, runtime.GOMAXPROCS(0)) } When developers increase GOMAXPROCS, developers allow more P’s—and therefore more OS threads—to run Go‑routines in parallel. That often boosts performance for CPU‑bound workloads. However, more P’s also incur more context switches, more cache thrashing, and potentially more contention in shared data structures (e.g., the garbage collector’s work queues). It's important to understand that blindly scaling past the sweet spot can actually degrade latency. Diving into Go’s Scheduler Internals¶ Go’s scheduler organizes three core actors: G (goroutine), M (OS thread), and P (logical processor), see more details here. When a goroutine makes a blocking syscall, its M detaches from its P, returning the P to the global scheduler so another M can pick it up. This design prevents syscalls from starving CPU‑bound goroutines. The scheduler uses work stealing: each P maintains a local run queue, and idle P’s will steal work from busier peers. If developers set GOMAXPROCS too high, developers will see diminishing returns in stolen work versus the overhead of balancing those run queues. Enabling scheduler tracing via GODEBUG can reveal fine grained metrics: GODEBUG=schedtrace=1000,scheddetail=1 go run main.go schedtrace=1000 instructs the runtime to print scheduler state every 1000 milliseconds (1 second). scheddetail=1 enables additional information per logical processor (P), such as individual run queue lengths. Each printed trace includes statistics like: SCHED 3024ms: gomaxprocs=14 idleprocs=14 threads=26 spinningthreads=0 needspinning=0 idlethreads=20 runqueue=0 gcwaiting=false nmidlelocked=1 stopwait=0 sysmonwait=false P0: status=0 schedtick=173 syscalltick=3411 m=nil runqsize=0 gfreecnt=6 timerslen=0 ... P13: status=0 schedtick=96 syscalltick=310 m=nil runqsize=0 gfreecnt=2 timerslen=0 M25: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil ... The first line reports global scheduler state including whether garbage collection is blocking (gcwaiting), if spinning threads are needed, and idle thread counts. Each P line details the logical processor's scheduler activity, including the number of times it's scheduled (schedtick), system call activity (syscalltick), timers, and free goroutine slots. The M lines correspond to OS threads. Each line shows which goroutine—if any—is running on that thread, whether the thread is idle, spinning, or blocked, along with memory allocation activity and lock states. This view makes it easier to spot not only classic concurrency bottlenecks but also deeper issues: scheduler delays, blocking syscalls, threads that spin without doing useful work, or CPU cores that sit idle when they shouldn’t. The output reveals patterns that aren’t visible from logs or metrics alone. gomaxprocs=14: Number of logical processors (P’s). idleprocs=14: All processors are idle, indicating no runnable goroutines. threads=26: Number of M’s (OS threads) created. spinningthreads=0: No threads are actively searching for work. needspinning=0: No additional spinning threads are requested by the scheduler. 
idlethreads=20: Number of OS threads currently idle. runqueue=0: Global run queue is empty. gcwaiting=false: Garbage collector is not blocking execution. nmidlelocked=1: One P is locked to a thread that is currently idle. stopwait=0: No goroutines waiting to stop the world. sysmonwait=false: The system monitor is actively running, not sleeping. The global run queue holds goroutines that are not bound to any specific P or that overflowed local queues. In contrast, each logical processor (P) maintains a local run queue of goroutines it is responsible for scheduling. Goroutines are preferentially enqueued locally for performance: local queues avoid lock contention and improve cache locality. It may be placed on the global queue only when a P's local queue is full, or a goroutine originates from outside a P (e.g., from a syscall). This dual-queue strategy reduces synchronization overhead across P’s and enables efficient scheduling under high concurrency. Understanding the ratio of local vs global queue activity helps diagnose whether the system is under-provisioned, improperly balanced, or suffering from excessive cross-P migrations. These insights help quantify how efficiently goroutines are scheduled, how much parallelism is actually utilized, and whether the system is under- or over-provisioned in terms of logical processors. Observing these patterns under load is crucial when adjusting GOMAXPROCS, diagnosing tail latency, or identifying scheduler contention. Netpoller: Deep Dive into epoll on Linux and kqueue on BSD¶ In any Go application handling high connection volumes, the network poller plays a critical behind-the-scenes role. At its core, Go uses the OS-level multiplexing facilities—epoll on Linux and kqueue on BSD/macOS—to monitor thousands of sockets concurrently with minimal threads. The runtime leverages these mechanisms efficiently, but understanding how and why reveals opportunities for tuning, especially under demanding loads. When a goroutine initiates a network operation like reading from a TCP connection, the runtime doesn't immediately block the underlying thread. Instead, it registers the file descriptor with the poller—using epoll_ctl in edge-triggered mode or EV_SET with EVFILT_READ—and parks the goroutine. The actual thread (M) becomes free to run other goroutines. When data arrives, the kernel signals the poller thread, which in turn wakes the appropriate goroutine by scheduling it onto a P’s run queue. This wakeup process minimizes contention by relying on per-P notification lists and avoids runtime lock bottlenecks. Go uses edge-triggered notifications, which signal only on state transitions—like new data becoming available. This design requires the application to drain sockets fully during each wakeup or risk missing future events. While more complex than level-triggered behavior, edge-triggered mode significantly reduces syscall overhead under load. Here's a simplified version of what happens under the hood during a read operation: func pollAndRead(conn net.Conn) ([]byte, error) { buf := make([]byte, 4096) for { n, err := conn.Read(buf) if n > 0 { return buf[:n], nil } if err != nil && !isTemporary(err) { return nil, err } // Data not ready yet — goroutine will be parked until poller wakes it } } Internally, Go runs a dedicated poller thread that loops on epoll_wait or kevent, collecting batches of events (typically 512 at a time). 
After the call returns, the runtime processes these events, distributing wakeups across logical processors to prevent any single P from becoming a bottleneck. To further promote scheduling fairness, the poller thread may rotate across P’s periodically, a behavior governed by GODEBUG=netpollWaitLatency. Go’s runtime is optimized to reduce unnecessary syscalls and context switches. All file descriptors are set to non-blocking, which allows the poller thread to remain responsive. To avoid the thundering herd problem—where multiple threads wake on the same socket—the poller ensures only one goroutine handles a given FD event at a time. The design goes even further by aligning the circular event buffer with cache lines and distributing wakeups via per-P lists. These details matter at scale. With proper alignment and locality, Go reduces CPU cache contention when thousands of connections are active. For developers looking to inspect poller behavior, enabling tracing with GODEBUG=netpoll=1 can surface system-level latencies and epoll activity. Additionally, the GODEBUG=netpollWaitLatency=200 flag configures the poller’s willingness to hand off to another P every 200 microseconds. That’s particularly helpful in debugging idle P starvation or evaluating fairness in high-throughput systems. Here's a small experiment that logs event activity: GODEBUG=netpoll=1 go run main.go You’ll see log lines like: runtime: netpoll: poll returned n=3 runtime: netpoll: waking g=102 for fd=5 Most developers never need to think about this machinery—and they shouldn't. But these details become valuable in edge cases, like high-throughput HTTP proxies or latency-sensitive services dealing with hundreds of thousands of concurrent sockets. Tuning parameters like GOMAXPROCS, adjusting the event buffer size, or modifying poller wake-up intervals can yield measurable performance improvements, particularly in tail latencies. For example, in a system handling hundreds of thousands of concurrent HTTP/2 streams, increasing GOMAXPROCS while using GODEBUG=netpollWaitLatency=100 helped reduce the 99th percentile read latency by over 15%, simply by preventing poller starvation under I/O backpressure. As with all low-level tuning, it's not about changing knobs blindly. It's about knowing what Go’s netpoller is doing, why it’s structured the way it is, and where its boundaries can be nudged for just a bit more efficiency—when measurements tell you it’s worth it. Thread Pinning with LockOSThread and GODEBUG Flags¶ Go offers tools like runtime.LockOSThread() to pin a goroutine to a specific OS thread, but in most real-world applications, the payoff is minimal. Benchmarks consistently show that for typical server workloads—especially those that are CPU-bound—Go’s scheduler handles thread placement well without manual intervention. Introducing thread pinning tends to add complexity without delivering measurable gains. There are exceptions. In ultra-low-latency or real-time systems, pinning can help reduce jitter by avoiding thread migration. But these gains typically require isolated CPU cores, tightly controlled environments, and strict latency targets. In practice, that means bare metal. On shared infrastructure—especially in cloud environments like AWS where cores are virtualized and noisy neighbors are common—thread pinning rarely delivers any measurable benefit. If you’re exploring pinning, it’s not enough to assume benefit—you need to benchmark it. 
Enabling GODEBUG=schedtrace=1000,scheddetail=1 gives detailed insight into how goroutines are scheduled and whether contention or migration is actually a problem. Without that evidence, thread pinning is more likely to hinder than help. Here's how developers might pin threads cautiously: runtime.LockOSThread() defer runtime.UnlockOSThread() // perform critical latency-sensitive work here Always pair such modifications with extensive metrics collection and scheduler tracing (GODEBUG=schedtrace=1000,scheddetail=1) to validate tangible gains over Go’s robust default scheduling behavior. CPU Affinity and External Tools¶ Using external tools like taskset or system calls such as sched_setaffinity can bind threads or processes to specific CPU cores. While theoretically beneficial for cache locality and predictable performance, extensive benchmarking consistently demonstrates limited practical value in most Go applications. Explicit CPU affinity management typically helps only in tightly controlled environments with: Real-time latency constraints (microsecond-level jitter). Dedicated and isolated CPUs (e.g., via Linux kernel’s isolcpus). Avoidance of thread migration on NUMA hardware. Example of cautious CPU affinity usage: func setAffinity(cpuList []int) error { pid := os.Getpid() var mask unix.CPUSet for _, cpu := range cpuList { mask.Set(cpu) } return unix.SchedSetaffinity(pid, &mask) } func main() { runtime.LockOSThread() defer runtime.UnlockOSThread() if err := setAffinity([]int{2, 3}); err != nil { log.Fatalf("CPU affinity failed: %v", err) } // perform critical work with confirmed benefit } Without dedicated benchmarking and validation, these techniques may degrade performance, starve other processes, or introduce subtle latency regressions. Treat thread pinning and CPU affinity as highly specialized tools—effective only after meticulous measurement confirms their benefit. Tuning Go at the scheduler level can unlock significant performance gains, but it demands an intimate understanding of P’s, M’s, and G’s. Blindly upping GOMAXPROCS or pinning threads without measurement can backfire. the advice is to treat these knobs as surgical tools: use GODEBUG traces to diagnose, isolate subsystems where affinity or pinning makes sense, and always validate with benchmarks and profiles. Go’s runtime is ever‑evolving. Upcoming work in preemptive scheduling and user‑level interrupts promises to reduce tail latency further and improve fairness. Until then, these low‑level levers remain some of the most powerful ways to squeeze every drop of performance from developer's Go services.

epoll

Pattern 2: Comparing TCP, HTTP/2, and gRPC Performance in Go

In distributed systems, the choice of communication protocol shapes how services interact under real-world load. It influences not just raw throughput and latency, but also how well the system scales, how much CPU and memory it consumes, and how predictable its behavior remains under pressure. In this article, we dissect three prominent options—raw TCP with custom framing, HTTP/2 via Go's built-in net/http package, and gRPC—and explore their performance characteristics through detailed benchmarks and practical scenarios.

Raw TCP with Custom Framing¶

Raw TCP provides maximum flexibility with virtually no protocol overhead, but that comes at a cost: all message boundaries, framing logic, and error handling must be implemented manually. Since TCP delivers a continuous byte stream with no inherent notion of messages, applications must explicitly define how to separate and interpret those bytes.

Custom Framing Protocol¶

A common way to handle message boundaries over TCP is to use length-prefix framing: each message starts with a 4-byte header that tells the receiver how many bytes to read next. The length is encoded in big-endian format, following the standard network byte order, so it behaves consistently across different systems. This setup solves a core issue with TCP—while it guarantees reliable delivery, it doesn’t preserve message boundaries.
Without knowing the size upfront, the receiver has no way to tell where one message ends and the next begins. TCP guarantees reliable, in-order delivery of bytes, but it does not preserve or indicate message boundaries. For example, if a client sends three logical messages: [msg1][msg2][msg3] the server may receive them as a continuous byte stream with arbitrary segmentations, such as: [msg1_part][msg2][msg3_part] TCP delivers a continuous stream of bytes with no built-in concept of where one message stops and another starts. This means the receiver can’t rely on read boundaries to infer message boundaries—what arrives might be a partial message, multiple messages concatenated, or an arbitrary slice of both. To make sense of structured data over such a stream, the application needs a framing strategy. Length-prefixing does this by including the size of the message up front, so the receiver knows exactly how many bytes to expect before starting to parse the payload. Protocol Structure¶ While length-prefixing is the most common and efficient framing strategy, there are other options depending on the use case. Other framing strategies exist, each with its own trade-offs in terms of simplicity, robustness, and flexibility. Delimiter-based framing uses a specific byte or sequence—like \n or \0 to signal the end of a message. It’s easy to implement but fragile if the delimiter can appear in the payload. Fixed-size framing avoids ambiguity by making every message the same length, which simplifies parsing and memory allocation but doesn’t work well when message sizes vary. Self-describing formats like Protobuf or ASN.1 embed length and type information inside the payload itself, allowing for richer structure and evolution over time, but require more sophisticated parsing logic and schema awareness on both ends. Choosing the right approach depends on how much control you need, how predictable your data is, and how much complexity you’re willing to absorb. Each frame of length-prefixing implementation consists of: | Length (4 bytes) | Payload (Length bytes) | Length: A 4-byte unsigned integer encoded in big-endian format (network byte order), representing the number of bytes in the payload. Payload: Raw binary data of arbitrary length. The use of binary.BigEndian.PutUint32 ensures the frame length is encoded in a standardized format—most significant byte first. This is consistent with Internet protocol standards (1) , allowing for predictable decoding and reliable interoperation between heterogeneous systems. Following the established convention of network byte order, which is defined as big-endian in RFC 791, Section 3.1 and used consistently in transport and application protocols such as TCP (RFC 793). func writeFrame(conn net.Conn, payload []byte) error { frameLen := uint32(len(payload)) buf := make([]byte, 4+len(payload)) binary.BigEndian.PutUint32(buf[:4], frameLen) copy(buf[4:], payload) _, err := conn.Write(buf) return err } func readFrame(conn net.Conn) ([]byte, error) { lenBuf := make([]byte, 4) if _, err := io.ReadFull(conn, lenBuf); err != nil { return nil, err } frameLen := binary.BigEndian.Uint32(lenBuf) payload := make([]byte, frameLen) if _, err := io.ReadFull(conn, payload); err != nil { return nil, err } return payload, nil } This approach is straightforward, performant, and predictable, yet it provides no built-in concurrency management, request multiplexing, or flow control—these must be explicitly managed by the developer. 
Disadvantages¶ While the protocol is efficient and minimal, it lacks several features commonly found in more complex transport protocols. The lack of built-in framing features in raw TCP means that key responsibilities shift entirely to the application layer. There’s no support for multiplexing, so only one logical message can be in flight per connection unless additional coordination is built manually—pushing clients to open multiple connections to achieve parallelism. Flow control is also absent; unlike HTTP/2 or gRPC, there’s no way to signal backpressure, making it easy for a fast sender to overwhelm a slow receiver, potentially exhausting memory or triggering a crash. There’s no space for structured metadata like message types, compression flags, or trace context unless you embed them yourself into the payload format. And error handling is purely ad hoc—there’s no protocol-level mechanism for communicating faults, so malformed frames or incorrect lengths often lead to abrupt connection resets or inconsistent state. These limitations might be manageable in tightly scoped, high-performance systems where both ends of the connection are under full control and the protocol behavior is well understood. In such environments, the minimal overhead and direct access to the wire can justify the trade-offs. But in broader production contexts—especially those involving multiple teams, evolving requirements, or untrusted clients—they introduce significant risk. Without strict validation, clear framing, and robust error handling, even small inconsistencies can lead to silent corruption, resource leaks, or hard-to-diagnose failures. Building on raw TCP demands both precise engineering and long-term maintenance discipline. Performance Insights¶ Latency: Lowest achievable due to minimal overhead; ideal for latency-critical scenarios like financial trading systems. Throughput: Excellent, constrained only by network and application-layer handling. CPU/Memory Cost: Very low, with negligible overhead from protocol management. HTTP/2 via net/http¶ HTTP/2 brought several protocol-level improvements over HTTP/1.1, including multiplexed streams over a single connection, header compression via HPACK, and support for server push. In Go, these features are integrated directly into the net/http standard library, which handles connection reuse, stream multiplexing, and concurrency without requiring manual intervention. Unlike raw TCP, where applications must explicitly define message boundaries, HTTP/2 defines them at the protocol level: each request and response is framed using structured HEADERS and DATA frames and explicitly closed with an END_STREAM flag. These frames are handled entirely within Go’s HTTP/2 implementation, so developers interact with complete, logically isolated messages using the standard http.Request and http.ResponseWriter interfaces. You don’t have to deal with byte streams or worry about where a message starts or ends—by the time a request hits your handler, it’s already been framed and parsed. When you write a response, the runtime takes care of wrapping it up and signaling completion. That frees you up to focus on the logic, not the plumbing, while still getting the performance benefits of HTTP/2 like multiplexing and connection reuse. Server Implementation¶ Beyond framing and multiplexing, HTTP/2 brings a handful of practical advantages that make server code easier to write and faster to run. 
It handles connection reuse out of the box, applies flow control to avoid overwhelming either side, and compresses headers using HPACK to cut down on overhead. Go’s net/http stack takes care of all of this behind the scenes, so you get the benefits without needing to wire it up yourself. As a result, developers can build concurrent, efficient servers without managing low-level connection or stream state manually. func handler(w http.ResponseWriter, r *http.Request) { payload, err := io.ReadAll(r.Body) if err != nil { http.Error(w, "invalid request", http.StatusBadRequest) return } defer r.Body.Close() // Process payload... w.WriteHeader(http.StatusOK) w.Write([]byte("processed")) } func main() { server := &http.Server{ Addr: ":8080", Handler: http.HandlerFunc(handler), } log.Fatal(server.ListenAndServeTLS("server.crt", "server.key")) } Info Even this is not mentioned explisitly, this code serves HTTP/2 because it uses ListenAndServeTLS, which enables TLS-based communication. Go's net/http package automatically upgrades connections to HTTP/2 when a client supports it via ALPN (Application-Layer Protocol Negotiation) during the TLS handshake. Since Go 1.6, this upgrade is implicit—no extra configuration is required. The server transparently handles HTTP/2 requests while remaining compatible with HTTP/1.1 clients. HTTP/2’s multiplexing capability allows multiple independent streams to share a single TCP connection without blocking each other, which significantly improves connection reuse. This reduces the overhead of establishing and managing parallel connections, especially under high concurrency. As a result, latency is lower and throughput more consistent, even when multiple requests are in flight. These traits make HTTP/2 well-suited for general-purpose web services and internal APIs—places where predictable latency, efficient connection reuse, and solid concurrency handling carry more weight than raw protocol minimalism. Performance Insights¶ Latency: Slightly higher than raw TCP because of framing and compression overhead, but stable and consistent thanks to multiplexing and persistent connections. Throughput: High under concurrent load; stream multiplexing and header compression help sustain performance without opening more sockets. CPU/Memory Cost: Moderate overhead, mostly due to header processing, TLS encryption, and flow control mechanisms. gRPC¶ gRPC is a high-performance, contract-first RPC framework built on top of HTTP/2, designed for low-latency, cross-language communication between services. It combines streaming-capable transport with strongly typed APIs defined using Protocol Buffers (Protobuf), enabling compact, efficient message serialization and seamless interoperability across platforms. Unlike traditional HTTP APIs, where endpoints are loosely defined by URL patterns and free-form JSON, gRPC enforces strict interface contracts through .proto definitions, which serve as both schema and implementation spec. The gRPC toolchain generates client and server code for multiple languages, eliminating manual serialization, improving safety, and standardizing interactions across heterogeneous systems. gRPC takes advantage of HTTP/2’s core features—stream multiplexing, flow control, and binary framing—to support both one-off RPC calls and full-duplex streaming, all with built-in backpressure. But it goes further than just transport. 
It bakes in support for deadlines, cancellation, structured metadata, and standardized error reporting, all of which help services communicate clearly and fail predictably. This makes gRPC a solid choice for internal APIs, service meshes, and performance-critical systems where you need efficiency, strong contracts, and reliable behavior under load. gRPC Service Definition¶ A minimal .proto file example: syntax = "proto3"; service EchoService { rpc Echo(EchoRequest) returns (EchoResponse); } message EchoRequest { string message = 1; } message EchoResponse { string message = 1; } Generated Go stubs allow developers to easily implement the service: type server struct { UnimplementedEchoServiceServer } func (s *server) Echo(ctx context.Context, req *EchoRequest) (*EchoResponse, error) { return &EchoResponse{Message: req.Message}, nil } func main() { lis, err := net.Listen("tcp", ":50051") if err != nil { log.Fatalf("failed to listen: %v", err) } grpcServer := grpc.NewServer() RegisterEchoServiceServer(grpcServer, &server{}) grpcServer.Serve(lis) } Performance Insights¶ Latency: Slightly higher than raw HTTP/2 due to additional serialization/deserialization steps, yet still performant for most scenarios. Throughput: High throughput thanks to efficient payload serialization (protobuf) and inherent HTTP/2 multiplexing capabilities. CPU/Memory Cost: Higher than HTTP/2 due to protobuf encoding overhead; memory consumption slightly increased due to temporary object allocations. Choosing the Right Protocol¶ Internal APIs and microservices: gRPC usually hits the sweet spot—it’s fast, strongly typed, and easy to work with once the tooling is in place. Low-latency systems and trading platforms: Raw TCP with custom framing gives you the lowest overhead, but you’re on your own for everything else. Public APIs or general web services: HTTP/2 via net/http is a solid choice. You get connection reuse, multiplexing, and good performance without needing to pull in a full RPC stack. Raw TCP gives you maximum control and the best performance on paper—but it also means building everything yourself: framing, flow control, error handling. HTTP/2 and gRPC trade some of that raw speed for built-in structure, better connection handling, and less code to maintain. What’s right depends entirely on where performance matters and how much complexity you want to own.

net/http

Pattern 3: A minimal .proto file example:

.proto

Pattern 4: Example: Optimizing buffer reuse using sync.Pool greatly reduces GC pressure during high-volume network operations.

sync.Pool
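
A minimal sketch of that idea (illustrative only; the 32 KB buffer size and the function names are assumptions, not taken from the guide):

package main

import (
    "net"
    "sync"
)

// bufPool hands out reusable read buffers so each connection does not
// allocate a fresh one and leave it for the GC to collect.
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 32*1024) },
}

func handle(conn net.Conn) {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf)

    n, err := conn.Read(buf)
    if err != nil {
        return
    }
    process(buf[:n])
}

func process(p []byte) {
    // application-specific handling of the bytes read
    _ = p
}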

Pattern 5: Here’s the safe pattern:

io.Copy(io.Discard, resp.Body)
defer resp.Body.Close()
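
A fuller sketch of where that pattern sits (the URL and function name are placeholders): draining the body before closing lets the keep-alive transport return the underlying connection to its pool for reuse.

package main

import (
    "io"
    "net/http"
)

func drainAndClose(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Read the body to EOF so the connection can go back to the pool.
    if _, err := io.Copy(io.Discard, resp.Body); err != nil {
        return err
    }
    return nil
}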

Pattern 6: Practical Example: Profiling Networked Go Applications with pprof

This section walks through a demo application instrumented with benchmarking tools and runtime profiling to ground profiling concepts in a real-world context. It covers identifying performance bottlenecks, interpreting flame graphs, and analyzing system behavior under various simulated network conditions.

CPU Profiling in Networked Apps¶

The demo application is intentionally designed to be as simple as possible to highlight key profiling concepts without unnecessary complexity. While the code and patterns used in the demo are basic, the profiling insights gained here are highly applicable to more complex, production-grade applications. To enable continuous profiling under load, we expose pprof via a dedicated HTTP endpoint: import ( _ "net/http/pprof" ) // ... // Start pprof in a separate goroutine. 
go func() { log.Println("pprof listening on :6060") if err := http.ListenAndServe("localhost:6060", nil); err != nil { log.Fatalf("pprof server error: %v", err) } }() full net-app's source code package main // pprof-start import ( // pprof-end "flag" "fmt" "log" "math/rand/v2" "net/http" // pprof-start _ "net/http/pprof" // pprof-end "os" "os/signal" "time" // pprof-start ) // pprof-end var ( fastDelay = flag.Duration("fast-delay", 0, "Fixed delay for fast handler (if any)") slowMin = flag.Duration("slow-min", 1time.Millisecond, "Minimum delay for slow handler") slowMax = flag.Duration("slow-max", 300time.Millisecond, "Maximum delay for slow handler") gcMinAlloc = flag.Int("gc-min-alloc", 50, "Minimum number of allocations in GC heavy handler") gcMaxAlloc = flag.Int("gc-max-alloc", 1000, "Maximum number of allocations in GC heavy handler") ) func randRange(min, max int) int { return rand.IntN(max-min) + min } func fastHandler(w http.ResponseWriter, r *http.Request) { if *fastDelay > 0 { time.Sleep(*fastDelay) } fmt.Fprintln(w, "fast response") } func slowHandler(w http.ResponseWriter, r *http.Request) { delayRange := int((*slowMax - *slowMin) / time.Millisecond) delay := time.Duration(randRange(1, delayRange)) * time.Millisecond time.Sleep(delay) fmt.Fprintf(w, "slow response with delay %d ms\n", delay.Milliseconds()) } // heavy-start var longLivedData [][]byte func gcHeavyHandler(w http.ResponseWriter, r *http.Request) { numAllocs := randRange(*gcMinAlloc, gcMaxAlloc) var data [][]byte for i := 0; i < numAllocs; i++ { // Allocate 10KB slices. Occasionally retain a reference to simulate long-lived objects. b := make([]byte, 102410) data = append(data, b) if i%100 == 0 { // every 100 allocations, keep the data alive longLivedData = append(longLivedData, b) } } fmt.Fprintf(w, "allocated %d KB\n", len(data)*10) } // heavy-end func main() { flag.Parse() http.HandleFunc("/fast", fastHandler) http.HandleFunc("/slow", slowHandler) http.HandleFunc("/gc", gcHeavyHandler) // pprof-start // ... // Start pprof in a separate goroutine. go func() { log.Println("pprof listening on :6060") if err := http.ListenAndServe("localhost:6060", nil); err != nil { log.Fatalf("pprof server error: %v", err) } }() // pprof-end // Create a server to allow for graceful shutdown. server := &http.Server{Addr: ":8080"} go func() { log.Println("HTTP server listening on :8080") if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed { log.Fatalf("HTTP server error: %v", err) } }() // Graceful shutdown on interrupt signal. sigCh := make(chan os.Signal, 1) signal.Notify(sigCh, os.Interrupt) <-sigCh log.Println("Shutting down server...") if err := server.Shutdown(nil); err != nil { log.Fatalf("Server Shutdown Failed:%+v", err) } log.Println("Server exited") } The next step will be to establish a connection with the profiled app and collect samples: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 View results interactively: go tool pprof -http=:7070 cpu.prof # (1) the actual cpu.prof path will be something like $HOME/pprof/pprof.net-app.samples.cpu.004.pb.gz or you can save the profiling graph as an svg image. CPU Profiling Walkthrough: Load on the /gc Endpoint¶ We profiled the application during a 30-second load test targeting the /gc endpoint to see what happens under memory pressure. This handler was intentionally designed to trigger allocations and force garbage collection, which makes it a great candidate for observing runtime behavior under stress. 
We used Go’s built-in profiler to capture a CPU trace: CPU profiling trace for the /gc endpoint This gave us 3.02 seconds of sampled CPU activity out of 30 seconds of wall-clock time—a useful window into what the runtime and application were doing under pressure. Where the Time Went¶ HTTP Stack Dominates the Surface¶ As expected, the majority of CPU time was spent on request handling: http.(*conn).serve accounted for nearly 58% of sampled time http.serverHandler.ServeHTTP appeared prominently as well This aligns with the fact that we were sustaining constant traffic. The Go HTTP stack is doing the bulk of the work, managing connections and dispatching requests. Garbage Collection Overhead is Clearly Visible¶ A large portion of CPU time was spent inside the garbage collector: runtime.gcDrain, runtime.scanobject, and runtime.gcBgMarkWorker were all active Combined with memory-related functions like runtime.mallocgc, these accounted for roughly 20% of total CPU time This confirms that gcHeavyHandler is achieving its goal. What we care about is whether this kind of allocation pressure leaks into real-world handlers. If it does, we’re paying for it in latency and CPU churn. I/O and Syscalls Take a Big Slice¶ We also saw high syscall activity—especially from: syscall.syscall (linked to poll, Read, and Write) bufio.Writer.Flush and http.response.finishRequest These functions reflect the cost of writing responses back to clients. For simple handlers, this is expected. But if your handler logic is lightweight and most of the time is spent just flushing data over TCP, it’s worth asking whether the payloads or buffer strategies could be optimized. Scheduler Activity Is Non-Trivial¶ Functions like runtime.schedule, mcall, and findRunnable were also on the board. These are Go runtime internals responsible for managing goroutines. Seeing them isn’t unusual during high-concurrency tests—but if they dominate, it often points to excessive goroutine churn or blocking behavior. Memory Profiling: Retained Heap from the /gc Endpoint¶ We also captured a memory profile to complement the CPU view while hammering the /gc endpoint. This profile used the inuse_space metric, which shows how much heap memory is actively retained by each function at the time of capture. We triggered the profile with: go tool pprof -http=:7070 http://localhost:6060/debug/pprof/heap Memory profiling for the /gc endpoint At the time of capture, the application retained 649MB of heap memory, and almost all of it—99.46%—was attributed to a single function: gcHeavyHandler. This was expected. The handler simulates allocation pressure by creating 10KB slices in a tight loop. Every 100th slice is added to a global variable to simulate long-lived memory. Here’s what the handler does: var longLivedData [][]byte func gcHeavyHandler(w http.ResponseWriter, r *http.Request) { numAllocs := randRange(*gcMinAlloc, gcMaxAlloc) var data [][]byte for i := 0; i < numAllocs; i++ { // Allocate 10KB slices. Occasionally retain a reference to simulate long-lived objects. b := make([]byte, 102410) data = append(data, b) if i%100 == 0 { // every 100 allocations, keep the data alive longLivedData = append(longLivedData, b) } } fmt.Fprintf(w, "allocated %d KB\n", len(data)*10) } The flamegraph confirmed what we expected: gcHeavyHandler accounted for nearly all memory in use. The path traced cleanly from the HTTP connection, through the Go router stack, into the handler logic. 
No significant allocations came from elsewhere—this was a focused, controlled memory pressure scenario. This type of profile is valuable because it reveals what is still being held in memory, not just what was allocated. This view is often the most revealing for diagnosing leaks, retained buffers, or forgotten references. Summary: CPU and Memory Profiling of the /gc Endpoint¶ The /gc endpoint was intentionally built to simulate high allocation pressure and GC activity. Profiling this handler under load gave us a clean, focused view of how the Go runtime behaves when pushed to its memory limits. From the CPU profile, we saw that: As expected, most of the time was spent in the HTTP handler path during sustained load. Nearly 20% of CPU samples were attributed to memory allocation and garbage collection. Syscall activity was high, mostly from writing responses. The Go scheduler was moderately active, managing the concurrent goroutines handling traffic. From the memory profile, we captured 649MB of live heap usage, with 99.46% of it retained by gcHeavyHandler. This matched our expectations: the handler deliberately retains every 100th 10KB allocation to simulate long-lived data. Together, these profiles give us confidence that the /gc endpoint behaves as intended under synthetic pressure: It creates meaningful CPU and memory load. It exposes the cost of sustained allocations and GC cycles. It provides a predictable environment for testing optimizations or GC tuning strategies.

pprof

Reference Files

This skill includes comprehensive documentation in references/:

  • benchmarking.md - Benchmarking documentation
  • connection_management.md - Connection Management documentation
  • dns_tuning.md - Dns Tuning documentation
  • networking_fundamentals.md - Networking Fundamentals documentation
  • other.md - Other documentation
  • tls_security.md - Tls Security documentation

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with the networking_fundamentals.md reference file for foundational concepts.

For Specific Features

Use the appropriate category reference file (benchmarking, connection_management, dns_tuning, tls_security, etc.) for detailed information.

For Code Examples

The quick reference section above contains common patterns extracted from the official docs.

Resources

references/

Organized documentation extracted from official sources. These files contain:

  • Detailed explanations
  • Code examples with language annotations
  • Links to original documentation
  • Table of contents for quick navigation

scripts/

Add helper scripts here for common automation tasks.

assets/

Add templates, boilerplate, or example projects here.

Notes

  • This skill was automatically generated from official documentation
  • Reference files preserve the structure and examples from source docs
  • Code examples include language detection for better syntax highlighting
  • Quick reference patterns are extracted from common usage examples in the docs

Updating

To refresh this skill with updated documentation:

  1. Re-run the scraper with the same configuration
  2. The skill will be rebuilt with the latest information