@ethanmclark1
Created March 13, 2026 17:20
Replace gRPC with Shared Memory for Local Zero-Copy Communication

Context

LuckyEngine and luckyrobots (Python) always run on the same machine — even in cloud (GCP/AWS), both are co-located on the same node. No remote transport needed. The goal is true zero-copy shared memory replacing gRPC+Protobuf for all communication.

Key insight: AgentBatch.StateBuffer is already a flat float[num_envs * obs_size] (AgentBatch.cs:74). By allocating this directly in shared memory, agents write observations straight into shared memory with zero copies end-to-end.

Architecture

┌─────────────────────────────────────────────────┐
│              LuckyEngine (C#/C++)                │
│                                                  │
│  AgentBatch.StateBuffer ──► Shared Memory        │
│  AgentBatch.ActionBuffer ◄── Shared Memory       │
│       (direct buffer mapping, no copies)         │
│                                                  │
│  Command Channel: FlatBuffers in shm ring buffer │
│  (reset, schema, scene queries, debug, etc.)     │
└──────────────────┬───────────────────────────────┘
           Shared Memory (MemoryMappedFile / mmap)
                   │
┌──────────────────┴───────────────────────────────┐
│              luckyrobots (Python)                  │
│                                                    │
│  obs = np.ndarray(buffer=shm, ...)  # zero-copy   │
│  actions_view[:] = policy_output    # direct write │
│                                                    │
│  Command channel: FlatBuffers for reset/schema     │
└────────────────────────────────────────────────────┘

Two-Tier Design

Tier 1 — Hot Data (raw arrays in shared memory, no serialization): Observations, actions, rewards, dones — exchanged every step at 50Hz+ for up to 4096 envs.

Tier 2 — Command Channel (FlatBuffers in shared memory ring buffer): Setup, reset, schema queries, image streaming, scene inspection, debug drawing. All 7 current gRPC services migrate here.


Phase 1: Shared Memory Layout + Lock-Step Protocol

1.1 Memory Layout

┌─────────────────────────── HEADER (4096 bytes) ─────────────────────────┐
│ magic: u32 = 0x4C524243 ("LRBC")                                       │
│ version: u32 = 1                                                        │
│ server_pid: u32           # Engine process ID                           │
│ client_pid: u32           # Python process ID (set by client on attach) │
│                                                                         │
│ mode: u8                  # 0=Training (homogeneous), 1=Inference       │
│ num_envs: u32             # Number of environments                      │
│ obs_size: u32             # Observation floats per env                  │
│ act_size: u32             # Action floats per env                       │
│                                                                         │
│ frame_seq: u64 (atomic)   # Engine increments after writing all obs     │
│ action_seq: u64 (atomic)  # Client increments after writing all actions │
│                                                                         │
│ obs_offset: u64           # Byte offset to observations region          │
│ act_offset: u64           # Byte offset to actions region               │
│ rewards_offset: u64       # Byte offset to rewards                      │
│ dones_offset: u64         # Byte offset to dones                        │
│ resets_offset: u64        # Byte offset to reset flags                  │
│ cmd_s2c_offset: u64       # Server→Client command ring buffer offset    │
│ cmd_c2s_offset: u64       # Client→Server command ring buffer offset    │
│ cmd_ring_size: u32        # Size of each command ring buffer             │
│ image_offset: u64         # Byte offset to image region (0 if none)     │
│ image_width: u32                                                        │
│ image_height: u32                                                       │
│ image_channels: u32                                                     │
│                                                                         │
│ [GPU extensions - for future GPU MuJoCo]                                │
│ gpu_available: u8         # 1 if GPU buffers exported                   │
│ gpu_obs_mode: u8          # 0=CPU, 1=CUDA_IPC, 2=VULKAN_EXTERNAL       │
│ gpu_obs_handle: byte[64]  # cudaIpcMemHandle_t                          │
│ gpu_device_id: u32        # CUDA device ordinal                         │
│                                                                         │
│ reserved[...] to 4096 bytes                                             │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── HOT DATA REGION ────────────────────────────────┐
│ Observations: float[num_envs * obs_size]    # Engine writes            │
│ Actions:      float[num_envs * act_size]    # Client writes            │
│ Rewards:      float[num_envs]               # Engine writes            │
│ Dones:        u8[num_envs]                  # Engine writes            │
│ Truncateds:   u8[num_envs]                  # Engine writes            │
│ Reset flags:  u8[num_envs]                  # Client writes (1=reset) │
└────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── COMMAND RINGS (512 KB each) ────────────────────┐
│ Server→Client ring: [write_pos: u32][read_pos: u32][data...]           │
│ Client→Server ring: [write_pos: u32][read_pos: u32][data...]           │
│ Each entry: [length: u32][FlatBuffer Envelope bytes][pad to 8-align]   │
└────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── IMAGE REGION (optional) ────────────────────────┐
│ byte[num_envs * width * height * channels]  # Engine writes            │
│ (allocated only when images are configured)                            │
└────────────────────────────────────────────────────────────────────────┘

Size examples:

  • 4096 envs, obs=100, act=12: ~1.9 MB hot data + 1 MB commands ≈ 3 MB total
  • 64 envs with 64×64 RGB images: ~0.1 MB hot data + 0.8 MB images ≈ 2 MB
  • 16 envs with 256×256 RGB: ~0.05 MB hot data + 3 MB images ≈ 4 MB
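The size examples can be reproduced with a short sketch (the obs/act sizes for the image configurations are assumed here, since the text only gives env counts and image dimensions):

```python
# Sanity-check the size examples above. Per the layout: float32 observations,
# actions, and rewards; one byte each for dones, truncateds, and reset flags.
HEADER = 4096
CMD_RING = 512 * 1024                       # per direction, two rings total

def region_sizes(num_envs, obs_size, act_size, img_wh=None, channels=3):
    hot = 4 * num_envs * (obs_size + act_size + 1)   # obs + actions + rewards
    hot += 3 * num_envs                              # dones, truncateds, resets
    img = num_envs * img_wh[0] * img_wh[1] * channels if img_wh else 0
    return hot, img, HEADER + hot + 2 * CMD_RING + img

hot, img, total = region_sizes(4096, 100, 12)
print(f"4096 envs: hot {hot/1e6:.2f} MB, total {total/1e6:.2f} MB")

hot, img, total = region_sizes(64, 100, 12, img_wh=(64, 64))
print(f"64 envs + 64x64 RGB: images {img/1e6:.2f} MB, total {total/1e6:.2f} MB")
```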

1.2 Lock-Step Protocol

Python                              Engine
  │                                   │
  │  1. Write actions into shm        │
  │     actions_view[:] = output      │
  │  2. Set reset_flags if needed     │
  │  3. Increment action_seq (atomic) │
  │                                   │
  │  ──── spin on frame_seq ────►     │
  │                                   │  4. Spin on action_seq
  │                                   │  5. Read actions (already in shm)
  │                                   │  6. Process reset_flags
  │                                   │  7. Step ALL envs (MuJoCo)
  │                                   │  8. Agents write obs to StateBuffer
  │                                   │     (which IS the shm obs region)
  │                                   │  9. Write rewards/dones
  │                                   │  10. Increment frame_seq (atomic)
  │  ◄──── frame_seq changed ────     │
  │                                   │
  │  11. obs numpy view ready (0-copy)│
  │  12. Read rewards, dones          │
  │  13. Compute policy               │
  │  Go to 1                          │

Sync primitives:

  • C# engine: Interlocked.Read/Exchange on memory-mapped region
  • Python: ctypes.c_uint64.from_buffer(mm, offset) for atomic-width reads
  • Spin-wait: Thread.SpinWait(1) on C#, tight loop on Python (optionally time.sleep(0) to yield)
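The handshake above can be simulated in-process; a minimal sketch with a thread per side and an anonymous map standing in for the shared region (offsets are illustrative, the real counters live in the 4 KB header):

```python
import mmap
import struct
import threading

# Engine and client run as threads; each side spins on the other's sequence
# counter exactly as in the diagram above.
FRAME_SEQ, ACTION_SEQ = 0, 8
shm = mmap.mmap(-1, 4096)

def read_u64(off):
    return struct.unpack_from("<Q", shm, off)[0]

def write_u64(off, val):
    struct.pack_into("<Q", shm, off, val)

def engine(steps):
    for step in range(1, steps + 1):
        while read_u64(ACTION_SEQ) < step:  # 4. spin until actions arrive
            pass
        # 5-9: read actions, step all envs, write obs/rewards/dones ...
        write_u64(FRAME_SEQ, step)          # 10. publish the new frame

def client(steps):
    for step in range(1, steps + 1):
        # 1-2: write actions and reset flags ...
        write_u64(ACTION_SEQ, step)         # 3. signal actions ready
        while read_u64(FRAME_SEQ) < step:   # spin until engine finishes
            pass
        # 11-13: consume obs/rewards/dones, compute next policy ...

t = threading.Thread(target=engine, args=(20,))
t.start()
client(20)
t.join()
assert read_u64(FRAME_SEQ) == read_u64(ACTION_SEQ) == 20
```

In-process, the GIL stands in for the memory barriers; across processes the real implementation relies on Interlocked/aligned-load atomics as described above.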

1.3 AgentBatch Buffer Mapping

File: Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs (line 74)

// In Setup():
if (SharedMemoryTransport.IsActive)
{
    StateBuffer = SharedMemoryTransport.GetObservationBuffer(m_Count, StateSize);
    ActionBuffer = SharedMemoryTransport.GetActionBuffer(m_Count, ActionSize);
    CtrlBuffer = new float[m_Count * ActionSize]; // Internal only, stays on heap
}
else
{
    StateBuffer = new float[m_Count * StateSize];
    ActionBuffer = new float[m_Count * ActionSize];
    CtrlBuffer = new float[m_Count * ActionSize];
}

Agents already bind to StateBuffer via offsets (line 87). No agent code changes.


Phase 2: FlatBuffers Schema (Command Channel)

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Schema/hazel_rpc.fbs

Defines all command messages. Same schema generates C# and Python code.

Key design:

  • Envelope root: union Payload + correlation_id + MethodId + error_message
  • struct for Vec3, Quat (inline, zero-copy)
  • [float] vectors for variable-size data

Message types (mapping from current proto):

  • Agent: GetAgentSchemaReq/Resp, ResetAgentReq/Resp, SimulationContract
  • MuJoCo: GetJointStateReq/Resp, SendControlReq/Resp, GetMujocoInfoReq/Resp
  • Telemetry: GetSchemaReq/Resp, StreamReq, TelemetryFrame
  • Scene: GetInfoReq/Resp, ListEntitiesReq/Resp, Get/SetEntity, SimMode
  • Camera: ListReq/Resp, StreamReq, ImageFrame
  • Viewport: GetInfoReq/Resp, StreamReq, ImageFrame
  • Debug: DrawReq/Resp

Note: StepRequest/StepResponse NOT needed in FlatBuffers — the hot path uses raw shared memory arrays directly.
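A minimal sketch of what the Envelope root might look like in FlatBuffers IDL (enum values, field names, and the payload members shown are illustrative, not the actual schema):

```
// hazel_rpc.fbs (illustrative fragment)
namespace Hazel.Rpc;

enum MethodId : uint16 { GetAgentSchema, ResetAgent, GetJointState, DebugDraw }

struct Vec3 { x: float; y: float; z: float; }   // inline, zero-copy

table ResetAgentReq  { env_ids: [int]; }
table ResetAgentResp { ok: bool; }

union Payload { ResetAgentReq, ResetAgentResp }

table Envelope {
  correlation_id: uint64;
  method: MethodId;
  error_message: string;
  payload: Payload;
}

root_type Envelope;
```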

Code generation:

flatc --csharp -o Generated/ Schema/hazel_rpc.fbs
flatc --python -o luckyrobots/generated/ Schema/hazel_rpc.fbs

Dependencies:

  • Add: Google.FlatBuffers NuGet
  • Eventually remove: Grpc.Core, Grpc.Tools, Google.Protobuf

Phase 3: Shared Memory Transport Implementation

3.1 Transport Core

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/SharedMemoryTransport.cs

public sealed class SharedMemoryTransport : IDisposable
{
    private MemoryMappedFile _mmf;
    private unsafe byte* _basePtr;
    private Thread _commandReaderThread;

    // Called during engine startup, creates the shared memory region
    public void Create(SharedMemoryConfig config);

    // Hot data: returns managed float[] backed by shared memory
    public float[] GetObservationBuffer(int numEnvs, int obsSize);
    public float[] GetActionBuffer(int numEnvs, int actSize);

    // Atomic signals
    public void IncrementFrameSeq();
    public long ReadActionSeq();
    public void SpinWaitForAction(long expectedSeq);

    // Rewards/dones/resets (engine writes, client reads/writes)
    public Span<float> GetRewardsSpan();
    public Span<byte> GetDonesSpan();
    public Span<byte> GetResetsSpan();

    // Command channel
    public void SendCommand(ReadOnlySpan<byte> flatBufferEnvelope);
    public bool TryReceiveCommand(out ArraySegment<byte> envelope);

    // Lifecycle
    public static bool IsActive { get; }
    public void Dispose();
}

3.2 Ring Buffer

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/RingBuffer.cs

Lock-free SPSC (single-producer, single-consumer) ring buffer for the command channel:

  • Write: append [length: u32][payload][padding], advance write_pos atomically
  • Read: read from read_pos, advance read_pos atomically
  • Wrap-around handling
  • Full detection: writer blocks if ring is full
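The entry framing and wrap-around handling can be sketched in Python (illustrative: a real implementation adds atomic position updates and the full-ring check noted above):

```python
import struct

# Single-producer/single-consumer ring over a flat bytearray, mirroring the
# layout above: [write_pos: u32][read_pos: u32][data...]. Each entry is
# [length: u32][payload][pad to 8-byte alignment].
class SpscRing:
    HDR = 8  # write_pos + read_pos

    def __init__(self, size):
        self.buf = bytearray(self.HDR + size)
        self.size = size

    def _pos(self, off):
        return struct.unpack_from("<I", self.buf, off)[0]

    def _set(self, off, val):
        struct.pack_into("<I", self.buf, off, val)

    def write(self, payload: bytes):
        entry = struct.pack("<I", len(payload)) + payload
        entry += b"\x00" * (-len(entry) % 8)            # pad to 8-align
        w = self._pos(0)
        for i, b in enumerate(entry):                   # handle wrap-around
            self.buf[self.HDR + (w + i) % self.size] = b
        self._set(0, (w + len(entry)) % self.size)      # publish write_pos last

    def read(self):
        r, w = self._pos(4), self._pos(0)
        if r == w:
            return None                                 # ring is empty
        raw = bytes(self.buf[self.HDR + (r + i) % self.size] for i in range(4))
        (length,) = struct.unpack("<I", raw)
        data = bytes(self.buf[self.HDR + (r + 4 + i) % self.size]
                     for i in range(length))
        advance = 4 + length + (-(4 + length) % 8)      # skip entry + padding
        self._set(4, (r + advance) % self.size)
        return data

ring = SpscRing(256)
ring.write(b"hello")
ring.write(b"flatbuffer bytes")
assert ring.read() == b"hello" and ring.read() == b"flatbuffer bytes"
assert ring.read() is None
```

Publishing `write_pos` only after the payload bytes land is what makes the SPSC pattern safe without locks: the single reader never observes a position that covers unwritten data.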

3.3 Managed Array over Shared Memory

C# float[] is a managed object with a header — you can't directly alias it over shared memory. Two approaches:

Option A (recommended): heap-backed buffers + one memcpy per step

  • Keep StateBuffer as float[] on heap, but after each step, do a single Buffer.MemoryCopy from StateBuffer to shm observation region
  • Cost: one memcpy of ~1.6 MB at 50Hz ≈ 80 MB/s — negligible
  • Benefit: no unsafe GC pinning, no agent code changes, clean separation

Option B: Pin managed array to shared memory (advanced)

  • Use GCHandle.Alloc + memory-mapped overlay
  • Fragile, requires careful GC interaction
  • Not recommended for initial implementation

Decision: Option A — one memcpy after CollectState() is practically free and keeps the code clean. The "zero-copy" win is on the Python side (reading without deserialization).

Update to AgentBatch.CollectState():

public void CollectState(RobotManager robotManager)
{
    for (int i = 0; i < m_Count; i++)
        m_Agents[i].OnState();  // Writes into StateBuffer

    if (SharedMemoryTransport.IsActive)
    {
        SharedMemoryTransport.Instance.WriteObservations(StateBuffer);
        SharedMemoryTransport.Instance.WriteRewards(/* computed rewards */);
        SharedMemoryTransport.Instance.WriteDones(/* computed dones */);
        SharedMemoryTransport.Instance.IncrementFrameSeq();
    }

    ExternalAgentRegistry.PublishState(this);
}

Similarly for actions, in GetActions():

if (SharedMemoryTransport.IsActive)
{
    SharedMemoryTransport.Instance.ReadActions(ActionBuffer);
    // Process reset flags
}

3.4 Cross-Platform Shared Memory

C# (Engine side) — MemoryMappedFile everywhere, with one caveat: named maps are Windows-only in .NET.

// On Windows, CreateNew with a map name creates a named paging-file mapping.
// On Linux/macOS, named maps throw PlatformNotSupportedException, so back the
// region with a file (e.g. under /dev/shm) via MemoryMappedFile.CreateFromFile.
var mmf = MemoryMappedFile.CreateNew($"LuckyEngine_{pid}", totalSize, MemoryMappedFileAccess.ReadWrite);
// AcquirePointer for unsafe byte* access
// Interlocked.Read/Exchange for atomic sequence counters (correct on x86-64 and ARM64)

Python (Client side) — use multiprocessing.shared_memory for cross-platform:

from multiprocessing.shared_memory import SharedMemory
shm = SharedMemory(name=f"LuckyEngine_{pid}", create=False)
obs = np.ndarray((num_envs, obs_size), dtype=np.float32, buffer=shm.buf, offset=obs_offset)
              Windows                    Linux                macOS
C# API        MemoryMappedFile           Same                 Same
OS backend    CreateFileMapping          shm_open + mmap      shm_open + mmap
Python API    SharedMemory("name")       Same                 Same
Naming        Kernel object namespace    /dev/shm/            POSIX shm or temp file
Atomics       Interlocked.* (C#)         Same                 Same (ARM64: barriers handled by Interlocked)

ARM64 note (Apple Silicon, ARM cloud instances): Interlocked.* in C# emits the required dmb barriers on ARM64. On the Python side, read sequence counters as a single aligned 64-bit load (ctypes.c_uint64.from_buffer) rather than assembling the value from raw byte reads, which can tear under ARM's weaker memory ordering.

Shared memory name passed to Python via environment variable LUCKY_SHM_NAME or command-line arg.


Phase 4: Command Channel Dispatcher

4.1 Dispatcher

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcDispatcher.cs

Reads FlatBuffer Envelopes from the command ring buffer and routes by MethodId:

public class RpcDispatcher
{
    private readonly Dictionary<MethodId, Action<Envelope>> _handlers;

    public void Poll()  // Called each frame from main thread
    {
        while (_transport.TryReceiveCommand(out var data))
        {
            var buf = new ByteBuffer(data.Array, data.Offset);
            var envelope = Envelope.GetRootAsEnvelope(buf);
            if (_handlers.TryGetValue(envelope.Method, out var handler))
                handler(envelope);
        }
    }
}
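The Python client needs the mirror of this loop to route responses and server-pushed frames; a sketch with placeholder MethodId constants and a stub queue standing in for the shm ring (the real values and payload types come from the generated schema):

```python
# Minimal dispatch-by-method-id loop for the client side. METHOD_* values and
# the (method_id, payload) tuple shape are placeholders, not the real codegen.
METHOD_RESET_RESP, METHOD_TELEMETRY_FRAME = 1, 2

class CommandDispatcher:
    def __init__(self):
        self._handlers = {}

    def on(self, method_id, handler):
        self._handlers[method_id] = handler

    def poll(self, try_receive):
        """Drain the ring; try_receive() returns (method_id, payload) or None."""
        handled = 0
        while (msg := try_receive()) is not None:
            method_id, payload = msg
            if (handler := self._handlers.get(method_id)):
                handler(payload)
                handled += 1
        return handled

# Usage with a plain list standing in for the shm ring buffer:
inbox = [(METHOD_RESET_RESP, b"ok"), (METHOD_TELEMETRY_FRAME, b"frame0")]
seen = []
d = CommandDispatcher()
d.on(METHOD_RESET_RESP, seen.append)
d.on(METHOD_TELEMETRY_FRAME, seen.append)
assert d.poll(lambda: inbox.pop(0) if inbox else None) == 2
assert seen == [b"ok", b"frame0"]
```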

4.2 MainThreadDispatcher

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcMainThreadDispatcher.cs

Copy of GrpcMainThreadDispatcher.cs — same ConcurrentQueue + Pump() pattern. Reused because command channel reads happen on a polling thread but handlers need main thread access.


Phase 5: Service Handler Migration

Create Hazel-ScriptCore/Source/Hazel/Net/Rpc/Handlers/:

Handler                      Ported From                Notes
AgentServiceHandler.cs       AgentServiceImpl.cs        Hot path bypasses commands (raw shm); Reset/Schema via command channel
MujocoServiceHandler.cs      MujocoServiceImpl.cs       JointState queries, SendControl
TelemetryServiceHandler.cs   TelemetryServiceImpl.cs    Streaming telemetry frames
SceneServiceHandler.cs       SceneServiceImpl.cs        Entity queries, sim mode
CameraServiceHandler.cs      CameraServiceImpl.cs       Image streaming (writes to shm image region)
ViewportServiceHandler.cs    ViewportServiceImpl.cs     Viewport streaming
DebugServiceHandler.cs       DebugServiceImpl.cs        Debug draw queue

5.1 ExternalAgentRegistry Changes

Modify: Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs

  • PublishState(): In shm mode, becomes just SharedMemoryTransport.WriteObservations() + IncrementFrameSeq()
  • WaitAndApplyExternalActions(): In shm mode, reads action_seq, copies from shm action region to ActionBuffer
  • Decouple from Hazel.Rpc.SimulationContract (protobuf type) → use FlatBuffer type or plain C# record
  • Remove s_LatestState / s_LatestAction thread-safe snapshots (shared memory replaces this)

5.2 ExternalActionGate

File: Hazel/src/Hazel/Learn/ExternalActionGate.h/cpp — no changes needed

The shared memory transport's polling sets ActionReady when action_seq increments, same signal as before.


Phase 6: Host + Editor Panel

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcHost.cs

  • Replaces GrpcHost.cs. Creates SharedMemoryTransport + RpcDispatcher.
  • Start/Stop/IsRunning lifecycle.

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcConfig.cs

  • Shared memory name, ring buffer size, per-service enable flags, FPS limits

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Editor/EditorRpcApi.cs

  • Replaces EditorGrpcApi.cs — static Coral bridge methods

Rename: LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp

  • Shared memory name display, connection status, service toggles

Move: Hazel-ScriptCore/Source/Hazel/Net/Grpc/Capture/CaptureBridge.cs → Rpc/Capture/


Phase 7: Python Client (luckyrobots)

7.1 SharedMemoryClient

import mmap, ctypes, struct
import numpy as np

class SharedMemoryClient:
    def __init__(self, shm_name: str):
        # Windows path shown (named paging-file mapping via tagname); on
        # Linux/macOS, attach with multiprocessing.shared_memory.SharedMemory
        self._mm = mmap.mmap(-1, 0, tagname=shm_name, access=mmap.ACCESS_WRITE)
        self._parse_header()

        # Numpy VIEWS into shared memory — true zero-copy reads
        self.obs = np.ndarray(
            (self.num_envs, self.obs_size), dtype=np.float32,
            buffer=self._mm, offset=self._obs_offset)
        self.actions = np.ndarray(
            (self.num_envs, self.act_size), dtype=np.float32,
            buffer=self._mm, offset=self._act_offset)
        self.rewards = np.ndarray(
            (self.num_envs,), dtype=np.float32,
            buffer=self._mm, offset=self._rewards_offset)
        self.dones = np.ndarray(
            (self.num_envs,), dtype=np.uint8,
            buffer=self._mm, offset=self._dones_offset)
        self.resets = np.ndarray(
            (self.num_envs,), dtype=np.uint8,
            buffer=self._mm, offset=self._resets_offset)

    def step(self, actions: np.ndarray):
        """Zero-copy vectorized step."""
        self.actions[:] = actions                   # Direct write to shm
        self._increment_action_seq()
        self._spin_wait_frame_seq()
        return self.obs, self.rewards, self.dones   # Numpy views, no copy

    def reset(self, env_ids=None, contract=None):
        """Reset specific envs. Contract sent via command channel."""
        if env_ids is None:
            self.resets[:] = 1
        else:
            self.resets[env_ids] = 1
        if contract:
            self._send_command(build_reset_contract_fb(contract))

    def get_schema(self):
        """Query agent schema via command channel."""
        self._send_command(build_get_schema_fb())
        return self._recv_command_blocking()
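The view mechanics that step() relies on can be exercised without the engine; a self-contained sketch where this process plays both sides (offsets and sizes are illustrative, not the real header layout):

```python
import numpy as np
from multiprocessing.shared_memory import SharedMemory

# Fabricate a small region and build the same kind of views the client uses.
num_envs, obs_size, act_size = 4, 3, 2
obs_off = 4096
act_off = obs_off + 4 * num_envs * obs_size

shm = SharedMemory(create=True, size=8192)
obs = np.ndarray((num_envs, obs_size), dtype=np.float32,
                 buffer=shm.buf, offset=obs_off)
actions = np.ndarray((num_envs, act_size), dtype=np.float32,
                     buffer=shm.buf, offset=act_off)

# "Engine" publishes observations straight into the region...
obs[:] = np.arange(num_envs * obs_size, dtype=np.float32).reshape(obs.shape)
obs_sum = float(obs.sum())          # 0 + 1 + ... + 11 = 66.0

# ..."client" writes actions directly, no serialization in between.
actions[:] = 0.5
act_ok = float(actions[0, 0]) == 0.5

del obs, actions                    # release views before closing the segment
shm.close()
shm.unlink()
```

Note the cleanup order: numpy views hold buffer exports on the segment, so they must be dropped before close()/unlink().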

7.2 GPU Integration (Python side)

# CPU observations → GPU tensors for training
obs_tensor = torch.from_numpy(client.obs).pin_memory().cuda(non_blocking=True)

# Future: when GPU MuJoCo is available, obs will be CUDA IPC tensors
# client.obs_cuda = torch.cuda.tensor_from_ipc_handle(client.gpu_obs_handle)

Phase 8: Cleanup

  • Delete Hazel-ScriptCore/Source/Hazel/Net/Grpc/ (entire directory)
  • Remove Grpc.Core, Grpc.Tools, Google.Protobuf from project
  • Remove hazel_rpc.proto and generated gRPC code
  • Remove scripts/grpc_env.py

GPU Data Path (Future)

Current: CPU MuJoCo + Vulkan Images

                 Data Path                                                                     Copies
Proprioceptive   C# StateBuffer → memcpy → shm → numpy view → .cuda()                          1 memcpy + 1 GPU upload
Images (local)   Vulkan render → VK_EXTERNAL_MEMORY → cudaImportExternalMemory → CUDA tensor   0 (GPU-to-GPU)

For Vulkan image sharing:

  • Engine creates render targets with VkExternalMemoryImageCreateInfo (VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT)
  • Exports Win32 handle via vkGetMemoryWin32HandleKHR
  • Handle values stored in shm header
  • Python imports via CuPy cudaImportExternalMemory → device pointer → torch tensor

Future: GPU MuJoCo

                 Data Path                                                                               Copies
Proprioceptive   CUDA buffer → cudaIpcGetMemHandle → handle in shm header → Python cudaIpcOpenMemHandle   0 (same GPU memory)
Images           Same as current (already on GPU)                                                         0

Changes when GPU MuJoCo arrives:

  1. AgentBatch.StateBuffer becomes CUDA device buffer
  2. SharedMemoryTransport exports cudaIpcMemHandle_t in header
  3. Python reads handle, maps as CUDA tensor
  4. Atomic sync stays in CPU shm header (lightweight, always CPU)

Implementation Order

  1. Shared memory layout + transport (Phase 1 + 3) — get hot data path working
  2. FlatBuffers schema + codegen (Phase 2) — command channel format
  3. AgentServiceHandler (Phase 5 partial) — Step via shm, Reset/Schema via commands
  4. Python SharedMemoryClient — test 4096 envs at 50Hz
  5. Remaining service handlers (Phase 5 complete)
  6. Editor panel (Phase 6)
  7. Cleanup (Phase 8) — remove gRPC

gRPC stays operational alongside during migration. Config flag selects transport.


Files Summary

New Files

Hazel-ScriptCore/Source/Hazel/Net/Rpc/
  Schema/hazel_rpc.fbs
  Generated/                              (flatc output)
  RpcHost.cs
  RpcConfig.cs
  RpcDispatcher.cs
  RpcMainThreadDispatcher.cs
  Transport/SharedMemoryTransport.cs
  Transport/RingBuffer.cs
  Handlers/AgentServiceHandler.cs
  Handlers/MujocoServiceHandler.cs
  Handlers/TelemetryServiceHandler.cs
  Handlers/SceneServiceHandler.cs
  Handlers/CameraServiceHandler.cs
  Handlers/ViewportServiceHandler.cs
  Handlers/DebugServiceHandler.cs
  Editor/EditorRpcApi.cs
  Capture/CaptureBridge.cs                (moved from Grpc/)

Modified Files

Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs
  → CollectState() writes StateBuffer to shm + increments frame_seq
  → GetActions() reads ActionBuffer from shm when action_seq changes

Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs
  → Simplify for shm mode, decouple from protobuf SimulationContract

LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp
  → Shared memory status, service toggles

Hazel-ScriptCore project file
  → Add Google.FlatBuffers, eventually remove gRPC NuGets

Deleted Files (Phase 8)

Hazel-ScriptCore/Source/Hazel/Net/Grpc/  (entire directory)
scripts/grpc_env.py

Verification

  1. Shared memory layout: Engine creates shm, Python opens and reads header — verify all offsets correct
  2. Zero-copy proof: client.obs.ctypes.data == shm_base + obs_offset — same memory address
  3. 4096-env throughput: 4096 envs at 50Hz, measure step latency (target: <1ms batch step)
  4. Lock-step correctness: frame_seq increments exactly once per step, no missed/double actions
  5. Reset via command channel: FlatBuffer ResetRequest with SimulationContract, verify DR applies
  6. Latency benchmark: Shared memory step vs gRPC step round-trip comparison
  7. Image region: Camera frames written to shm image region, Python reads as numpy, pixels match
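Check 2 (the zero-copy proof) can be exercised standalone, without the engine:

```python
import ctypes
import numpy as np
from multiprocessing.shared_memory import SharedMemory

# A numpy view over shared memory has no backing allocation of its own:
# its data pointer is the mapped base address plus the region offset.
obs_offset = 4096
shm = SharedMemory(create=True, size=8192)
obs = np.ndarray((16, 8), dtype=np.float32, buffer=shm.buf, offset=obs_offset)

base = ctypes.addressof(ctypes.c_char.from_buffer(shm.buf))
zero_copy = obs.ctypes.data == base + obs_offset

del obs  # release the view before closing the segment
shm.close()
shm.unlink()
assert zero_copy
```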