@ethanmclark1
Created March 13, 2026 17:20
Replace gRPC with Shared Memory for Local Zero-Copy Communication

Context

LuckyEngine and luckyrobots (Python) always run on the same machine — even in cloud (GCP/AWS), both are co-located on the same node. No remote transport needed. The goal is true zero-copy shared memory replacing gRPC+Protobuf for all communication.

Key insight: AgentBatch.StateBuffer is already a flat float[num_envs * obs_size] (AgentBatch.cs:74). By allocating this directly in shared memory, agents write observations straight into shared memory with zero copies end-to-end.

Architecture

┌─────────────────────────────────────────────────┐
│              LuckyEngine (C#/C++)                │
│                                                  │
│  AgentBatch.StateBuffer ──► Shared Memory        │
│  AgentBatch.ActionBuffer ◄── Shared Memory       │
│       (direct buffer mapping, no copies)         │
│                                                  │
│  Command Channel: FlatBuffers in shm ring buffer │
│  (reset, schema, scene queries, debug, etc.)     │
└──────────────────┬───────────────────────────────┘
           Shared Memory (MemoryMappedFile / mmap)
                   │
┌──────────────────┴───────────────────────────────┐
│              luckyrobots (Python)                  │
│                                                    │
│  obs = np.ndarray(buffer=shm, ...)  # zero-copy   │
│  actions_view[:] = policy_output    # direct write │
│                                                    │
│  Command channel: FlatBuffers for reset/schema     │
└────────────────────────────────────────────────────┘

Two-Tier Design

Tier 1 — Hot Data (raw arrays in shared memory, no serialization): Observations, actions, rewards, dones — exchanged every step at 50Hz+ for up to 4096 envs.

Tier 2 — Command Channel (FlatBuffers in shared memory ring buffer): Setup, reset, schema queries, image streaming, scene inspection, debug drawing. All 7 current gRPC services migrate here.


Phase 1: Shared Memory Layout + Lock-Step Protocol

1.1 Memory Layout

┌─────────────────────────── HEADER (4096 bytes) ─────────────────────────┐
│ magic: u32 = 0x4C524243 ("LRBC")                                       │
│ version: u32 = 1                                                        │
│ server_pid: u32           # Engine process ID                           │
│ client_pid: u32           # Python process ID (set by client on attach) │
│                                                                         │
│ mode: u8                  # 0=Training (homogeneous), 1=Inference       │
│ num_envs: u32             # Number of environments                      │
│ obs_size: u32             # Observation floats per env                  │
│ act_size: u32             # Action floats per env                       │
│                                                                         │
│ frame_seq: u64 (atomic)   # Engine increments after writing all obs     │
│ action_seq: u64 (atomic)  # Client increments after writing all actions │
│                                                                         │
│ obs_offset: u64           # Byte offset to observations region          │
│ act_offset: u64           # Byte offset to actions region               │
│ rewards_offset: u64       # Byte offset to rewards                      │
│ dones_offset: u64         # Byte offset to dones                        │
│ resets_offset: u64        # Byte offset to reset flags                  │
│ cmd_s2c_offset: u64       # Server→Client command ring buffer offset    │
│ cmd_c2s_offset: u64       # Client→Server command ring buffer offset    │
│ cmd_ring_size: u32        # Size of each command ring buffer             │
│ image_offset: u64         # Byte offset to image region (0 if none)     │
│ image_width: u32                                                        │
│ image_height: u32                                                       │
│ image_channels: u32                                                     │
│                                                                         │
│ [GPU extensions - for future GPU MuJoCo]                                │
│ gpu_available: u8         # 1 if GPU buffers exported                   │
│ gpu_obs_mode: u8          # 0=CPU, 1=CUDA_IPC, 2=VULKAN_EXTERNAL       │
│ gpu_obs_handle: byte[64]  # cudaIpcMemHandle_t                          │
│ gpu_device_id: u32        # CUDA device ordinal                         │
│                                                                         │
│ reserved[...] to 4096 bytes                                             │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── HOT DATA REGION ────────────────────────────────┐
│ Observations: float[num_envs * obs_size]    # Engine writes            │
│ Actions:      float[num_envs * act_size]    # Client writes            │
│ Rewards:      float[num_envs]               # Engine writes            │
│ Dones:        u8[num_envs]                  # Engine writes            │
│ Truncateds:   u8[num_envs]                  # Engine writes            │
│ Reset flags:  u8[num_envs]                  # Client writes (1=reset) │
└────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── COMMAND RINGS (512 KB each) ────────────────────┐
│ Server→Client ring: [write_pos: u32][read_pos: u32][data...]           │
│ Client→Server ring: [write_pos: u32][read_pos: u32][data...]           │
│ Each entry: [length: u32][FlatBuffer Envelope bytes][pad to 8-align]   │
└────────────────────────────────────────────────────────────────────────┘

┌─────────────────────── IMAGE REGION (optional) ────────────────────────┐
│ byte[num_envs * width * height * channels]  # Engine writes            │
│ (allocated only when images are configured)                            │
└────────────────────────────────────────────────────────────────────────┘

Size examples:

  • 4096 envs, obs=100, act=12: ~1.9 MB hot data + 1 MB commands ≈ 3 MB total
  • 64 envs with 64×64 RGB images: ~0.1 MB hot data + 0.8 MB images ≈ 2 MB
  • 16 envs with 256×256 RGB: ~0.05 MB hot data + 3 MB images ≈ 4 MB
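The size examples can be reproduced with a short sketch (the obs/act sizes for the image configurations are assumed here, since the text only gives env counts and image dimensions):

```python
# Sanity-check the size examples above. Per the layout: float32 observations,
# actions, and rewards; one byte each for dones, truncateds, and reset flags.
HEADER = 4096
CMD_RING = 512 * 1024                       # per direction, two rings total

def region_sizes(num_envs, obs_size, act_size, img_wh=None, channels=3):
    hot = 4 * num_envs * (obs_size + act_size + 1)   # obs + actions + rewards
    hot += 3 * num_envs                              # dones, truncateds, resets
    img = num_envs * img_wh[0] * img_wh[1] * channels if img_wh else 0
    return hot, img, HEADER + hot + 2 * CMD_RING + img

hot, img, total = region_sizes(4096, 100, 12)
print(f"4096 envs: hot {hot/1e6:.2f} MB, total {total/1e6:.2f} MB")

hot, img, total = region_sizes(64, 100, 12, img_wh=(64, 64))
print(f"64 envs + 64x64 RGB: images {img/1e6:.2f} MB, total {total/1e6:.2f} MB")
```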

1.2 Lock-Step Protocol

Python                              Engine
  │                                   │
  │  1. Write actions into shm        │
  │     actions_view[:] = output      │
  │  2. Set reset_flags if needed     │
  │  3. Increment action_seq (atomic) │
  │                                   │
  │  ──── spin on frame_seq ────►     │
  │                                   │  4. Spin on action_seq
  │                                   │  5. Read actions (already in shm)
  │                                   │  6. Process reset_flags
  │                                   │  7. Step ALL envs (MuJoCo)
  │                                   │  8. Agents write obs to StateBuffer
  │                                   │     (which IS the shm obs region)
  │                                   │  9. Write rewards/dones
  │                                   │  10. Increment frame_seq (atomic)
  │  ◄──── frame_seq changed ────     │
  │                                   │
  │  11. obs numpy view ready (0-copy)│
  │  12. Read rewards, dones          │
  │  13. Compute policy               │
  │  Go to 1                          │

Sync primitives:

  • C# engine: Interlocked.Read/Exchange on memory-mapped region
  • Python: ctypes.c_uint64.from_buffer(mm, offset) for atomic-width reads
  • Spin-wait: Thread.SpinWait(1) on C#, tight loop on Python (optionally time.sleep(0) to yield)
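The handshake above can be simulated in-process; a minimal sketch with a thread per side and an anonymous map standing in for the shared region (offsets are illustrative, the real counters live in the 4 KB header):

```python
import mmap
import struct
import threading

# Engine and client run as threads; each side spins on the other's sequence
# counter exactly as in the diagram above.
FRAME_SEQ, ACTION_SEQ = 0, 8
shm = mmap.mmap(-1, 4096)

def read_u64(off):
    return struct.unpack_from("<Q", shm, off)[0]

def write_u64(off, val):
    struct.pack_into("<Q", shm, off, val)

def engine(steps):
    for step in range(1, steps + 1):
        while read_u64(ACTION_SEQ) < step:  # 4. spin until actions arrive
            pass
        # 5-9: read actions, step all envs, write obs/rewards/dones ...
        write_u64(FRAME_SEQ, step)          # 10. publish the new frame

def client(steps):
    for step in range(1, steps + 1):
        # 1-2: write actions and reset flags ...
        write_u64(ACTION_SEQ, step)         # 3. signal actions ready
        while read_u64(FRAME_SEQ) < step:   # spin until engine finishes
            pass
        # 11-13: consume obs/rewards/dones, compute next policy ...

t = threading.Thread(target=engine, args=(20,))
t.start()
client(20)
t.join()
assert read_u64(FRAME_SEQ) == read_u64(ACTION_SEQ) == 20
```

In-process, the GIL stands in for the memory barriers; across processes the real implementation relies on Interlocked/aligned-load atomics as described above.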

1.3 AgentBatch Buffer Mapping

File: Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs (line 74)

// In Setup():
if (SharedMemoryTransport.IsActive)
{
    StateBuffer = SharedMemoryTransport.GetObservationBuffer(m_Count, StateSize);
    ActionBuffer = SharedMemoryTransport.GetActionBuffer(m_Count, ActionSize);
    CtrlBuffer = new float[m_Count * ActionSize]; // Internal only, stays on heap
}
else
{
    StateBuffer = new float[m_Count * StateSize];
    ActionBuffer = new float[m_Count * ActionSize];
    CtrlBuffer = new float[m_Count * ActionSize];
}

Agents already bind to StateBuffer via offsets (line 87). No agent code changes.


Phase 2: FlatBuffers Schema (Command Channel)

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Schema/hazel_rpc.fbs

Defines all command messages. Same schema generates C# and Python code.

Key design:

  • Envelope root: union Payload + correlation_id + MethodId + error_message
  • struct for Vec3, Quat (inline, zero-copy)
  • [float] vectors for variable-size data

Message types (mapping from current proto):

  • Agent: GetAgentSchemaReq/Resp, ResetAgentReq/Resp, SimulationContract
  • MuJoCo: GetJointStateReq/Resp, SendControlReq/Resp, GetMujocoInfoReq/Resp
  • Telemetry: GetSchemaReq/Resp, StreamReq, TelemetryFrame
  • Scene: GetInfoReq/Resp, ListEntitiesReq/Resp, Get/SetEntity, SimMode
  • Camera: ListReq/Resp, StreamReq, ImageFrame
  • Viewport: GetInfoReq/Resp, StreamReq, ImageFrame
  • Debug: DrawReq/Resp

Note: StepRequest/StepResponse NOT needed in FlatBuffers — the hot path uses raw shared memory arrays directly.
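A minimal sketch of what the Envelope root might look like in FlatBuffers IDL (enum values, field names, and the payload members shown are illustrative, not the actual schema):

```
// hazel_rpc.fbs (illustrative fragment)
namespace Hazel.Rpc;

enum MethodId : uint16 { GetAgentSchema, ResetAgent, GetJointState, DebugDraw }

struct Vec3 { x: float; y: float; z: float; }   // inline, zero-copy

table ResetAgentReq  { env_ids: [int]; }
table ResetAgentResp { ok: bool; }

union Payload { ResetAgentReq, ResetAgentResp }

table Envelope {
  correlation_id: uint64;
  method: MethodId;
  error_message: string;
  payload: Payload;
}

root_type Envelope;
```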

Code generation:

flatc --csharp -o Generated/ Schema/hazel_rpc.fbs
flatc --python -o luckyrobots/generated/ Schema/hazel_rpc.fbs

Dependencies:

  • Add: Google.FlatBuffers NuGet
  • Eventually remove: Grpc.Core, Grpc.Tools, Google.Protobuf

Phase 3: Shared Memory Transport Implementation

3.1 Transport Core

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/SharedMemoryTransport.cs

public sealed class SharedMemoryTransport : IDisposable
{
    private MemoryMappedFile _mmf;
    private unsafe byte* _basePtr;
    private Thread _commandReaderThread;

    // Called during engine startup, creates the shared memory region
    public void Create(SharedMemoryConfig config);

    // Hot data: returns managed float[] backed by shared memory
    public float[] GetObservationBuffer(int numEnvs, int obsSize);
    public float[] GetActionBuffer(int numEnvs, int actSize);

    // Atomic signals
    public void IncrementFrameSeq();
    public long ReadActionSeq();
    public void SpinWaitForAction(long expectedSeq);

    // Rewards/dones/resets (engine writes, client reads/writes)
    public Span<float> GetRewardsSpan();
    public Span<byte> GetDonesSpan();
    public Span<byte> GetResetsSpan();

    // Command channel
    public void SendCommand(ReadOnlySpan<byte> flatBufferEnvelope);
    public bool TryReceiveCommand(out ArraySegment<byte> envelope);

    // Lifecycle
    public static bool IsActive { get; }
    public void Dispose();
}

3.2 Ring Buffer

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/RingBuffer.cs

Lock-free SPSC (single-producer, single-consumer) ring buffer for the command channel:

  • Write: append [length: u32][payload][padding], advance write_pos atomically
  • Read: read from read_pos, advance read_pos atomically
  • Wrap-around handling
  • Full detection: writer blocks if ring is full
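The entry framing and wrap-around handling can be sketched in Python (illustrative: a real implementation adds atomic position updates and the full-ring check noted above):

```python
import struct

# Single-producer/single-consumer ring over a flat bytearray, mirroring the
# layout above: [write_pos: u32][read_pos: u32][data...]. Each entry is
# [length: u32][payload][pad to 8-byte alignment].
class SpscRing:
    HDR = 8  # write_pos + read_pos

    def __init__(self, size):
        self.buf = bytearray(self.HDR + size)
        self.size = size

    def _pos(self, off):
        return struct.unpack_from("<I", self.buf, off)[0]

    def _set(self, off, val):
        struct.pack_into("<I", self.buf, off, val)

    def write(self, payload: bytes):
        entry = struct.pack("<I", len(payload)) + payload
        entry += b"\x00" * (-len(entry) % 8)            # pad to 8-align
        w = self._pos(0)
        for i, b in enumerate(entry):                   # handle wrap-around
            self.buf[self.HDR + (w + i) % self.size] = b
        self._set(0, (w + len(entry)) % self.size)      # publish write_pos last

    def read(self):
        r, w = self._pos(4), self._pos(0)
        if r == w:
            return None                                 # ring is empty
        raw = bytes(self.buf[self.HDR + (r + i) % self.size] for i in range(4))
        (length,) = struct.unpack("<I", raw)
        data = bytes(self.buf[self.HDR + (r + 4 + i) % self.size]
                     for i in range(length))
        advance = 4 + length + (-(4 + length) % 8)      # skip entry + padding
        self._set(4, (r + advance) % self.size)
        return data

ring = SpscRing(256)
ring.write(b"hello")
ring.write(b"flatbuffer bytes")
assert ring.read() == b"hello" and ring.read() == b"flatbuffer bytes"
assert ring.read() is None
```

Publishing `write_pos` only after the payload bytes land is what makes the SPSC pattern safe without locks: the single reader never observes a position that covers unwritten data.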

3.3 Managed Array over Shared Memory

C# float[] is a managed object with a header — you can't directly alias it over shared memory. Two approaches:

Option A (recommended): heap-backed buffers + one memcpy per step

  • Keep StateBuffer as float[] on heap, but after each step, do a single Buffer.MemoryCopy from StateBuffer to shm observation region
  • Cost: one memcpy of ~1.6 MB at 50Hz ≈ 80 MB/s — negligible
  • Benefit: no unsafe GC pinning, no agent code changes, clean separation

Option B: Pin managed array to shared memory (advanced)

  • Use GCHandle.Alloc + memory-mapped overlay
  • Fragile, requires careful GC interaction
  • Not recommended for initial implementation

Decision: Option A — one memcpy after CollectState() is practically free and keeps the code clean. The "zero-copy" win is on the Python side (reading without deserialization).

Update to AgentBatch.CollectState():

public void CollectState(RobotManager robotManager)
{
    for (int i = 0; i < m_Count; i++)
        m_Agents[i].OnState();  // Writes into StateBuffer

    if (SharedMemoryTransport.IsActive)
    {
        SharedMemoryTransport.Instance.WriteObservations(StateBuffer);
        SharedMemoryTransport.Instance.WriteRewards(/* computed rewards */);
        SharedMemoryTransport.Instance.WriteDones(/* computed dones */);
        SharedMemoryTransport.Instance.IncrementFrameSeq();
    }

    ExternalAgentRegistry.PublishState(this);
}

Similarly for actions, in GetActions():

if (SharedMemoryTransport.IsActive)
{
    SharedMemoryTransport.Instance.ReadActions(ActionBuffer);
    // Process reset flags
}

3.4 Cross-Platform Shared Memory

C# (Engine side) — MemoryMappedFile everywhere, with one caveat: named maps are Windows-only in .NET.

// On Windows, CreateNew with a map name creates a named paging-file mapping.
// On Linux/macOS, named maps throw PlatformNotSupportedException, so back the
// region with a file (e.g. under /dev/shm) via MemoryMappedFile.CreateFromFile.
var mmf = MemoryMappedFile.CreateNew($"LuckyEngine_{pid}", totalSize, MemoryMappedFileAccess.ReadWrite);
// AcquirePointer for unsafe byte* access
// Interlocked.Read/Exchange for atomic sequence counters (correct on x86-64 and ARM64)

Python (Client side) — use multiprocessing.shared_memory for cross-platform:

from multiprocessing.shared_memory import SharedMemory
shm = SharedMemory(name=f"LuckyEngine_{pid}", create=False)
obs = np.ndarray((num_envs, obs_size), dtype=np.float32, buffer=shm.buf, offset=obs_offset)
              Windows                    Linux                macOS
C# API        MemoryMappedFile           Same                 Same
OS backend    CreateFileMapping          shm_open + mmap      shm_open + mmap
Python API    SharedMemory("name")       Same                 Same
Naming        Kernel object namespace    /dev/shm/            POSIX shm or temp file
Atomics       Interlocked.* (C#)         Same                 Same (ARM64: barriers handled by Interlocked)

ARM64 note (Apple Silicon, ARM cloud instances): Interlocked.* in C# emits the required dmb barriers on ARM64. On the Python side, read sequence counters as a single aligned 64-bit load (ctypes.c_uint64.from_buffer) rather than assembling the value from raw byte reads, which can tear under ARM's weaker memory ordering.

Shared memory name passed to Python via environment variable LUCKY_SHM_NAME or command-line arg.


Phase 4: Command Channel Dispatcher

4.1 Dispatcher

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcDispatcher.cs

Reads FlatBuffer Envelopes from the command ring buffer and routes by MethodId:

public class RpcDispatcher
{
    private readonly Dictionary<MethodId, Action<Envelope>> _handlers;

    public void Poll()  // Called each frame from main thread
    {
        while (_transport.TryReceiveCommand(out var data))
        {
            var buf = new ByteBuffer(data.Array, data.Offset);
            var envelope = Envelope.GetRootAsEnvelope(buf);
            if (_handlers.TryGetValue(envelope.Method, out var handler))
                handler(envelope);
        }
    }
}
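The Python client needs the mirror of this loop to route responses and server-pushed frames; a sketch with placeholder MethodId constants and a stub queue standing in for the shm ring (the real values and payload types come from the generated schema):

```python
# Minimal dispatch-by-method-id loop for the client side. METHOD_* values and
# the (method_id, payload) tuple shape are placeholders, not the real codegen.
METHOD_RESET_RESP, METHOD_TELEMETRY_FRAME = 1, 2

class CommandDispatcher:
    def __init__(self):
        self._handlers = {}

    def on(self, method_id, handler):
        self._handlers[method_id] = handler

    def poll(self, try_receive):
        """Drain the ring; try_receive() returns (method_id, payload) or None."""
        handled = 0
        while (msg := try_receive()) is not None:
            method_id, payload = msg
            if (handler := self._handlers.get(method_id)):
                handler(payload)
                handled += 1
        return handled

# Usage with a plain list standing in for the shm ring buffer:
inbox = [(METHOD_RESET_RESP, b"ok"), (METHOD_TELEMETRY_FRAME, b"frame0")]
seen = []
d = CommandDispatcher()
d.on(METHOD_RESET_RESP, seen.append)
d.on(METHOD_TELEMETRY_FRAME, seen.append)
assert d.poll(lambda: inbox.pop(0) if inbox else None) == 2
assert seen == [b"ok", b"frame0"]
```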

4.2 MainThreadDispatcher

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcMainThreadDispatcher.cs

Copy of GrpcMainThreadDispatcher.cs — same ConcurrentQueue + Pump() pattern. Reused because command channel reads happen on a polling thread but handlers need main thread access.


Phase 5: Service Handler Migration

Create Hazel-ScriptCore/Source/Hazel/Net/Rpc/Handlers/:

Handler                      Ported From                Notes
AgentServiceHandler.cs       AgentServiceImpl.cs        Hot path bypasses commands (raw shm); Reset/Schema via command channel
MujocoServiceHandler.cs      MujocoServiceImpl.cs       JointState queries, SendControl
TelemetryServiceHandler.cs   TelemetryServiceImpl.cs    Streaming telemetry frames
SceneServiceHandler.cs       SceneServiceImpl.cs        Entity queries, sim mode
CameraServiceHandler.cs      CameraServiceImpl.cs       Image streaming (writes to shm image region)
ViewportServiceHandler.cs    ViewportServiceImpl.cs     Viewport streaming
DebugServiceHandler.cs       DebugServiceImpl.cs        Debug draw queue

5.1 ExternalAgentRegistry Changes

Modify: Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs

  • PublishState(): In shm mode, becomes just SharedMemoryTransport.WriteObservations() + IncrementFrameSeq()
  • WaitAndApplyExternalActions(): In shm mode, reads action_seq, copies from shm action region to ActionBuffer
  • Decouple from Hazel.Rpc.SimulationContract (protobuf type) → use FlatBuffer type or plain C# record
  • Remove s_LatestState / s_LatestAction thread-safe snapshots (shared memory replaces this)

5.2 ExternalActionGate

File: Hazel/src/Hazel/Learn/ExternalActionGate.h/cpp — no changes needed

The shared memory transport's polling sets ActionReady when action_seq increments, same signal as before.


Phase 6: Host + Editor Panel

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcHost.cs

  • Replaces GrpcHost.cs. Creates SharedMemoryTransport + RpcDispatcher.
  • Start/Stop/IsRunning lifecycle.

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcConfig.cs

  • Shared memory name, ring buffer size, per-service enable flags, FPS limits

Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Editor/EditorRpcApi.cs

  • Replaces EditorGrpcApi.cs — static Coral bridge methods

Rename: LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp

  • Shared memory name display, connection status, service toggles

Move: Hazel-ScriptCore/Source/Hazel/Net/Grpc/Capture/CaptureBridge.cs → Rpc/Capture/


Phase 7: Python Client (luckyrobots)

7.1 SharedMemoryClient

import mmap, ctypes, struct
import numpy as np

class SharedMemoryClient:
    def __init__(self, shm_name: str):
        # Windows path shown (named paging-file mapping via tagname); on
        # Linux/macOS, attach with multiprocessing.shared_memory.SharedMemory
        self._mm = mmap.mmap(-1, 0, tagname=shm_name, access=mmap.ACCESS_WRITE)
        self._parse_header()

        # Numpy VIEWS into shared memory — true zero-copy reads
        self.obs = np.ndarray(
            (self.num_envs, self.obs_size), dtype=np.float32,
            buffer=self._mm, offset=self._obs_offset)
        self.actions = np.ndarray(
            (self.num_envs, self.act_size), dtype=np.float32,
            buffer=self._mm, offset=self._act_offset)
        self.rewards = np.ndarray(
            (self.num_envs,), dtype=np.float32,
            buffer=self._mm, offset=self._rewards_offset)
        self.dones = np.ndarray(
            (self.num_envs,), dtype=np.uint8,
            buffer=self._mm, offset=self._dones_offset)
        self.resets = np.ndarray(
            (self.num_envs,), dtype=np.uint8,
            buffer=self._mm, offset=self._resets_offset)

    def step(self, actions: np.ndarray):
        """Zero-copy vectorized step."""
        self.actions[:] = actions                   # Direct write to shm
        self._increment_action_seq()
        self._spin_wait_frame_seq()
        return self.obs, self.rewards, self.dones   # Numpy views, no copy

    def reset(self, env_ids=None, contract=None):
        """Reset specific envs. Contract sent via command channel."""
        if env_ids is None:
            self.resets[:] = 1
        else:
            self.resets[env_ids] = 1
        if contract:
            self._send_command(build_reset_contract_fb(contract))

    def get_schema(self):
        """Query agent schema via command channel."""
        self._send_command(build_get_schema_fb())
        return self._recv_command_blocking()
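The view mechanics that step() relies on can be exercised without the engine; a self-contained sketch where this process plays both sides (offsets and sizes are illustrative, not the real header layout):

```python
import numpy as np
from multiprocessing.shared_memory import SharedMemory

# Fabricate a small region and build the same kind of views the client uses.
num_envs, obs_size, act_size = 4, 3, 2
obs_off = 4096
act_off = obs_off + 4 * num_envs * obs_size

shm = SharedMemory(create=True, size=8192)
obs = np.ndarray((num_envs, obs_size), dtype=np.float32,
                 buffer=shm.buf, offset=obs_off)
actions = np.ndarray((num_envs, act_size), dtype=np.float32,
                     buffer=shm.buf, offset=act_off)

# "Engine" publishes observations straight into the region...
obs[:] = np.arange(num_envs * obs_size, dtype=np.float32).reshape(obs.shape)
obs_sum = float(obs.sum())          # 0 + 1 + ... + 11 = 66.0

# ..."client" writes actions directly, no serialization in between.
actions[:] = 0.5
act_ok = float(actions[0, 0]) == 0.5

del obs, actions                    # release views before closing the segment
shm.close()
shm.unlink()
```

Note the cleanup order: numpy views hold buffer exports on the segment, so they must be dropped before close()/unlink().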

7.2 GPU Integration (Python side)

# CPU observations → GPU tensors for training
obs_tensor = torch.from_numpy(client.obs).pin_memory().cuda(non_blocking=True)

# Future: when GPU MuJoCo is available, obs will be CUDA IPC tensors
# client.obs_cuda = torch.cuda.tensor_from_ipc_handle(client.gpu_obs_handle)

Phase 8: Cleanup

  • Delete Hazel-ScriptCore/Source/Hazel/Net/Grpc/ (entire directory)
  • Remove Grpc.Core, Grpc.Tools, Google.Protobuf from project
  • Remove hazel_rpc.proto and generated gRPC code
  • Remove scripts/grpc_env.py

GPU Data Path (Future)

Current: CPU MuJoCo + Vulkan Images

                 Data Path                                                                     Copies
Proprioceptive   C# StateBuffer → memcpy → shm → numpy view → .cuda()                          1 memcpy + 1 GPU upload
Images (local)   Vulkan render → VK_EXTERNAL_MEMORY → cudaImportExternalMemory → CUDA tensor   0 (GPU-to-GPU)

For Vulkan image sharing:

  • Engine creates render targets with VkExternalMemoryImageCreateInfo (VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT)
  • Exports Win32 handle via vkGetMemoryWin32HandleKHR
  • Handle values stored in shm header
  • Python imports via CuPy cudaImportExternalMemory → device pointer → torch tensor

Future: GPU MuJoCo

                 Data Path                                                                               Copies
Proprioceptive   CUDA buffer → cudaIpcGetMemHandle → handle in shm header → Python cudaIpcOpenMemHandle   0 (same GPU memory)
Images           Same as current (already on GPU)                                                         0

Changes when GPU MuJoCo arrives:

  1. AgentBatch.StateBuffer becomes CUDA device buffer
  2. SharedMemoryTransport exports cudaIpcMemHandle_t in header
  3. Python reads handle, maps as CUDA tensor
  4. Atomic sync stays in CPU shm header (lightweight, always CPU)

Implementation Order

  1. Shared memory layout + transport (Phase 1 + 3) — get hot data path working
  2. FlatBuffers schema + codegen (Phase 2) — command channel format
  3. AgentServiceHandler (Phase 5 partial) — Step via shm, Reset/Schema via commands
  4. Python SharedMemoryClient — test 4096 envs at 50Hz
  5. Remaining service handlers (Phase 5 complete)
  6. Editor panel (Phase 6)
  7. Cleanup (Phase 8) — remove gRPC

gRPC stays operational alongside during migration. Config flag selects transport.


Files Summary

New Files

Hazel-ScriptCore/Source/Hazel/Net/Rpc/
  Schema/hazel_rpc.fbs
  Generated/                              (flatc output)
  RpcHost.cs
  RpcConfig.cs
  RpcDispatcher.cs
  RpcMainThreadDispatcher.cs
  Transport/SharedMemoryTransport.cs
  Transport/RingBuffer.cs
  Handlers/AgentServiceHandler.cs
  Handlers/MujocoServiceHandler.cs
  Handlers/TelemetryServiceHandler.cs
  Handlers/SceneServiceHandler.cs
  Handlers/CameraServiceHandler.cs
  Handlers/ViewportServiceHandler.cs
  Handlers/DebugServiceHandler.cs
  Editor/EditorRpcApi.cs
  Capture/CaptureBridge.cs                (moved from Grpc/)

Modified Files

Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs
  → CollectState() writes StateBuffer to shm + increments frame_seq
  → GetActions() reads ActionBuffer from shm when action_seq changes

Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs
  → Simplify for shm mode, decouple from protobuf SimulationContract

LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp
  → Shared memory status, service toggles

Hazel-ScriptCore project file
  → Add Google.FlatBuffers, eventually remove gRPC NuGets

Deleted Files (Phase 8)

Hazel-ScriptCore/Source/Hazel/Net/Grpc/  (entire directory)
scripts/grpc_env.py

Verification

  1. Shared memory layout: Engine creates shm, Python opens and reads header — verify all offsets correct
  2. Zero-copy proof: client.obs.ctypes.data == shm_base + obs_offset — same memory address
  3. 4096-env throughput: 4096 envs at 50Hz, measure step latency (target: <1ms batch step)
  4. Lock-step correctness: frame_seq increments exactly once per step, no missed/double actions
  5. Reset via command channel: FlatBuffer ResetRequest with SimulationContract, verify DR applies
  6. Latency benchmark: Shared memory step vs gRPC step round-trip comparison
  7. Image region: Camera frames written to shm image region, Python reads as numpy, pixels match
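Check 2 (the zero-copy proof) can be exercised standalone, without the engine:

```python
import ctypes
import numpy as np
from multiprocessing.shared_memory import SharedMemory

# A numpy view over shared memory has no backing allocation of its own:
# its data pointer is the mapped base address plus the region offset.
obs_offset = 4096
shm = SharedMemory(create=True, size=8192)
obs = np.ndarray((16, 8), dtype=np.float32, buffer=shm.buf, offset=obs_offset)

base = ctypes.addressof(ctypes.c_char.from_buffer(shm.buf))
zero_copy = obs.ctypes.data == base + obs_offset

del obs  # release the view before closing the segment
shm.close()
shm.unlink()
assert zero_copy
```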