LuckyEngine and luckyrobots (Python) always run on the same machine — even in cloud (GCP/AWS), both are co-located on the same node. No remote transport needed. The goal is true zero-copy shared memory replacing gRPC+Protobuf for all communication.
Key insight: AgentBatch.StateBuffer is already a flat float[num_envs * obs_size] (AgentBatch.cs:74). By allocating this directly in shared memory, agents write observations straight into shared memory with zero copies end-to-end.
┌─────────────────────────────────────────────────┐
│ LuckyEngine (C#/C++) │
│ │
│ AgentBatch.StateBuffer ──► Shared Memory │
│ AgentBatch.ActionBuffer ◄── Shared Memory │
│ (direct buffer mapping, no copies) │
│ │
│ Command Channel: FlatBuffers in shm ring buffer │
│ (reset, schema, scene queries, debug, etc.) │
└──────────────────┬───────────────────────────────┘
Shared Memory (MemoryMappedFile / mmap)
│
┌──────────────────┴───────────────────────────────┐
│ luckyrobots (Python) │
│ │
│ obs = np.ndarray(buffer=shm, ...) # zero-copy │
│ actions_view[:] = policy_output # direct write │
│ │
│ Command channel: FlatBuffers for reset/schema │
└────────────────────────────────────────────────────┘
Tier 1 — Hot Data (raw arrays in shared memory, no serialization): Observations, actions, rewards, dones — exchanged every step at 50Hz+ for up to 4096 envs.
Tier 2 — Command Channel (FlatBuffers in shared memory ring buffer): Setup, reset, schema queries, image streaming, scene inspection, debug drawing. All 7 current gRPC services migrate here.
┌─────────────────────────── HEADER (4096 bytes) ─────────────────────────┐
│ magic: u32 = 0x4C524243 ("LRBC") │
│ version: u32 = 1 │
│ server_pid: u32 # Engine process ID │
│ client_pid: u32 # Python process ID (set by client on attach) │
│ │
│ mode: u8 # 0=Training (homogeneous), 1=Inference │
│ num_envs: u32 # Number of environments │
│ obs_size: u32 # Observation floats per env │
│ act_size: u32 # Action floats per env │
│ │
│ frame_seq: u64 (atomic) # Engine increments after writing all obs │
│ action_seq: u64 (atomic) # Client increments after writing all actions │
│ │
│ obs_offset: u64 # Byte offset to observations region │
│ act_offset: u64 # Byte offset to actions region │
│ rewards_offset: u64 # Byte offset to rewards │
│ dones_offset: u64 # Byte offset to dones │
│ resets_offset: u64 # Byte offset to reset flags │
│ cmd_s2c_offset: u64 # Server→Client command ring buffer offset │
│ cmd_c2s_offset: u64 # Client→Server command ring buffer offset │
│ cmd_ring_size: u32 # Size of each command ring buffer │
│ image_offset: u64 # Byte offset to image region (0 if none) │
│ image_width: u32 │
│ image_height: u32 │
│ image_channels: u32 │
│ │
│ [GPU extensions - for future GPU MuJoCo] │
│ gpu_available: u8 # 1 if GPU buffers exported │
│ gpu_obs_mode: u8 # 0=CPU, 1=CUDA_IPC, 2=VULKAN_EXTERNAL │
│ gpu_obs_handle: byte[64] # cudaIpcMemHandle_t │
│ gpu_device_id: u32 # CUDA device ordinal │
│ │
│ reserved[...] to 4096 bytes │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────── HOT DATA REGION ────────────────────────────────┐
│ Observations: float[num_envs * obs_size] # Engine writes │
│ Actions: float[num_envs * act_size] # Client writes │
│ Rewards: float[num_envs] # Engine writes │
│ Dones: u8[num_envs] # Engine writes │
│ Truncateds: u8[num_envs] # Engine writes │
│ Reset flags: u8[num_envs] # Client writes (1=reset) │
└────────────────────────────────────────────────────────────────────────┘
┌─────────────────────── COMMAND RINGS (512 KB each) ────────────────────┐
│ Server→Client ring: [write_pos: u32][read_pos: u32][data...] │
│ Client→Server ring: [write_pos: u32][read_pos: u32][data...] │
│ Each entry: [length: u32][FlatBuffer Envelope bytes][pad to 8-align] │
└────────────────────────────────────────────────────────────────────────┘
┌─────────────────────── IMAGE REGION (optional) ────────────────────────┐
│ byte[num_envs * width * height * channels] # Engine writes │
│ (allocated only when images are configured) │
└────────────────────────────────────────────────────────────────────────┘
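The header block above is easy to exercise with a small struct round-trip. The exact field packing below (little-endian, `mode` padded to 4 bytes) is an assumption for illustration only; the engine fixes the authoritative offsets.

```python
import struct

MAGIC = 0x4C524243  # "LRBC"
HEADER_PREFIX = struct.Struct("<IIIIB3xIII")
# fields: magic, version, server_pid, client_pid, mode, num_envs, obs_size, act_size

def parse_header_prefix(buf) -> dict:
    fields = HEADER_PREFIX.unpack_from(buf, 0)
    magic, version, server_pid, client_pid, mode, num_envs, obs_size, act_size = fields
    if magic != MAGIC:
        raise ValueError(f"not a LuckyEngine region (magic {magic:#x})")
    return {"version": version, "server_pid": server_pid, "mode": mode,
            "num_envs": num_envs, "obs_size": obs_size, "act_size": act_size}

# Simulate the engine writing its side of the 4096-byte header:
header = bytearray(4096)
HEADER_PREFIX.pack_into(header, 0, MAGIC, 1, 1234, 0, 0, 4096, 100, 12)
hdr = parse_header_prefix(header)
```

This is the shape of the validation the Python client would run on attach: check magic, then read dimensions before building any numpy views.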
Size examples:
- 4096 envs, obs=100, act=12: ~1.9 MB hot data + 1 MB commands ≈ 3 MB total
- 64 envs with 64×64 RGB images: ~0.1 MB hot data + 0.8 MB images ≈ 2 MB
- 16 envs with 256×256 RGB: ~0.05 MB hot data + 3 MB images ≈ 4 MB
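The size examples reduce to a few lines of arithmetic. This helper mirrors the hot-data layout above (header excluded, two 512 KB rings assumed, GPU extras ignored):

```python
def region_sizes(num_envs: int, obs_size: int, act_size: int,
                 img_wh=None, channels: int = 3):
    """Byte budget per the layout above (header excluded; two 512 KB rings assumed)."""
    f32 = 4
    hot = num_envs * (obs_size + act_size) * f32  # observations + actions
    hot += num_envs * f32                         # rewards
    hot += num_envs * 3                           # dones + truncateds + reset flags (u8 each)
    rings = 2 * 512 * 1024                        # two command rings
    images = 0
    if img_wh is not None:
        width, height = img_wh
        images = num_envs * width * height * channels
    return hot, rings, images

hot, rings, _ = region_sizes(4096, 100, 12)             # ~1.9 MB hot + 1 MB rings
_, _, cam = region_sizes(64, 100, 12, img_wh=(64, 64))  # ~0.8 MB of 64x64 RGB frames
```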
Python Engine
│ │
│ 1. Write actions into shm │
│ actions_view[:] = output │
│ 2. Set reset_flags if needed │
│ 3. Increment action_seq (atomic) │
│ │
│ ──── spin on frame_seq ────► │
│ │ 4. Spin on action_seq
│ │ 5. Read actions (already in shm)
│ │ 6. Process reset_flags
│ │ 7. Step ALL envs (MuJoCo)
│ │ 8. Agents write obs to StateBuffer
│ │ (which IS the shm obs region)
│ │ 9. Write rewards/dones
│ │ 10. Increment frame_seq (atomic)
│ ◄──── frame_seq changed ──── │
│ │
│ 11. obs numpy view ready (0-copy)│
│ 12. Read rewards, dones │
│ 13. Compute policy │
│ Go to 1 │
Sync primitives:
- C# engine: Interlocked.Read/Exchange on the memory-mapped region
- Python: ctypes.c_uint64.from_buffer(mm, offset) for atomic-width reads
- Spin-wait: Thread.SpinWait(1) on C#; tight loop on Python (optionally time.sleep(0) to yield)
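The handshake above can be demonstrated in-process. A bytearray stands in for the real mmap, the counter/obs offsets (0, 8, 16) are made up, and CPython's GIL makes these u64 accesses safe enough for a demo; the real design relies on Interlocked.* on the engine side.

```python
import ctypes
import threading
import time

import numpy as np

shm = bytearray(1024)
frame_seq = ctypes.c_uint64.from_buffer(shm, 0)
action_seq = ctypes.c_uint64.from_buffer(shm, 8)
obs = np.frombuffer(shm, dtype=np.float32, count=4, offset=16)

def engine(steps: int) -> None:
    last = 0
    for _ in range(steps):
        while action_seq.value == last:   # 4. spin until the client publishes actions
            time.sleep(0)
        last = action_seq.value
        obs[:] = last                     # 7-9. step envs, write observations
        frame_seq.value += 1              # 10. publish the frame

t = threading.Thread(target=engine, args=(3,))
t.start()
for _ in range(3):
    seen = frame_seq.value
    action_seq.value += 1                 # 1-3. write actions, bump action_seq
    while frame_seq.value == seen:        # spin on frame_seq
        time.sleep(0)                     # yield instead of burning a core
t.join()
```

After three round trips both counters read 3 and the last published observation is visible through the client-side view.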
File: Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs (line 74)
// In Setup():
if (SharedMemoryTransport.IsActive)
{
StateBuffer = SharedMemoryTransport.GetObservationBuffer(m_Count, StateSize);
ActionBuffer = SharedMemoryTransport.GetActionBuffer(m_Count, ActionSize);
CtrlBuffer = new float[m_Count * ActionSize]; // Internal only, stays on heap
}
else
{
StateBuffer = new float[m_Count * StateSize];
ActionBuffer = new float[m_Count * ActionSize];
CtrlBuffer = new float[m_Count * ActionSize];
}
Agents already bind to StateBuffer via offsets (line 87). No agent code changes.
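The binding scheme is easy to picture in numpy terms: agents hold views at offset i * obs_size into one flat buffer, so swapping the backing storage (heap array vs shared memory) never touches agent code. Sizes here are illustrative.

```python
import numpy as np

num_envs, obs_size = 4, 3
state_buffer = np.zeros(num_envs * obs_size, dtype=np.float32)  # StateBuffer analog
agent_views = [state_buffer[i * obs_size:(i + 1) * obs_size] for i in range(num_envs)]

for i, view in enumerate(agent_views):
    view[:] = i                                   # each agent writes only its slice

batched = state_buffer.reshape(num_envs, obs_size)  # zero-copy (num_envs, obs_size) view
```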
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Schema/hazel_rpc.fbs
Defines all command messages. Same schema generates C# and Python code.
Key design:
- Envelope root: union Payload + correlation_id + MethodId + error_message
- struct for Vec3, Quat (inline, zero-copy)
- [float] vectors for variable-size data
Message types (mapping from current proto):
- Agent: GetAgentSchemaReq/Resp, ResetAgentReq/Resp, SimulationContract
- MuJoCo: GetJointStateReq/Resp, SendControlReq/Resp, GetMujocoInfoReq/Resp
- Telemetry: GetSchemaReq/Resp, StreamReq, TelemetryFrame
- Scene: GetInfoReq/Resp, ListEntitiesReq/Resp, Get/SetEntity, SimMode
- Camera: ListReq/Resp, StreamReq, ImageFrame
- Viewport: GetInfoReq/Resp, StreamReq, ImageFrame
- Debug: DrawReq/Resp
Note: StepRequest/StepResponse NOT needed in FlatBuffers — the hot path uses raw shared memory arrays directly.
Code generation:
flatc --csharp -o Generated/ Schema/hazel_rpc.fbs
flatc --python -o luckyrobots/generated/ Schema/hazel_rpc.fbs
Dependencies:
- Add: Google.FlatBuffers NuGet
- Eventually remove: Grpc.Core, Grpc.Tools, Google.Protobuf
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/SharedMemoryTransport.cs
public sealed class SharedMemoryTransport : IDisposable
{
private MemoryMappedFile _mmf;
private unsafe byte* _basePtr;
private Thread _commandReaderThread;
// Called during engine startup, creates the shared memory region
public void Create(SharedMemoryConfig config);
// Hot data: returns managed float[] backed by shared memory
public float[] GetObservationBuffer(int numEnvs, int obsSize);
public float[] GetActionBuffer(int numEnvs, int actSize);
// Atomic signals
public void IncrementFrameSeq();
public long ReadActionSeq();
public void SpinWaitForAction(long expectedSeq);
// Rewards/dones/resets (engine writes, client reads/writes)
public Span<float> GetRewardsSpan();
public Span<byte> GetDonesSpan();
public Span<byte> GetResetsSpan();
// Command channel
public void SendCommand(ReadOnlySpan<byte> flatBufferEnvelope);
public bool TryReceiveCommand(out ArraySegment<byte> envelope);
// Lifecycle
public static bool IsActive { get; }
public void Dispose();
}
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Transport/RingBuffer.cs
Lock-free SPSC (single-producer, single-consumer) ring buffer for the command channel:
- Write: append [length: u32][payload][padding], advance write_pos atomically
- Read: consume from read_pos, advance read_pos atomically
- Wrap-around handling
- Full detection: writer blocks if the ring is full
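The entry format above can be sketched in a few dozen lines. Positions are plain ints here, whereas the real ring keeps write_pos/read_pos as atomics in shared memory, and this sketch assumes entries never straddle the wrap point:

```python
import struct

class SpscRing:
    """Single-producer/single-consumer byte ring with
    [length: u32][payload][pad to 8-align] entries (sketch only)."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.write_pos = 0
        self.read_pos = 0

    def _free_bytes(self) -> int:
        return self.size - (self.write_pos - self.read_pos)

    def try_write(self, payload: bytes) -> bool:
        entry = 4 + len(payload)
        entry += (-entry) % 8                     # pad to 8-byte alignment
        if entry > self._free_bytes():
            return False                          # full: the real writer blocks
        pos = self.write_pos % self.size
        struct.pack_into("<I", self.buf, pos, len(payload))
        self.buf[pos + 4:pos + 4 + len(payload)] = payload
        self.write_pos += entry                   # real code: atomic store
        return True

    def try_read(self):
        if self.read_pos == self.write_pos:
            return None                           # empty
        pos = self.read_pos % self.size
        (length,) = struct.unpack_from("<I", self.buf, pos)
        payload = bytes(self.buf[pos + 4:pos + 4 + length])
        entry = 4 + length
        self.read_pos += entry + (-entry) % 8     # real code: atomic store
        return payload

ring = SpscRing(256)
ring.try_write(b"reset")
ring.try_write(b"schema?")
first = ring.try_read()   # b"reset" (FIFO order)
```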
C# float[] is a managed object with a header — you can't directly alias it over shared memory. Two approaches:
Option A (recommended): keep heap arrays, publish with one memcpy
- Keep StateBuffer as float[] on the heap; after each step, do a single Buffer.MemoryCopy from StateBuffer to the shm observation region
- Cost: one memcpy of ~1.6 MB at 50Hz ≈ 80 MB/s — negligible
- Benefit: no unsafe GC pinning, no agent code changes, clean separation
Option B: pin managed array over shared memory (advanced)
- Use GCHandle.Alloc + a memory-mapped overlay
- Fragile, requires careful GC interaction
- Not recommended for the initial implementation
Decision: Option A — one memcpy after CollectState() is practically free and keeps the code clean. The "zero-copy" win is on the Python side (reading without deserialization).
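The cost estimate checks out with simple arithmetic for the 4096-env example; np.copyto below stands in for Buffer.MemoryCopy to show what one publish step moves.

```python
import numpy as np

obs_bytes = 4096 * 100 * 4          # ~1.6 MB per step
bytes_per_sec = obs_bytes * 50      # at 50 Hz, ~80 MB/s as estimated above

src = np.arange(4096 * 100, dtype=np.float32)   # stand-in for StateBuffer
dst = np.empty_like(src)                        # stand-in for the shm obs region
np.copyto(dst, src)                             # the single publish copy
```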
Update to AgentBatch.CollectState():
public void CollectState(RobotManager robotManager)
{
for (int i = 0; i < m_Count; i++)
m_Agents[i].OnState(); // Writes into StateBuffer
if (SharedMemoryTransport.IsActive)
{
SharedMemoryTransport.Instance.WriteObservations(StateBuffer);
SharedMemoryTransport.Instance.WriteRewards(/* computed rewards */);
SharedMemoryTransport.Instance.WriteDones(/* computed dones */);
SharedMemoryTransport.Instance.IncrementFrameSeq();
}
ExternalAgentRegistry.PublishState(this);
}
Similarly for actions, in GetActions():
if (SharedMemoryTransport.IsActive)
{
SharedMemoryTransport.Instance.ReadActions(ActionBuffer);
// Process reset flags
}
C# (Engine side) — same code on all platforms:
// .NET MemoryMappedFile is cross-platform — runtime handles OS differences
var mmf = MemoryMappedFile.CreateNew($"LuckyEngine_{pid}", totalSize, MemoryMappedFileAccess.ReadWrite);
// AcquirePointer for unsafe byte* access
// Interlocked.Read/Exchange for atomic sequence counters (correct on x86-64 and ARM64)
Python (Client side) — use multiprocessing.shared_memory for cross-platform:
from multiprocessing.shared_memory import SharedMemory
shm = SharedMemory(name=f"LuckyEngine_{pid}", create=False)
obs = np.ndarray((num_envs, obs_size), dtype=np.float32, buffer=shm.buf, offset=obs_offset)

| | Windows | Linux | macOS |
|---|---|---|---|
| C# API | MemoryMappedFile | Same | Same |
| OS backend | CreateFileMapping | shm_open + mmap | shm_open + mmap |
| Python API | SharedMemory("name") | Same | Same |
| Naming | Kernel object namespace | /dev/shm/ filesystem | POSIX shm or temp file |
| Atomics | Interlocked.* (C#) | Same | Same (ARM64: barriers handled by Interlocked) |
ARM64 note (Apple Silicon, ARM cloud instances): Interlocked.* in C# emits proper dmb barriers on ARM64. On the Python side, use ctypes with explicit threading.Event or atomic wrapper rather than raw reads for sequence counters on ARM.
Shared memory name passed to Python via environment variable LUCKY_SHM_NAME or command-line arg.
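The discovery step can be sketched with one process playing both roles: the engine exports the region name via LUCKY_SHM_NAME and the client attaches by that name. The 4096-byte header offset follows the layout above; the sizes are made up.

```python
import os

import numpy as np
from multiprocessing.shared_memory import SharedMemory

name = f"LuckyEngine_{os.getpid()}"
engine = SharedMemory(name=name, create=True, size=4096 + 16)
os.environ["LUCKY_SHM_NAME"] = name

client = SharedMemory(name=os.environ["LUCKY_SHM_NAME"], create=False)
engine_obs = np.ndarray((4,), dtype=np.float32, buffer=engine.buf, offset=4096)
client_obs = np.ndarray((4,), dtype=np.float32, buffer=client.buf, offset=4096)

engine_obs[:] = 7.0                         # engine writes...
aliased = bool((client_obs == 7.0).all())   # ...client sees it without a copy

del engine_obs, client_obs                  # release views before closing the maps
client.close()
engine.close()
engine.unlink()
```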
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcDispatcher.cs
Reads FlatBuffer Envelopes from the command ring buffer and routes by MethodId:
public class RpcDispatcher
{
private readonly Dictionary<MethodId, Action<Envelope>> _handlers;
public void Poll() // Called each frame from main thread
{
while (_transport.TryReceiveCommand(out var data))
{
var buf = new ByteBuffer(data.Array, data.Offset);
var envelope = Envelope.GetRootAsEnvelope(buf);
if (_handlers.TryGetValue(envelope.Method, out var handler))
handler(envelope);
}
}
}
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcMainThreadDispatcher.cs
Copy of GrpcMainThreadDispatcher.cs — same ConcurrentQueue + Pump() pattern. Reused because command channel reads happen on a polling thread but handlers need main thread access.
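The ConcurrentQueue + Pump() pattern renders naturally in Python: the command-polling thread enqueues closures and the main loop drains them once per frame, so handlers always run with main-thread access. Names here are illustrative, not the C# API.

```python
import queue
from typing import Callable

class MainThreadDispatcher:
    def __init__(self) -> None:
        self._queue: "queue.Queue[Callable[[], None]]" = queue.Queue()

    def enqueue(self, fn: Callable[[], None]) -> None:
        """Safe to call from the polling thread."""
        self._queue.put(fn)

    def pump(self) -> int:
        """Drain pending handlers on the main thread; returns how many ran."""
        ran = 0
        while True:
            try:
                fn = self._queue.get_nowait()
            except queue.Empty:
                return ran
            fn()
            ran += 1

dispatcher = MainThreadDispatcher()
handled = []
dispatcher.enqueue(lambda: handled.append("reset"))
dispatcher.enqueue(lambda: handled.append("schema"))
drained = dispatcher.pump()
```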
Create Hazel-ScriptCore/Source/Hazel/Net/Rpc/Handlers/:
| Handler | Ported From | Notes |
|---|---|---|
| AgentServiceHandler.cs | AgentServiceImpl.cs | Hot path bypasses commands (raw shm). Reset/Schema via command channel. |
| MujocoServiceHandler.cs | MujocoServiceImpl.cs | JointState queries, SendControl |
| TelemetryServiceHandler.cs | TelemetryServiceImpl.cs | Streaming telemetry frames |
| SceneServiceHandler.cs | SceneServiceImpl.cs | Entity queries, sim mode |
| CameraServiceHandler.cs | CameraServiceImpl.cs | Image streaming (writes to shm image region) |
| ViewportServiceHandler.cs | ViewportServiceImpl.cs | Viewport streaming |
| DebugServiceHandler.cs | DebugServiceImpl.cs | Debug draw queue |
Modify: Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs
- PublishState(): in shm mode, becomes just SharedMemoryTransport.WriteObservations() + IncrementFrameSeq()
- WaitAndApplyExternalActions(): in shm mode, reads action_seq and copies from the shm action region to ActionBuffer
- Decouple from Hazel.Rpc.SimulationContract (protobuf type) → use a FlatBuffer type or plain C# record
- Remove the s_LatestState/s_LatestAction thread-safe snapshots (shared memory replaces them)
File: Hazel/src/Hazel/Learn/ExternalActionGate.h/cpp — no changes needed
The shared memory transport's polling sets ActionReady when action_seq increments, same signal as before.
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcHost.cs
- Replaces GrpcHost.cs; creates SharedMemoryTransport + RpcDispatcher
- Start/Stop/IsRunning lifecycle
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/RpcConfig.cs
- Shared memory name, ring buffer size, per-service enable flags, FPS limits
Create: Hazel-ScriptCore/Source/Hazel/Net/Rpc/Editor/EditorRpcApi.cs
- Replaces EditorGrpcApi.cs — static Coral bridge methods
Rename: LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp
- Shared memory name display, connection status, service toggles
Move: Hazel-ScriptCore/Source/Hazel/Net/Grpc/Capture/CaptureBridge.cs → Rpc/Capture/
import mmap, ctypes, struct
import numpy as np
class SharedMemoryClient:
def __init__(self, shm_name: str):
# Windows: named shared memory
self._mm = mmap.mmap(-1, 0, tagname=shm_name, access=mmap.ACCESS_WRITE)
self._parse_header()
# Numpy VIEWS into shared memory — true zero-copy reads
self.obs = np.ndarray(
(self.num_envs, self.obs_size), dtype=np.float32,
buffer=self._mm, offset=self._obs_offset)
self.actions = np.ndarray(
(self.num_envs, self.act_size), dtype=np.float32,
buffer=self._mm, offset=self._act_offset)
self.rewards = np.ndarray(
(self.num_envs,), dtype=np.float32,
buffer=self._mm, offset=self._rewards_offset)
self.dones = np.ndarray(
(self.num_envs,), dtype=np.uint8,
buffer=self._mm, offset=self._dones_offset)
self.resets = np.ndarray(
(self.num_envs,), dtype=np.uint8,
buffer=self._mm, offset=self._resets_offset)
def step(self, actions: np.ndarray):
"""Zero-copy vectorized step."""
self.actions[:] = actions # Direct write to shm
self._increment_action_seq()
self._spin_wait_frame_seq()
return self.obs, self.rewards, self.dones # Numpy views, no copy
def reset(self, env_ids=None, contract=None):
"""Reset specific envs. Contract sent via command channel."""
if env_ids is None:
self.resets[:] = 1
else:
self.resets[env_ids] = 1
if contract:
self._send_command(build_reset_contract_fb(contract))
def get_schema(self):
"""Query agent schema via command channel."""
self._send_command(build_get_schema_fb())
return self._recv_command_blocking()

# CPU observations → GPU tensors for training
obs_tensor = torch.from_numpy(client.obs).pin_memory().cuda(non_blocking=True)
# Future: when GPU MuJoCo is available, obs will be CUDA IPC tensors
# client.obs_cuda = torch.cuda.tensor_from_ipc_handle(client.gpu_obs_handle)

- Delete Hazel-ScriptCore/Source/Hazel/Net/Grpc/ (entire directory)
- Remove Grpc.Core, Grpc.Tools, Google.Protobuf from the project
- Remove hazel_rpc.proto and generated gRPC code
- Remove scripts/grpc_env.py
| Data | Path | Copies |
|---|---|---|
| Proprioceptive | C# StateBuffer → memcpy → shm → numpy view → .cuda() | 1 memcpy + 1 GPU upload |
| Images (local) | Vulkan render → VK_EXTERNAL_MEMORY → cudaImportExternalMemory → CUDA tensor | 0 (GPU-to-GPU) |
For Vulkan image sharing:
- Engine creates render targets with VkExternalMemoryImageCreateInfo (VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT)
- Exports a Win32 handle via vkGetMemoryWin32HandleKHR
- Handle values stored in the shm header
- Python imports via CuPy cudaImportExternalMemory → device pointer → torch tensor
| Data | Path | Copies |
|---|---|---|
| Proprioceptive | CUDA buffer → cudaIpcGetMemHandle → handle in shm header → Python cudaIpcOpenMemHandle | 0 (same GPU memory) |
| Images | Same as current (already on GPU) | 0 |
Changes when GPU MuJoCo arrives:
- AgentBatch.StateBuffer becomes a CUDA device buffer
- SharedMemoryTransport exports cudaIpcMemHandle_t in the header
- Python reads the handle and maps it as a CUDA tensor
- Atomic sync stays in the CPU shm header (lightweight, always CPU)
- Shared memory layout + transport (Phase 1 + 3) — get hot data path working
- FlatBuffers schema + codegen (Phase 2) — command channel format
- AgentServiceHandler (Phase 5 partial) — Step via shm, Reset/Schema via commands
- Python SharedMemoryClient — test 4096 envs at 50Hz
- Remaining service handlers (Phase 5 complete)
- Editor panel (Phase 6)
- Cleanup (Phase 8) — remove gRPC
gRPC stays operational alongside during migration. Config flag selects transport.
Hazel-ScriptCore/Source/Hazel/Net/Rpc/
Schema/hazel_rpc.fbs
Generated/ (flatc output)
RpcHost.cs
RpcConfig.cs
RpcDispatcher.cs
RpcMainThreadDispatcher.cs
Transport/SharedMemoryTransport.cs
Transport/RingBuffer.cs
Handlers/AgentServiceHandler.cs
Handlers/MujocoServiceHandler.cs
Handlers/TelemetryServiceHandler.cs
Handlers/SceneServiceHandler.cs
Handlers/CameraServiceHandler.cs
Handlers/ViewportServiceHandler.cs
Handlers/DebugServiceHandler.cs
Editor/EditorRpcApi.cs
Capture/CaptureBridge.cs (moved from Grpc/)
Hazel-ScriptCore/Source/Hazel/Learn/AgentBatch.cs
→ CollectState() writes StateBuffer to shm + increments frame_seq
→ GetActions() reads ActionBuffer from shm when action_seq changes
Hazel-ScriptCore/Source/Hazel/Learn/Utilities/ExternalAgentRegistry.cs
→ Simplify for shm mode, decouple from protobuf SimulationContract
LuckyEditor/src/Panels/GrpcPanel.h/cpp → RpcPanel.h/cpp
→ Shared memory status, service toggles
Hazel-ScriptCore project file
→ Add Google.FlatBuffers, eventually remove gRPC NuGets
Hazel-ScriptCore/Source/Hazel/Net/Grpc/ (entire directory)
scripts/grpc_env.py
- Shared memory layout: Engine creates shm, Python opens and reads header — verify all offsets correct
- Zero-copy proof: client.obs.ctypes.data == shm_base + obs_offset — same memory address
- 4096-env throughput: 4096 envs at 50Hz, measure step latency (target: <1ms batch step)
- Lock-step correctness: frame_seq increments exactly once per step, no missed/double actions
- Reset via command channel: FlatBuffer ResetRequest with SimulationContract, verify DR applies
- Latency benchmark: Shared memory step vs gRPC step round-trip comparison
- Image region: Camera frames written to shm image region, Python reads as numpy, pixels match
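The zero-copy proof from the test list can be sketched directly: a view built over the mapping must sit at shm_base + offset rather than in a private copy. The 4096-byte offset is illustrative.

```python
import numpy as np
from multiprocessing.shared_memory import SharedMemory

OBS_OFFSET = 4096
shm = SharedMemory(create=True, size=OBS_OFFSET + 16 * 4)
try:
    obs = np.ndarray((16,), dtype=np.float32, buffer=shm.buf, offset=OBS_OFFSET)
    shm_base = np.frombuffer(shm.buf, dtype=np.uint8).ctypes.data  # address of shm_base
    zero_copy = obs.ctypes.data == shm_base + OBS_OFFSET
    del obs                    # release the view before closing the map
finally:
    shm.close()
    shm.unlink()
```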