snarkVM Error Handling Overhaul (pre-proposal draft)

snarkVM Error Handling Overhaul

  • Status: Pre-proposal draft
  • Authors: Kai / Mitch
  • Date: 2026-03-04

1. Summary

snarkVM's error handling is inconsistent. The codebase uses a mix of anyhow throughout library crates, panic!-based halts caught with catch_unwind, string-flattened error chains, and incomplete error context in logs. This leads to:

  • Lost debugging information - error chains are flattened to strings before reaching logs, discarding the structured cause chain (#3147).
  • User-facing panics - Environment::halt panics the host process, and catch_unwind cannot always recover safely (leo#28992).
  • Inability to handle specific failure modes - anyhow::Error erases error types, preventing downstream tooling from matching on specific failures.
  • catch_unwind hacks - try_vm_runtime! conflates VM halts with implementation bugs, and AssertUnwindSafe wrappers suppress the type system's unwind-safety guarantees.

The vision: structured thiserror types for all library crates, clean source() chains, no user-reachable panics, and downstream applications owning error formatting.

2. Goals

  1. Clean error chains - Each error layer describes only its own context. Inner errors are accessible via source(), never duplicated in Display.

  2. No user-reachable panic!s - All anticipated runtime failures return Result. Any user-facing panic is treated as a VM bug.

  3. Structured thiserror types for all library crates - Concrete enums with exhaustive matching. anyhow is acceptable only in top-level applications and tests.

  4. Remove catch_unwind / try_vm_runtime! - Once halt is a Result, panic-catching infrastructure becomes unnecessary.

  5. Downstream-controlled formatting - Libraries expose source() chains; applications choose presentation (single-line, multi-line, JSON, etc.).

  6. Descriptive structured error data - Errors carry machine-readable context (instruction index, program ID, operand values) enabling rich tooling feedback and Leo source mapping.

3. Current State - Problems

3.1 Environment::halt panics the host

The Environment trait defines a halt method that panics unconditionally:

// console/network/environment/src/environment.rs:61
fn halt<S: Into<String>, T>(message: S) -> T {
    panic!("{}", message.into())
}

There are 262+ call sites across E::halt (~152 occurrences) and A::halt (~110 occurrences) throughout the console and circuit crates.

A convenience trait OrHalt (in console/network/environment/src/helpers/or_halt.rs) further propagates this pattern:

fn or_halt<E: Environment>(self) -> T {
    match self {
        Ok(result) => result,
        Err(error) => E::halt(error.to_string()),
    }
}

catch_unwind cannot always recover from these panics safely. A panic raised while a RefCell in the circuit's thread-local state is mutably borrowed can leave that state partially updated. The AssertUnwindSafe wrapper silences the compiler's unwind-safety check but does not repair the inconsistent state. This manifests as user-facing crashes (leo#28992).
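
The failure mode can be reproduced in isolation. The following is a minimal sketch, not snarkVM code: STATE, add_pair_or_halt, and run_and_inspect are hypothetical names, and the two vectors stand in for circuit state that is expected to stay in lockstep.

```rust
use std::cell::RefCell;
use std::panic::{self, AssertUnwindSafe};

thread_local! {
    // Hypothetical stand-in for circuit thread-local state: two vectors that
    // are expected to always have the same length.
    static STATE: RefCell<(Vec<u32>, Vec<u32>)> = RefCell::new((Vec::new(), Vec::new()));
}

/// Pushes to both vectors, but "halts" (panics) between the two pushes
/// when the second operand is zero.
fn add_pair_or_halt(a: u32, b: u32) {
    STATE.with(|state| {
        let mut state = state.borrow_mut();
        state.0.push(a);
        if b == 0 {
            panic!("halt: invalid operand");
        }
        state.1.push(b);
    });
}

/// Runs the operation under `catch_unwind`, then reports whether a panic was
/// caught and the lengths of the two vectors afterwards.
fn run_and_inspect() -> (bool, usize, usize) {
    let result = panic::catch_unwind(AssertUnwindSafe(|| add_pair_or_halt(1, 0)));
    let (left, right) = STATE.with(|state| {
        let state = state.borrow();
        (state.0.len(), state.1.len())
    });
    (result.is_err(), left, right)
}
```

The panic is caught and the borrow is released during unwinding, yet the first vector keeps the element pushed before the halt. Nothing in the catch_unwind result tells the caller that the state now violates its invariant.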

3.2 Error chain breakage

Two patterns break source() chains:

String interpolation flattening: Patterns like anyhow!("something failed: {err}") or bail!("something failed: {err}") flatten the inner error into a string, destroying the source() chain. The correct approach is .context("something failed").

#[error] format string interpolation of #[from] fields: Some thiserror types interpolate their #[from]/#[source] fields in the #[error("...")] format string:

use thiserror::Error;

#[derive(Debug, Error)]
enum MyError {
    #[error("operation failed: {0}")]  // Duplicates the inner error message
    Inner(#[from] InnerError),
}

This causes message duplication when walking the chain - Display includes the inner message, and so does source(). The std Error docs explicitly advise against this:

An error type with a child error should either [...] not mention the child error in its Display implementation.

The correct pattern is:

#[error("operation failed")]           // Describes only this layer
Inner(#[from] InnerError),
// or:
#[error(transparent)]                  // Delegates Display entirely
Inner(#[from] InnerError),
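
The duplication is easy to see once the chain is walked. The sketch below uses hand-written Display and Error impls instead of thiserror so that it compiles with std alone; Interpolating, Layered, and render_chain are hypothetical names chosen for illustration.

```rust
use std::error::Error;
use std::fmt;

/// Innermost error, shared by both wrappers below.
#[derive(Debug)]
struct InnerError;

impl fmt::Display for InnerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid token at col 5")
    }
}

impl Error for InnerError {}

/// Anti-pattern: `Display` repeats the message of the error returned by `source()`.
#[derive(Debug)]
struct Interpolating(InnerError);

impl fmt::Display for Interpolating {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "operation failed: {}", self.0)
    }
}

impl Error for Interpolating {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

/// Target pattern: `Display` describes only this layer.
#[derive(Debug)]
struct Layered(InnerError);

impl fmt::Display for Layered {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "operation failed")
    }
}

impl Error for Layered {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

/// Joins an error and every error reachable through `source()` with ": ".
fn render_chain(error: &dyn Error) -> String {
    let mut rendered = error.to_string();
    let mut source = error.source();
    while let Some(cause) = source {
        rendered.push_str(": ");
        rendered.push_str(&cause.to_string());
        source = cause.source();
    }
    rendered
}
```

Rendering Interpolating(InnerError) yields the inner message twice, while Layered(InnerError) yields it once.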

3.3 Lost error context in logs

The error chain is sometimes correctly constructed but not correctly rendered. For example:

// ledger/src/advance.rs:146
self.vm.add_next_block(block).with_context(|| "Failed to add block to VM")?;

This properly chains errors. However, downstream consumers (e.g. snarkOS CDN sync) log only "{err}", which renders only the top-level message. The full chain - which contains the actual failure reason - is discarded.

This is a consumer-side problem. The chain is correct; the rendering is not. snarkVM already provides utilities for chain rendering in utilities/src/errors.rs:

/// Converts an `anyhow::Error` into a single-line string.
pub fn flatten_error<E: Borrow<anyhow::Error>>(error: E) -> String { ... }

/// Displays an `anyhow::Error`'s main error and its error chain to stderr.
pub fn display_error<E: Borrow<anyhow::Error>>(error: E) { ... }

These are currently specialized to anyhow::Error. Generalizing them to &dyn std::error::Error would make them useful for the thiserror migration.
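
One possible shape for such a generalization is sketched below. It is an assumption about how the helper could look, not the existing snarkVM signature: chain, flatten_chain, and the Layer test type are hypothetical, and the inner message "invalid block hash" is invented for the example.

```rust
use std::error::Error;
use std::fmt;
use std::iter;

/// Iterates over an error and all of its sources, outermost first.
fn chain<'a>(error: &'a (dyn Error + 'static)) -> impl Iterator<Item = &'a (dyn Error + 'static)> {
    iter::successors(Some(error), |err| (*err).source())
}

/// Hypothetical generalization of `flatten_error`: renders the whole chain on one line.
fn flatten_chain(error: &(dyn Error + 'static)) -> String {
    chain(error).map(|err| err.to_string()).collect::<Vec<_>>().join(": ")
}

/// Minimal error type used only to exercise the helpers above.
#[derive(Debug)]
struct Layer {
    message: &'static str,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}
```

With a two-layer error, "{err}" prints only the outer message, which is the symptom described above, whereas flatten_chain prints both layers. Because the helper takes a trait object, it would work for anyhow errors (via their AsRef<dyn Error> impl) and for thiserror types alike.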

See #3147 and snarkOS#3795.

3.4 try_vm_runtime! macro

The try_vm_runtime! macro (utilities/src/vm_error.rs) catches halt-panics using catch_unwind:

macro_rules! try_vm_runtime {
    ($e:expr) => {{
        let previous_hook = panic::take_hook();
        panic::set_hook(Box::new(|err| { /* reformat as "VM safely halted" */ }));
        let result = panic::catch_unwind(panic::AssertUnwindSafe($e));
        panic::set_hook(previous_hook);
        result
    }};
}

Problems:

  • Conflates VM halts with bugs - a panic from halt("invalid operand") is indistinguishable from a panic caused by an index-out-of-bounds bug.
  • Global panic hook manipulation - the panic hook is process-global, so the take_hook / set_hook swap-and-restore sequence is racy. Concurrent VM executions can overwrite or restore each other's hooks.
  • AssertUnwindSafe - suppresses the compiler's unwind-safety analysis, masking potential state corruption.
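
The first problem can be illustrated with a small sketch. All names here (halt, divide_or_halt, buggy_lookup, panicked, divide, InvalidOperand) are hypothetical stand-ins, not snarkVM APIs.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for `E::halt`: signals an anticipated failure by panicking.
fn halt(message: &str) -> ! {
    panic!("{}", message)
}

/// An anticipated failure: the caller supplied a zero divisor.
fn divide_or_halt(a: u64, b: u64) -> u64 {
    if b == 0 {
        halt("invalid operand");
    }
    a / b
}

/// An implementation bug: an out-of-bounds index.
fn buggy_lookup(values: &[u64], index: usize) -> u64 {
    values[index]
}

/// Runs a closure the way a `catch_unwind` wrapper would and reports whether it panicked.
fn panicked<T>(f: impl FnOnce() -> T) -> bool {
    panic::catch_unwind(AssertUnwindSafe(f)).is_err()
}

/// The `Result`-based alternative: the anticipated failure has its own type.
#[derive(Debug, PartialEq)]
struct InvalidOperand;

fn divide(a: u64, b: u64) -> Result<u64, InvalidOperand> {
    if b == 0 {
        return Err(InvalidOperand);
    }
    Ok(a / b)
}
```

Both divide_or_halt(1, 0) and buggy_lookup(&[1, 2, 3], 7) surface as the same Err from catch_unwind, so the wrapper cannot report one as a safe halt and the other as a bug. With divide, the anticipated failure is a typed Err and any remaining panic can be treated as a bug.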

3.5 catch_unwind in circuit code

Beyond try_vm_runtime!, catch_unwind(AssertUnwindSafe(...)) appears in 18 files across the codebase, including:

  • circuit/program/src/data/literal/cast/mod.rs - 2 instances
  • circuit/types/field/src/div.rs, div_unchecked.rs, inverse.rs - 3+ instances
  • circuit/types/integers/src/neg.rs, lib.rs - 2+ instances
  • circuit/types/scalar/src/helpers/from_field.rs
  • Various test files

Most of these exist because circuit operations signal failure via panic (through E::halt) rather than returning Result. Once these operations return Result, the wrappers become unnecessary.

3.6 std::io::Error::other misuse

Some deserialization sites use std::io::Error::other(stringified_error), losing type information. The io_error and into_io_error helper functions in utilities/src/errors.rs encode this pattern. While io::Error boundaries are sometimes unavoidable (e.g. Read/Write trait impls), the error chain should be preserved where possible. See discussion in #3056.
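
As a rough illustration of the difference, the sketch below contrasts passing a stringified error to std::io::Error::other (stable since Rust 1.74) with passing the error value itself. InvalidLength, stringified, preserved, and recover are hypothetical names, not the helpers in utilities/src/errors.rs.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical deserialization error carrying structured data.
#[derive(Debug, PartialEq)]
struct InvalidLength {
    expected: usize,
    found: usize,
}

impl fmt::Display for InvalidLength {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "expected {} bytes, found {}", self.expected, self.found)
    }
}

impl Error for InvalidLength {}

/// Anti-pattern: only the rendered message survives.
fn stringified(error: InvalidLength) -> io::Error {
    io::Error::other(error.to_string())
}

/// Preserves the concrete error inside the `io::Error`.
fn preserved(error: InvalidLength) -> io::Error {
    io::Error::other(error)
}

/// Tries to recover the concrete error from an `io::Error`.
fn recover(error: &io::Error) -> Option<&InvalidLength> {
    error.get_ref().and_then(|inner| inner.downcast_ref::<InvalidLength>())
}
```

Both variants display the same message, but only the second lets a caller on the far side of a Read/Write boundary get the typed error back with get_ref and downcast_ref.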

4. Work Completed

What | PR/Issue | Status
--- | --- | ---
Return error with failing instruction index | #3081 | Merged
Result for constraint enforcement and assertions | #3082 | Merged
Clippy fix for instruction wrappers | #3085 | Merged
Isolate synthesizer-error crate | #3122 | Merged
CheckBlockError - concrete block-check errors | #3050 | Merged

The synthesizer-error crate (synthesizer/error/) now provides a structured error hierarchy using thiserror:

VmExecError / VmAuthError / VmDeployError
  └─ ProcessExecError / ProcessAuthError / ProcessDeployError
       └─ StackExecError / StackEvalError
            └─ IndexedInstructionError<InstructionError>
                 └─ InstructionEvalError / InstructionExecError
                      └─ EvalError / ExecError / AssertError

Each level carries Anyhow(#[from] anyhow::Error) variants as temporary escape hatches for the migration. These are explicitly marked with // NOTE: ... Remove these variants as we migrate errors to thiserror.

The AssertError type demonstrates the target pattern - structured data (operand values) in error variants:

#[error("'assert.eq' failed: '{lhs}' is not equal to '{rhs}' (should be equal)")]
Eq { lhs: String, rhs: String },
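
To show what the structured fields buy a consumer, here is a std-only sketch that mirrors the variant above with a hand-written Display impl. It is a simplified stand-in for the real synthesizer-error type, and failing_operands is a hypothetical tooling hook.

```rust
use std::error::Error;
use std::fmt;

/// Simplified stand-in for the `AssertError` variant shown above.
#[derive(Debug, PartialEq)]
enum AssertError {
    Eq { lhs: String, rhs: String },
}

impl fmt::Display for AssertError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AssertError::Eq { lhs, rhs } => {
                write!(f, "'assert.eq' failed: '{lhs}' is not equal to '{rhs}' (should be equal)")
            }
        }
    }
}

impl Error for AssertError {}

/// Hypothetical tooling hook: reads the operands directly from the error
/// instead of parsing them back out of the message.
fn failing_operands(error: &AssertError) -> (&str, &str) {
    match error {
        AssertError::Eq { lhs, rhs } => (lhs.as_str(), rhs.as_str()),
    }
}
```

A test framework or IDE can match on the variant and present the operands however it likes, while the Display output remains available for logs.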

PRs #3081 and #3082 together establish the migration pattern for replacing Environment::halt with Result:

  • #3081 built the error propagation infrastructure: domain-specific error types per subsystem, IndexedInstructionError<E> to capture which instruction failed and at what index, and updated function signatures from Result<T> (anyhow) to Result<T, ProcessExecError> etc.
  • #3082 converted the lowest-level circuit operations (E::enforce(), E::assert_eq(), E::assert_neq()) to return Result<(), ConstraintUnsatisfied> instead of panicking, and introduced AssertError, EvalError, ExecError with structured data. Unconverted boundaries use temporary .expect() calls to maintain existing behavior while the conversion progresses.

Together they demonstrate the bottom-up, subsystem-by-subsystem approach: convert the lowest-level operations first, introduce domain-specific error types, use .expect() at unconverted boundaries, and propagate upward through the error hierarchy.

5. Work In Progress

What | PR/Issue | Status
--- | --- | ---
Remove source interpolation from #[error] format strings | #3172 | Open (WIP)
Improved panic handling infrastructure | #2927 | Open
build.rs error chain checking (snarkOS) | snarkOS#4127 | Draft
Track errors in snarkOS | snarkOS#3874 | Open

6. Open Issues

Issue | Repo | Summary
--- | --- | ---
#2941 | snarkVM | Remove panic potentials in validator code paths
#3055 | snarkVM | On halt, return error with failing instruction index
#3056 | snarkVM | Replace anyhow in lib crates with thiserror
#2787 | snarkVM | Return descriptive error on failure to execute/evaluate
#3147 | snarkVM | Improve logged errors (chain context lost)
leo#28992 | leo | Panic on assert_eq on arrays (VM halt panics host)
leo#29035 | leo | On VM halt, error should provide Leo source context
leo#29036 | leo | Preserve source mapping during codegen for tooling
leo#27858 | leo | Testing framework panics instead of errors
snarkOS#3795 | snarkOS | Logs uninformative for node operators

7. Recommended Approach (Phased)

Phase 1: Fix error formatting (in progress)

Complete #3172 - remove source-error interpolation from #[error] format strings across the codebase.

This establishes the convention: each error layer describes only its own context. Inner errors are accessed via source(), not duplicated in Display.

Before:

#[error("failed to parse '{0}'")]
Parse(#[from] ParseError),    // Display: "failed to parse 'invalid token at col 5'"
                               // source(): "invalid token at col 5"  (duplicated)

After:

#[error("failed to parse")]
Parse(#[from] ParseError),    // Display: "failed to parse"
                               // source(): "invalid token at col 5"  (no duplication)

Phase 2: Replace Environment::halt with Result (incremental)

Continue the bottom-up approach established in #3081 and #3082. Convert one subsystem at a time:

  1. Identify the next subsystem - Pick a group of related E::halt / A::halt call sites (e.g. field arithmetic, integer operations, group operations, string operations, Merkle tree verification).
  2. Introduce domain-specific error types - Define thiserror enums for the subsystem's failure modes. These should carry structured data where useful (operand values, indices, etc.), not just string messages.
  3. Convert lowest-level operations first - Change the leaf functions from E::halt(msg) to return Err(SpecificError::Variant { ... }) and update their return types.
  4. Use .expect() at unconverted boundaries - Where a converted function is called by unconverted code that still expects infallible results, use .expect("justification") temporarily. This preserves existing behavior while making the converted boundary explicit and greppable.
  5. Propagate upward - As more subsystems are converted, the .expect() boundaries move upward through the call stack until they reach the public API surface (e.g. VM::execute), where the error is returned to the caller.
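
The steps above can be sketched on a toy integer subsystem. Everything here is hypothetical (IntegerError, add, increment_unconverted, sum); it only illustrates the shape of a converted leaf, a temporary .expect() boundary, and a converted caller.

```rust
use std::error::Error;
use std::fmt;

/// Step 2: a domain-specific error carrying the operands that failed.
#[derive(Debug, PartialEq)]
enum IntegerError {
    Overflow { lhs: u8, rhs: u8 },
}

impl fmt::Display for IntegerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IntegerError::Overflow { lhs, rhs } => write!(f, "'{lhs} + {rhs}' overflows u8"),
        }
    }
}

impl Error for IntegerError {}

/// Step 3: the leaf operation returns `Err` instead of halting.
fn add(lhs: u8, rhs: u8) -> Result<u8, IntegerError> {
    lhs.checked_add(rhs).ok_or(IntegerError::Overflow { lhs, rhs })
}

/// Step 4: an unconverted caller keeps its infallible signature for now.
/// The `.expect()` marks the boundary and is easy to grep for.
fn increment_unconverted(value: u8) -> u8 {
    add(value, 1).expect("unconverted boundary: callers guarantee value < u8::MAX")
}

/// Step 5: a converted caller propagates the error upward.
fn sum(values: &[u8]) -> Result<u8, IntegerError> {
    values.iter().try_fold(0u8, |acc, &value| add(acc, value))
}
```

As more callers move from the increment_unconverted shape to the sum shape, the remaining .expect() calls mark exactly where the conversion frontier currently sits.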

Prioritization: Focus on subsystems where panics cause the most downstream pain first - instruction execution (done), constraint enforcement (done), then field/integer/group arithmetic, string operations, and record/plaintext serialization.

For impls of standard operator traits (e.g. std::ops::Div, std::ops::Add), whose signatures cannot return Result, consider introducing fallible CheckedDiv or CheckedAdd alternatives. Keep a direct panic! only where it represents a genuine logic error or is reachable only through already-validated code paths.
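
A minimal sketch of that idea follows, using a toy prime field of modulus 7 and a brute-force inverse. The CheckedDiv trait, Field type, and DivisionByZero error are hypothetical and do not reflect snarkVM's actual field implementation.

```rust
use std::ops::Div;

/// Toy prime modulus, chosen only to keep the example small.
const MODULUS: u64 = 7;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Field(u64);

#[derive(Debug, PartialEq)]
struct DivisionByZero;

/// Hypothetical fallible counterpart to `std::ops::Div`.
trait CheckedDiv<Rhs = Self> {
    type Output;
    type Error;
    fn checked_div(self, rhs: Rhs) -> Result<Self::Output, Self::Error>;
}

impl CheckedDiv for Field {
    type Output = Field;
    type Error = DivisionByZero;

    fn checked_div(self, rhs: Field) -> Result<Field, DivisionByZero> {
        let divisor = rhs.0 % MODULUS;
        // Brute-force search for the multiplicative inverse; zero has none.
        let inverse = (1..MODULUS).find(|c| divisor * c % MODULUS == 1).ok_or(DivisionByZero)?;
        Ok(Field(self.0 % MODULUS * inverse % MODULUS))
    }
}

/// The operator impl remains for call sites that have already validated the
/// divisor; it delegates to the fallible version.
impl Div for Field {
    type Output = Field;

    fn div(self, rhs: Field) -> Field {
        self.checked_div(rhs).expect("division by zero: callers must validate the divisor")
    }
}
```

Code paths that accept user-controlled operands would call checked_div and propagate the error, leaving the / operator for paths where a zero divisor would indicate a logic bug.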

Phase 3: Remove catch_unwind / try_vm_runtime!

As subsystems are converted to return Result, the catch_unwind wrappers around those subsystems become unnecessary. Remove them incrementally:

  1. Remove catch_unwind(AssertUnwindSafe(...)) from converted call sites.
  2. Once all halt sites in production paths return Result, remove the try_vm_runtime! macro from utilities/src/vm_error.rs.
  3. Downstream (leo, snarkOS) can remove their catch_unwind wrappers as the corresponding VM APIs are converted.

Validation: Ensure no remaining code path relies on panic-based error signaling for control flow.

Phase 4: Replace remaining anyhow with thiserror

Crate-by-crate migration, prioritized by downstream impact:

  1. synthesizer - already started with synthesizer-error. Remove temporary Anyhow(#[from] anyhow::Error) variants as concrete types are introduced.
  2. ledger - block validation, storage, transaction processing.
  3. console - parsing, serialization, type conversion.
  4. algorithms - cryptographic operations.

Each crate follows the same pattern:

  1. Define a thiserror enum for the crate's error domain.
  2. Replace anyhow::Result with Result<T, CrateError>.
  3. Use #[from] for conversions from sub-crate errors.
  4. Remove anyhow from Cargo.toml once fully migrated.
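
The per-crate pattern can be sketched as follows. The manual From, Display, and Error impls approximate what thiserror's #[from] and #[error] attributes would generate, written out so the example compiles with std alone. ParseHeightError, LedgerError, parse_height, and check_height are hypothetical names.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type owned by a lower-level crate.
#[derive(Debug)]
struct ParseHeightError(ParseIntError);

impl fmt::Display for ParseHeightError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "invalid block height")
    }
}

impl Error for ParseHeightError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.0)
    }
}

/// Hypothetical crate-level error enum (step 1).
#[derive(Debug)]
enum LedgerError {
    Parse(ParseHeightError),
    HeightMismatch { expected: u32, found: u32 },
}

impl fmt::Display for LedgerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LedgerError::Parse(_) => write!(f, "failed to parse block"),
            LedgerError::HeightMismatch { expected, found } => {
                write!(f, "expected block height {expected}, found {found}")
            }
        }
    }
}

impl Error for LedgerError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            LedgerError::Parse(error) => Some(error),
            LedgerError::HeightMismatch { .. } => None,
        }
    }
}

/// Step 3: conversion from the sub-crate error, which enables `?`.
impl From<ParseHeightError> for LedgerError {
    fn from(error: ParseHeightError) -> Self {
        LedgerError::Parse(error)
    }
}

fn parse_height(input: &str) -> Result<u32, ParseHeightError> {
    input.trim().parse::<u32>().map_err(ParseHeightError)
}

/// Step 2: a typed result replaces `anyhow::Result`.
fn check_height(input: &str, expected: u32) -> Result<u32, LedgerError> {
    let found = parse_height(input)?;
    if found != expected {
        return Err(LedgerError::HeightMismatch { expected, found });
    }
    Ok(found)
}
```

Callers can now match on LedgerError::HeightMismatch to handle that case specifically, and the parse failure stays reachable through source() without appearing in the outer message.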

Consider #[non_exhaustive] selectively for error types in public APIs to allow adding variants without breaking downstream.

Phase 5: Structured error data for tooling

With typed errors in place, enrich them with machine-readable context:

  • Instruction index - already implemented in IndexedInstructionError.
  • Program ID - which program was being executed.
  • Function name - which function within the program.
  • Operand values - already demonstrated in AssertError.
  • Source mapping - enable Leo to map VM errors back to source locations (leo#29036).

This phase transforms error handling from a debugging concern into a tooling enabler - IDEs, testing frameworks, and block explorers can provide precise, actionable feedback.
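
As a rough sketch of how tooling might consume such context, the example below attaches a program ID, function name, and instruction index to an error and resolves the index against a source map. InstructionFailure, SourceMap, and describe are hypothetical, as are the sample program and function names; the actual Leo source-mapping format is not specified here.

```rust
use std::collections::HashMap;

/// Hypothetical machine-readable context attached to an execution failure.
#[derive(Debug, Clone, PartialEq)]
struct InstructionFailure {
    program_id: String,
    function: String,
    instruction_index: usize,
}

/// Hypothetical source map emitted by a compiler: instruction index -> source line.
type SourceMap = HashMap<usize, u32>;

/// Renders the failure, adding a source line when the map contains one.
fn describe(failure: &InstructionFailure, source_map: &SourceMap) -> String {
    let location = format!(
        "{}/{} failed at instruction {}",
        failure.program_id, failure.function, failure.instruction_index
    );
    match source_map.get(&failure.instruction_index) {
        Some(line) => format!("{location} (source line {line})"),
        None => location,
    }
}
```

Because the fields are data and not text embedded in a message, the same error value can feed an IDE diagnostic, a test report, or a block explorer without any string parsing.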

8. Conventions (Going Forward)

Library crates

  • All error types use thiserror. No anyhow in public APIs.
  • #[error("...")] describes only the current layer. Never interpolate #[source] or #[from] fields.
  • Use #[source] or #[from] to build proper source() chains.
  • panic! is reserved for genuine logic bugs (invariant violations that indicate a programming error, not a runtime condition).

Application crates and tests

  • anyhow is acceptable for top-level error aggregation (e.g. bin crates).
  • Tests may use anyhow::Result for convenience, or to format error chains more easily when asserting on expected output.

Error formatting

  • Libraries expose clean source() chains.
  • Applications (snarkOS, leo tooling) choose the presentation - anyhow makes this easy:
    • top-level / outermost error: {}
    • whole error chain: {:#}
    • multi-line error chain with backtrace: {:?}
  • The existing flatten_error and display_error utilities in utilities/src/errors.rs should be generalized from anyhow::Error to &dyn std::error::Error as the migration progresses.

Avoid generic type parameters on error types

Error types should not carry generic parameters like N: Network. The thiserror derive macro requires all #[source] and #[from] fields to implement std::error::Error, which for a type like MyError<N> can create a recursive trait bound if the error contains itself.

This was encountered with CheckBlockError<N>, where #[source] on Box<CheckBlockError<N>> in the InvalidPrefix variant triggers the recursive bound.

Instead: Use concrete types for error data. If an error needs to carry data that varies by network (e.g. a block hash), use String or a type-erased representation rather than N::BlockHash. Keep the N: Network generic on the functions that produce errors, but erase it before constructing the error value.
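
A minimal sketch of this erasure, assuming a simplified Network trait with a single associated type. ToyNetwork, BlockError, and check_block_hash are hypothetical and far smaller than the real snarkVM definitions.

```rust
use std::error::Error;
use std::fmt;

/// Simplified stand-in for a network trait with an associated hash type.
trait Network {
    type BlockHash: fmt::Display + PartialEq;
}

/// Hypothetical network whose block hash is just a `u64`.
struct ToyNetwork;

impl Network for ToyNetwork {
    type BlockHash = u64;
}

/// Concrete error type: no `N: Network` parameter, hashes stored as strings.
#[derive(Debug, PartialEq)]
enum BlockError {
    HashMismatch { expected: String, found: String },
}

impl fmt::Display for BlockError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BlockError::HashMismatch { expected, found } => {
                write!(f, "expected block hash '{expected}', found '{found}'")
            }
        }
    }
}

impl Error for BlockError {}

/// The function stays generic over the network; the generic is erased
/// before the error value is constructed.
fn check_block_hash<N: Network>(expected: &N::BlockHash, found: &N::BlockHash) -> Result<(), BlockError> {
    if expected == found {
        return Ok(());
    }
    Err(BlockError::HashMismatch { expected: expected.to_string(), found: found.to_string() })
}
```

The trade-off is that consumers receive the hash as text and not as a typed value, which is usually acceptable for diagnostics and keeps the error type free of trait bounds.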

Transition pattern

During migration, Anyhow(#[from] anyhow::Error) catch-all variants serve as escape hatches (as seen in synthesizer-error). These are explicitly temporary and should be tracked for removal.

9. Risks and Considerations

Breaking changes

Typed results in public APIs require semver-aware coordination. Each phase should be validated with draft PRs against the downstream consumers named in this document (leo and snarkOS).

Scope

Full migration is large. The phased approach limits blast radius:

  • Phase 1 is a low-risk formatting convention change.
  • Phase 2 is the largest effort, but its incremental nature means each subsystem conversion is a self-contained, reviewable PR.
  • Phases 3-5 follow naturally as Phase 2 progresses.

Consensus safety

All changes must be backwards compatible. Error handling changes should not alter consensus behavior (same inputs producing same outputs). The risk is low - error types affect failure paths, not success paths - but any change to which operations return Err vs panic! must be validated against the existing test suite and network behavior.

#[non_exhaustive] trade-offs

  • Pro: Allows adding error variants without breaking downstream.
  • Con: Prevents exhaustive matching, forcing _ => catch-all arms.
  • Recommendation: Use selectively for error types at crate boundaries that are likely to grow. Internal error types can remain exhaustive.

Coexistence period

During migration, the codebase will contain both anyhow and thiserror errors. The Anyhow catch-all variant pattern (as in synthesizer-error) is a practical bridge. The key discipline is ensuring new code uses thiserror and that Anyhow variants are tracked and eventually removed.
