Error Flows
What happens when things go wrong
1. Network Partition
Tags: network, dag-merge
Two groups of nodes lose connectivity. Each partition continues applying local ops independently. On reconnection the system automatically detects divergence and converges.
Trigger
Network connectivity lost between two subsets of group members. Each subset can still communicate internally.
During Partition
Each partition continues signing and applying SignedGroupOps locally. Because the two histories diverged from a shared ancestor, ops on each side carry different parent hashes.
Detection
On reconnection, periodic GroupStateHeartbeat messages reveal divergent dag_heads and/or differing root_hash. SyncManager triggers a catchup stream.
Recovery
Missing ops are exchanged via GroupDeltaRequest/GroupDeltaResponse. A merge Noop op with multiple parents reconciles the DAG heads. All nodes converge to identical state.
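The merge step can be sketched as follows. This is a minimal illustration, not the real implementation: the `Op` struct, `merge_heads` function, and `u64` hashes are stand-ins (real ops carry 32-byte content hashes and signatures). The key property shown is determinism: every node sorts the heads the same way, so every node builds the identical merge op.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a SignedGroupOp; real ops use 32-byte hashes.
#[derive(Debug, Clone)]
struct Op {
    hash: u64,
    parents: Vec<u64>, // parent_op_hashes
    is_noop: bool,
}

// If more than one DAG head exists after reconnection, emit a Noop op
// whose parents are all current heads. Applying it leaves group state
// untouched but collapses the heads back to a single tip.
fn merge_heads(heads: &[u64]) -> Option<Op> {
    if heads.len() < 2 {
        return None; // single head: nothing to reconcile
    }
    let mut sorted = heads.to_vec();
    sorted.sort(); // deterministic ordering: every node builds the same op
    let mut h = DefaultHasher::new();
    sorted.hash(&mut h);
    Some(Op { hash: h.finish(), parents: sorted, is_noop: true })
}
```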
Config & Metrics
- heartbeat_interval — 30s default, controls detection latency
- max_dag_heads — capped at 64, triggers merge when exceeded
- sync_catchup_ops — counter metric for ops transferred during catchup
2. Invalid Signature
Tags: crypto, gossip
A remote node receives a SignedGroupOpV1 whose Ed25519 signature does not verify. The message is silently discarded with no state change on the receiver.
Trigger
Receive a SignedGroupOpV1 via gossipsub where verify_signature() returns Err. Could be tampered payload, wrong key, or serialization corruption.
What Happens
Op is rejected immediately. Not applied to GroupStore, not appended to DagStore, not forwarded to other peers. No state mutation on the receiving node whatsoever.
Recovery
No recovery needed — the system is already in a correct state. gossipsub's peer scoring mechanism may lower the sender's reputation if this happens repeatedly.
Design Rationale
Silent rejection is deliberate: no error response leaks information to potential attackers. The signature check is the first gate — all other validation (nonce, state_hash, auth) happens after.
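The verify-first, discard-silently shape can be sketched as below. The `verify_signature` body here is a toy placeholder standing in for real Ed25519 verification, and `on_gossip_op` is a hypothetical name for the receive path; what matters is the control flow: a failed check returns before any store is touched and sends no reply.

```rust
// Toy placeholder for Ed25519 verification: the "signature" must equal
// the payload reversed. The real check verifies against a public key.
fn verify_signature(payload: &[u8], sig: &[u8]) -> Result<(), ()> {
    let expected: Vec<u8> = payload.iter().rev().copied().collect();
    if sig == expected.as_slice() { Ok(()) } else { Err(()) }
}

// Receive path: verify first; on failure return without touching any
// store, forwarding to peers, or sending an error response.
fn on_gossip_op(payload: &[u8], sig: &[u8], store: &mut Vec<Vec<u8>>) -> bool {
    if verify_signature(payload, sig).is_err() {
        return false; // silently discarded: no state mutation whatsoever
    }
    store.push(payload.to_vec()); // only verified ops reach the store
    true
}
```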
3. Stale state_hash (Optimistic Lock)
Tags: state, concurrency
Two admins sign operations against the same state_hash concurrently. The first applied successfully mutates the hash, causing the second to fail validation. This prevents silent divergence.
Trigger
Two admins concurrently read the same group state, both sign ops with state_hash = current_hash, and submit them roughly simultaneously.
First Op Wins
The first op to reach apply_local_signed_group_op passes validation, mutates the group state, and updates state_hash to a new value.
Second Op Rejected
The second op's state_hash no longer matches. apply_local_signed_group_op returns an error. The op is never applied or gossiped.
Recovery
The rejected admin re-reads current state (including the other admin's changes), re-signs with the updated state_hash, and resubmits. If state_hash is set to all-zeros, the check is explicitly bypassed.
Config & Metrics
- state_hash_conflicts — counter metric for rejected ops due to stale state_hash
- [u8; 32]::default() — all-zeros hash signals explicit bypass (used for Noop, bootstrap)
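The optimistic-lock check, including the all-zeros bypass, can be sketched as follows. `check_state_hash`, `Group`, and `apply` are illustrative names, not the real API; the real path runs inside apply_local_signed_group_op alongside nonce and auth validation.

```rust
// Optimistic lock: the op must carry the state_hash it was signed
// against, unless it opts out with the all-zeros sentinel
// ([u8; 32]::default(), used for Noop and bootstrap ops).
fn check_state_hash(op_hash: [u8; 32], current: [u8; 32]) -> Result<(), &'static str> {
    if op_hash == [0u8; 32] {
        return Ok(()); // explicit bypass
    }
    if op_hash == current {
        Ok(())
    } else {
        Err("stale state_hash: re-read state, re-sign, resubmit")
    }
}

struct Group { state_hash: [u8; 32] }

// A successful apply mutates state_hash, invalidating any other op
// that was signed against the old value.
fn apply(group: &mut Group, op_hash: [u8; 32], new_hash: [u8; 32]) -> Result<(), &'static str> {
    check_state_hash(op_hash, group.state_hash)?;
    group.state_hash = new_hash;
    Ok(())
}
```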
4. Out-of-Order Op Delivery
Tags: dag, gossip
A node receives a SignedGroupOp whose parent_op_hashes reference operations not yet seen locally. The op enters a pending queue until parents arrive.
Trigger
gossipsub delivers ops non-deterministically. A child op arrives before its parents. The node cannot apply it because the parent hashes are not in the local DagStore.
Pending Queue
The op enters DagStore's pending queue, indexed by its missing parent hashes. The op is held but not discarded.
Parent Arrival
When missing parents arrive (via gossip or catchup stream), the DagStore checks the pending queue for any ops now unblocked. These are applied in topological order.
Abuse Prevention
parent_op_hashes is capped at 256 entries per op. Ops exceeding this are rejected. Stale pending ops are cleaned up periodically.
Config & Metrics
- MAX_PARENT_OP_HASHES — 256, hard cap per SignedGroupOp
- pending_ops_count — gauge metric for ops awaiting parents
- pending_ops_resolved — counter for ops successfully unblocked
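The pending-queue mechanics can be sketched in a few lines. This is a simplified model under stated assumptions: `u64` hashes instead of 32-byte ones, a toy `DagStore`, and no signature or state checks; it shows the indexing by missing parent, the 256-entry cap, and the cascade that unblocks children when a parent lands.

```rust
use std::collections::{HashMap, HashSet};

const MAX_PARENT_OP_HASHES: usize = 256; // abuse-prevention hard cap

#[derive(Clone)]
struct Op { hash: u64, parents: Vec<u64> }

#[derive(Default)]
struct DagStore {
    applied: HashSet<u64>,
    pending: HashMap<u64, Vec<Op>>, // keyed by the parent hash awaited
}

impl DagStore {
    // Returns the hashes applied as a result of receiving this op.
    fn receive(&mut self, op: Op) -> Vec<u64> {
        if op.parents.len() > MAX_PARENT_OP_HASHES {
            return Vec::new(); // oversized parent list: rejected outright
        }
        match op.parents.iter().find(|p| !self.applied.contains(*p)) {
            Some(&missing) => {
                // park the op under the first missing parent; it is
                // re-checked (and may re-park) when that parent arrives
                self.pending.entry(missing).or_default().push(op);
                Vec::new()
            }
            None => {
                let mut done = vec![op.hash];
                self.applied.insert(op.hash);
                // drain any children now unblocked, in topological order
                if let Some(children) = self.pending.remove(&op.hash) {
                    for child in children {
                        done.extend(self.receive(child));
                    }
                }
                done
            }
        }
    }
}
```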
5. WASM Out-of-Memory
Tags: wasm, runtime
The WASM runtime enforces VMLimits (max memory pages, stack size). If a WASM module exceeds these limits, execution is trapped and no state delta is produced.
Trigger
WASM module calls memory.grow() beyond VMLimits.max_memory_pages, or stack depth exceeds max_stack_size, or fuel runs out.
Trap & Catch
Wasmer's runtime traps with a panic. VMLogic catches the panic at the host boundary and converts it into a structured Outcome with an error variant.
No Side Effects
No state delta is produced. No changeset is committed to storage. No gossip message is broadcast. The execution is reported as failed in the Outcome returned to the caller.
Recovery
Caller receives the error and can retry with smaller input, or the WASM application can be upgraded to use less memory. Node state is completely unaffected.
Config & Metrics
- VMLimits.max_memory_pages — configurable per-context memory ceiling
- VMLimits.max_stack_size — stack depth limit
- wasm_oom_traps — counter metric for OOM trap events
- wasm_execution_failures — counter for all execution failures
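The limit-enforcement shape can be sketched as below. This is not Wasmer's actual API: `Memory`, `grow`, and the `Outcome` enum here are toy stand-ins, and a real `memory.grow` signals failure to the guest rather than panicking. The point illustrated is the invariant: a failed grow changes nothing, mirroring "no delta, no changeset, no gossip".

```rust
// Sketch of a VMLimits-style ceiling, assuming 64 KiB wasm pages.
struct VMLimits { max_memory_pages: u32 }

struct Memory { pages: u32 }

#[derive(Debug, PartialEq)]
enum Outcome {
    Ok(u32),             // old page count, as memory.grow reports
    Trap(&'static str),  // structured error surfaced to the caller
}

// Succeeds and returns the previous page count, or traps without
// mutating memory when the ceiling would be exceeded.
fn grow(mem: &mut Memory, delta: u32, limits: &VMLimits) -> Outcome {
    match mem.pages.checked_add(delta) {
        Some(new) if new <= limits.max_memory_pages => {
            let old = mem.pages;
            mem.pages = new;
            Outcome::Ok(old)
        }
        _ => Outcome::Trap("memory limit exceeded"),
    }
}
```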
6. Missing DAG Parents (State Delta)
Tags: dag, sync
A node receives a StateDelta whose parent_ids reference deltas not yet in the local DeltaStore. The delta enters a pending queue until parents are fetched.
Trigger
A StateDelta arrives via gossip with parent_ids referencing deltas the node has not yet received or applied.
Pending Queue
Delta enters DeltaStore pending, indexed by missing parent IDs. The delta is stored but not applied.
Fetch Missing
NodeManager sends a DeltaRequest stream to peers likely to have the missing parents. Peers respond with the missing delta payloads.
Resolution
When all parents arrive, pending deltas are drained in topological order. Stale pending deltas that are never resolved are cleaned up periodically.
Config & Metrics
- pending_deltas_count — gauge for deltas awaiting parents
- delta_request_sent — counter for DeltaRequest messages emitted
- stale_pending_cleanup_interval — periodic cleanup timer
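The fetch-missing step can be sketched as follows. The `StateDelta` and `DeltaRequest` shapes and the `missing_parents` helper are illustrative assumptions, not the wire types: the sketch shows only how a delta's parent_ids are diffed against the local store to decide between immediate apply and a peer request.

```rust
use std::collections::HashSet;

// Hypothetical shapes; real deltas carry content-addressed IDs.
struct StateDelta { parent_ids: Vec<u64> }

#[derive(Debug, PartialEq)]
struct DeltaRequest { missing: Vec<u64> }

// Diff a delta's parents against the locally known delta IDs.
// None means all parents are present and the delta applies now;
// Some(req) means it goes to the pending queue and req is streamed
// to peers.
fn missing_parents(delta: &StateDelta, local: &HashSet<u64>) -> Option<DeltaRequest> {
    let mut missing: Vec<u64> = delta
        .parent_ids
        .iter()
        .filter(|id| !local.contains(id))
        .copied()
        .collect();
    if missing.is_empty() {
        return None;
    }
    missing.sort();
    missing.dedup(); // request each absent parent once
    Some(DeltaRequest { missing })
}
```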
7. Snapshot Sync Failure
Tags: sync, stream
During a snapshot transfer the connection drops. The DeltaBuffer may have accumulated deltas during the transfer. On retry, the system determines whether partial state is reusable or a fresh snapshot is needed.
Trigger
TCP/QUIC connection drops during a SnapshotStreamRequest transfer. The joining node has received partial snapshot data.
DeltaBuffer
While the snapshot was transferring, new deltas continued arriving via gossip. These are held in a DeltaBuffer with finite capacity. If the buffer fills, oldest deltas are dropped (sync_buffer_drops metric).
Retry Handshake
On reconnection, a new handshake compares the partial state's root_hash against the source's current state. If compatible, the transfer can resume from where it left off.
Recovery
If partial state is usable: apply buffered deltas and continue. If not: discard partial state, request a fresh snapshot from scratch. Invariant I6: buffer has finite capacity, drops oldest if full.
Config & Metrics
- sync_buffer_capacity — max deltas held during snapshot transfer
- sync_buffer_drops — counter for deltas dropped due to buffer overflow
- snapshot_retry_count — counter for snapshot transfer retries
- snapshot_transfer_bytes — histogram of snapshot sizes
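Invariant I6 (finite capacity, drop oldest when full) can be sketched with a bounded ring buffer. The `DeltaBuffer` type and method names below are illustrative, not the real API; `drops` models what feeds the sync_buffer_drops counter.

```rust
use std::collections::VecDeque;

// Bounded buffer for deltas that arrive via gossip while a snapshot
// is still transferring. When full, the oldest delta is dropped and
// counted (surfaced as the sync_buffer_drops metric).
struct DeltaBuffer<T> {
    buf: VecDeque<T>,
    capacity: usize,
    drops: u64,
}

impl<T> DeltaBuffer<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: VecDeque::with_capacity(capacity), capacity, drops: 0 }
    }

    fn push(&mut self, delta: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front(); // drop oldest: invariant I6
            self.drops += 1;
        }
        self.buf.push_back(delta);
    }

    // Drained and applied once the snapshot (or resumed transfer) lands.
    fn drain(&mut self) -> Vec<T> {
        self.buf.drain(..).collect()
    }
}
```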
8. Cascade on Member Removal
Tags: cascade, governance
When a member is removed from a group, all their context identities and join tracking across every context in that group are cleaned up in one deterministic cascade operation.
Trigger
A MemberRemoved GroupOp is applied to the GroupStore. The member is removed from the group's member list.
Index Scan
GroupStore scans the GroupMemberContext index to find all contexts the removed member had joined via this group. This index is maintained automatically when members join contexts within the group.
Cascade Removal
For each found context: the member's context identity is removed and their join tracking row is deleted. If the member was in 100 contexts, all 100 are cleaned up atomically within the single op apply.
Convergence Guarantee
The cascade is fully deterministic — the same MemberRemoved op on any node produces identical deletions. This is verified by the convergence test suite (21 two-node tests).
Config & Metrics
- cascade_contexts_removed — counter for context memberships cleaned up per MemberRemoved op
- cascade_duration_ms — histogram for time spent in cascade cleanup
- No upper limit on contexts per member — cascade is O(n) in number of contexts the member joined
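The cascade can be sketched as a single deterministic pass over the index. The `GroupStore` fields and `remove_member` function below are simplified stand-ins (string IDs, no join-tracking rows): the sketch shows the index lookup, the sorted iteration that makes deletion order identical on every node, and the count that feeds cascade_contexts_removed.

```rust
use std::collections::{HashMap, HashSet};

// Simplified index shapes: member -> contexts joined via this group
// (the GroupMemberContext index), and per-context identity sets.
struct GroupStore {
    member_contexts: HashMap<String, HashSet<String>>,
    context_identities: HashMap<String, HashSet<String>>,
}

// Applying MemberRemoved: remove the index row, then delete the
// member's identity in every context it listed, in sorted order so
// the cascade is byte-for-byte identical on every node.
fn remove_member(store: &mut GroupStore, member: &str) -> usize {
    let contexts = store.member_contexts.remove(member).unwrap_or_default();
    let mut ordered: Vec<_> = contexts.into_iter().collect();
    ordered.sort(); // deterministic deletion order across nodes
    let mut removed = 0;
    for ctx in ordered {
        if let Some(ids) = store.context_identities.get_mut(&ctx) {
            if ids.remove(member) {
                removed += 1; // feeds cascade_contexts_removed
            }
        }
    }
    removed
}
```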