Error Flows
What happens when things go wrong
1. Network Partition
Tags: network, dag-merge
Two groups of nodes lose connectivity. Each partition continues applying local ops independently. On reconnection the system automatically detects divergence and converges.
Trigger
Network connectivity lost between two subsets of group members. Each subset can still communicate internally.
During Partition
Each partition continues signing and applying SignedGroupOps locally. Because the two histories diverged from a shared ancestor, ops on each side carry different parent hashes.
Detection
On reconnection, periodic GroupStateHeartbeat messages reveal divergent dag_heads and/or differing root_hash. SyncManager triggers a catchup stream.
Recovery
Missing ops are exchanged via GroupDeltaRequest/GroupDeltaResponse. A merge Noop op with multiple parents reconciles the DAG heads. All nodes converge to identical state.
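The merge step can be sketched as follows. This is a minimal illustration, not the real implementation: the `Op` struct, `merge_heads` function, and `u64` hashes are stand-ins (real ops carry 32-byte content hashes and signatures). The key property shown is determinism: every node sorts the heads the same way, so every node builds the identical merge op.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for a SignedGroupOp; real ops use 32-byte hashes.
#[derive(Debug, Clone)]
struct Op {
    hash: u64,
    parents: Vec<u64>, // parent_op_hashes
    is_noop: bool,
}

// If more than one DAG head exists after reconnection, emit a Noop op
// whose parents are all current heads. Applying it leaves group state
// untouched but collapses the heads back to a single tip.
fn merge_heads(heads: &[u64]) -> Option<Op> {
    if heads.len() < 2 {
        return None; // single head: nothing to reconcile
    }
    let mut sorted = heads.to_vec();
    sorted.sort(); // deterministic ordering: every node builds the same op
    let mut h = DefaultHasher::new();
    sorted.hash(&mut h);
    Some(Op { hash: h.finish(), parents: sorted, is_noop: true })
}
```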
Config & Metrics
- heartbeat_interval — 30s default, controls detection latency
- max_dag_heads — capped at 64, triggers merge when exceeded
- sync_catchup_ops — counter metric for ops transferred during catchup
2. Invalid Signature
Tags: crypto, gossip
A remote node receives a SignedGroupOpV1 whose Ed25519 signature does not verify. The message is silently discarded with no state change on the receiver.
Trigger
Receive a SignedGroupOpV1 via gossipsub where verify_signature() returns Err. Could be tampered payload, wrong key, or serialization corruption.
What Happens
Op is rejected immediately. Not applied to GroupStore, not appended to DagStore, not forwarded to other peers. No state mutation on the receiving node whatsoever.
Recovery
No recovery needed — the system is already in a correct state. gossipsub's peer scoring mechanism may lower the sender's reputation if this happens repeatedly.
Design Rationale
Silent rejection is deliberate: no error response leaks information to potential attackers. The signature check is the first gate — all other validation (nonce, state_hash, auth) happens after.
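The verify-first, discard-silently shape can be sketched as below. The `verify_signature` body here is a toy placeholder standing in for real Ed25519 verification, and `on_gossip_op` is a hypothetical name for the receive path; what matters is the control flow: a failed check returns before any store is touched and sends no reply.

```rust
// Toy placeholder for Ed25519 verification: the "signature" must equal
// the payload reversed. The real check verifies against a public key.
fn verify_signature(payload: &[u8], sig: &[u8]) -> Result<(), ()> {
    let expected: Vec<u8> = payload.iter().rev().copied().collect();
    if sig == expected.as_slice() { Ok(()) } else { Err(()) }
}

// Receive path: verify first; on failure return without touching any
// store, forwarding to peers, or sending an error response.
fn on_gossip_op(payload: &[u8], sig: &[u8], store: &mut Vec<Vec<u8>>) -> bool {
    if verify_signature(payload, sig).is_err() {
        return false; // silently discarded: no state mutation whatsoever
    }
    store.push(payload.to_vec()); // only verified ops reach the store
    true
}
```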
3. Stale state_hash (Optimistic Lock)
Tags: state, concurrency
Two admins sign operations against the same state_hash concurrently. The first applied successfully mutates the hash, causing the second to fail validation. This prevents silent divergence.
Trigger
Two admins concurrently read the same group state, both sign ops with state_hash = current_hash, and submit them roughly simultaneously.
First Op Wins
The first op to reach apply_local_signed_group_op passes validation, mutates the group state, and updates state_hash to a new value.
Second Op Rejected
The second op's state_hash no longer matches. apply_local_signed_group_op returns an error. The op is never applied or gossiped.
Recovery
The rejected admin re-reads current state (including the other admin's changes), re-signs with the updated state_hash, and resubmits. If state_hash is set to all-zeros, the check is explicitly bypassed.
Config & Metrics
- state_hash_conflicts — counter metric for rejected ops due to stale state_hash
- [u8; 32]::default() — all-zeros hash signals explicit bypass (used for Noop, bootstrap)
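The optimistic-lock check, including the all-zeros bypass, can be sketched as follows. `check_state_hash`, `Group`, and `apply` are illustrative names, not the real API; the real path runs inside apply_local_signed_group_op alongside nonce and auth validation.

```rust
// Optimistic lock: the op must carry the state_hash it was signed
// against, unless it opts out with the all-zeros sentinel
// ([u8; 32]::default(), used for Noop and bootstrap ops).
fn check_state_hash(op_hash: [u8; 32], current: [u8; 32]) -> Result<(), &'static str> {
    if op_hash == [0u8; 32] {
        return Ok(()); // explicit bypass
    }
    if op_hash == current {
        Ok(())
    } else {
        Err("stale state_hash: re-read state, re-sign, resubmit")
    }
}

struct Group { state_hash: [u8; 32] }

// A successful apply mutates state_hash, invalidating any other op
// that was signed against the old value.
fn apply(group: &mut Group, op_hash: [u8; 32], new_hash: [u8; 32]) -> Result<(), &'static str> {
    check_state_hash(op_hash, group.state_hash)?;
    group.state_hash = new_hash;
    Ok(())
}
```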
4. Out-of-Order Op Delivery
Tags: dag, gossip
A node receives a SignedGroupOp whose parent_op_hashes reference operations not yet seen locally. The op enters a pending queue until parents arrive.
Trigger
gossipsub delivers ops non-deterministically. A child op arrives before its parents. The node cannot apply it because the parent hashes are not in the local DagStore.
Pending Queue
The op enters DagStore's pending queue, indexed by its missing parent hashes. The op is held but not discarded.
Parent Arrival
When missing parents arrive (via gossip or catchup stream), the DagStore checks the pending queue for any ops now unblocked. These are applied in topological order.
Abuse Prevention
parent_op_hashes is capped at 256 entries per op. Ops exceeding this are rejected. Stale pending ops are cleaned up periodically.
Config & Metrics
- MAX_PARENT_OP_HASHES — 256, hard cap per SignedGroupOp
- pending_ops_count — gauge metric for ops awaiting parents
- pending_ops_resolved — counter for ops successfully unblocked
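The pending-queue mechanics can be sketched in a few lines. This is a simplified model under stated assumptions: `u64` hashes instead of 32-byte ones, a toy `DagStore`, and no signature or state checks; it shows the indexing by missing parent, the 256-entry cap, and the cascade that unblocks children when a parent lands.

```rust
use std::collections::{HashMap, HashSet};

const MAX_PARENT_OP_HASHES: usize = 256; // abuse-prevention hard cap

#[derive(Clone)]
struct Op { hash: u64, parents: Vec<u64> }

#[derive(Default)]
struct DagStore {
    applied: HashSet<u64>,
    pending: HashMap<u64, Vec<Op>>, // keyed by the parent hash awaited
}

impl DagStore {
    // Returns the hashes applied as a result of receiving this op.
    fn receive(&mut self, op: Op) -> Vec<u64> {
        if op.parents.len() > MAX_PARENT_OP_HASHES {
            return Vec::new(); // oversized parent list: rejected outright
        }
        match op.parents.iter().find(|p| !self.applied.contains(*p)) {
            Some(&missing) => {
                // park the op under the first missing parent; it is
                // re-checked (and may re-park) when that parent arrives
                self.pending.entry(missing).or_default().push(op);
                Vec::new()
            }
            None => {
                let mut done = vec![op.hash];
                self.applied.insert(op.hash);
                // drain any children now unblocked, in topological order
                if let Some(children) = self.pending.remove(&op.hash) {
                    for child in children {
                        done.extend(self.receive(child));
                    }
                }
                done
            }
        }
    }
}
```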
5. WASM Out-of-Memory
Tags: wasm, runtime
The WASM runtime enforces VMLimits (max memory pages, stack size). If a WASM module exceeds these limits, execution is trapped and no state delta is produced.
Trigger
WASM module calls memory.grow() beyond VMLimits.max_memory_pages, or stack depth exceeds max_stack_size, or fuel runs out.
Trap & Catch
Wasmer's runtime traps with a panic. VMLogic catches the panic at the host boundary and converts it into a structured Outcome with an error variant.
No Side Effects
No state delta is produced. No changeset is committed to storage. No gossip message is broadcast. The execution is reported as failed in the Outcome returned to the caller.
Recovery
Caller receives the error and can retry with smaller input, or the WASM application can be upgraded to use less memory. Node state is completely unaffected.
Config & Metrics
- VMLimits.max_memory_pages — configurable per-context memory ceiling
- VMLimits.max_stack_size — stack depth limit
- wasm_oom_traps — counter metric for OOM trap events
- wasm_execution_failures — counter for all execution failures
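The limit-enforcement shape can be sketched as below. This is not Wasmer's actual API: `Memory`, `grow`, and the `Outcome` enum here are toy stand-ins, and a real `memory.grow` signals failure to the guest rather than panicking. The point illustrated is the invariant: a failed grow changes nothing, mirroring "no delta, no changeset, no gossip".

```rust
// Sketch of a VMLimits-style ceiling, assuming 64 KiB wasm pages.
struct VMLimits { max_memory_pages: u32 }

struct Memory { pages: u32 }

#[derive(Debug, PartialEq)]
enum Outcome {
    Ok(u32),             // old page count, as memory.grow reports
    Trap(&'static str),  // structured error surfaced to the caller
}

// Succeeds and returns the previous page count, or traps without
// mutating memory when the ceiling would be exceeded.
fn grow(mem: &mut Memory, delta: u32, limits: &VMLimits) -> Outcome {
    match mem.pages.checked_add(delta) {
        Some(new) if new <= limits.max_memory_pages => {
            let old = mem.pages;
            mem.pages = new;
            Outcome::Ok(old)
        }
        _ => Outcome::Trap("memory limit exceeded"),
    }
}
```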
6. Missing DAG Parents (State Delta)
Tags: dag, sync
A node receives a StateDelta whose parent_ids reference deltas not yet in the local DeltaStore. The delta enters a pending queue until parents are fetched.
Trigger
A StateDelta arrives via gossip with parent_ids referencing deltas the node has not yet received or applied.
Pending Queue
Delta enters DeltaStore pending, indexed by missing parent IDs. The delta is stored but not applied.
Fetch Missing
NodeManager sends a DeltaRequest stream to peers likely to have the missing parents. Peers respond with the missing delta payloads.
Resolution
When all parents arrive, pending deltas are drained in topological order. Stale pending deltas that are never resolved are cleaned up periodically.
Config & Metrics
- pending_deltas_count — gauge for deltas awaiting parents
- delta_request_sent — counter for DeltaRequest messages emitted
- stale_pending_cleanup_interval — periodic cleanup timer
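The fetch-missing step can be sketched as follows. The `StateDelta` and `DeltaRequest` shapes and the `missing_parents` helper are illustrative assumptions, not the wire types: the sketch shows only how a delta's parent_ids are diffed against the local store to decide between immediate apply and a peer request.

```rust
use std::collections::HashSet;

// Hypothetical shapes; real deltas carry content-addressed IDs.
struct StateDelta { parent_ids: Vec<u64> }

#[derive(Debug, PartialEq)]
struct DeltaRequest { missing: Vec<u64> }

// Diff a delta's parents against the locally known delta IDs.
// None means all parents are present and the delta applies now;
// Some(req) means it goes to the pending queue and req is streamed
// to peers.
fn missing_parents(delta: &StateDelta, local: &HashSet<u64>) -> Option<DeltaRequest> {
    let mut missing: Vec<u64> = delta
        .parent_ids
        .iter()
        .filter(|id| !local.contains(id))
        .copied()
        .collect();
    if missing.is_empty() {
        return None;
    }
    missing.sort();
    missing.dedup(); // request each absent parent once
    Some(DeltaRequest { missing })
}
```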
7. Snapshot Sync Failure
Tags: sync, stream
During a snapshot transfer the connection drops. The DeltaBuffer may have accumulated deltas during the transfer. On retry, the system determines whether partial state is reusable or a fresh snapshot is needed.
Trigger
TCP/QUIC connection drops during a SnapshotStreamRequest transfer. The joining node has received partial snapshot data.
DeltaBuffer
While the snapshot was transferring, new deltas continued arriving via gossip. These are held in a DeltaBuffer with finite capacity. If the buffer fills, oldest deltas are dropped (sync_buffer_drops metric).
Retry Handshake
On reconnection, a new handshake compares the partial state's root_hash against the source's current state. If compatible, the transfer can resume from where it left off.
Recovery
If partial state is usable: apply buffered deltas and continue. If not: discard partial state, request a fresh snapshot from scratch. Invariant I6: buffer has finite capacity, drops oldest if full.
Config & Metrics
- sync_buffer_capacity — max deltas held during snapshot transfer
- sync_buffer_drops — counter for deltas dropped due to buffer overflow
- snapshot_retry_count — counter for snapshot transfer retries
- snapshot_transfer_bytes — histogram of snapshot sizes
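Invariant I6 (finite capacity, drop oldest when full) can be sketched with a bounded ring buffer. The `DeltaBuffer` type and method names below are illustrative, not the real API; `drops` models what feeds the sync_buffer_drops counter.

```rust
use std::collections::VecDeque;

// Bounded buffer for deltas that arrive via gossip while a snapshot
// is still transferring. When full, the oldest delta is dropped and
// counted (surfaced as the sync_buffer_drops metric).
struct DeltaBuffer<T> {
    buf: VecDeque<T>,
    capacity: usize,
    drops: u64,
}

impl<T> DeltaBuffer<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: VecDeque::with_capacity(capacity), capacity, drops: 0 }
    }

    fn push(&mut self, delta: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front(); // drop oldest: invariant I6
            self.drops += 1;
        }
        self.buf.push_back(delta);
    }

    // Drained and applied once the snapshot (or resumed transfer) lands.
    fn drain(&mut self) -> Vec<T> {
        self.buf.drain(..).collect()
    }
}
```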
8. Cascade on Member Removal
Tags: cascade, governance
When a member is removed from a group, all their context identities and join tracking across every context in that group are cleaned up in one deterministic cascade operation.
Trigger
A MemberRemoved GroupOp is applied to the GroupStore. The member is removed from the group's member list.
Index Scan
GroupStore scans the GroupMemberContext index to find all contexts the removed member had joined via this group. This index is maintained automatically when members join contexts within the group.
Cascade Removal
For each found context: the member's context identity is removed and their join tracking row is deleted. If the member was in 100 contexts, all 100 are cleaned up atomically within the single op apply.
Convergence Guarantee
The cascade is fully deterministic — the same MemberRemoved op on any node produces identical deletions. This is verified by the convergence test suite (21 two-node tests).
Config & Metrics
- cascade_contexts_removed — counter for context memberships cleaned up per MemberRemoved op
- cascade_duration_ms — histogram for time spent in cascade cleanup
- No upper limit on contexts per member — cascade is O(n) in number of contexts the member joined
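The cascade can be sketched as a single deterministic pass over the index. The `GroupStore` fields and `remove_member` function below are simplified stand-ins (string IDs, no join-tracking rows): the sketch shows the index lookup, the sorted iteration that makes deletion order identical on every node, and the count that feeds cascade_contexts_removed.

```rust
use std::collections::{HashMap, HashSet};

// Simplified index shapes: member -> contexts joined via this group
// (the GroupMemberContext index), and per-context identity sets.
struct GroupStore {
    member_contexts: HashMap<String, HashSet<String>>,
    context_identities: HashMap<String, HashSet<String>>,
}

// Applying MemberRemoved: remove the index row, then delete the
// member's identity in every context it listed, in sorted order so
// the cascade is byte-for-byte identical on every node.
fn remove_member(store: &mut GroupStore, member: &str) -> usize {
    let contexts = store.member_contexts.remove(member).unwrap_or_default();
    let mut ordered: Vec<_> = contexts.into_iter().collect();
    ordered.sort(); // deterministic deletion order across nodes
    let mut removed = 0;
    for ctx in ordered {
        if let Some(ids) = store.context_identities.get_mut(&ctx) {
            if ids.remove(member) {
                removed += 1; // feeds cascade_contexts_removed
            }
        }
    }
    removed
}
```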