Error Flows

What happens when things go wrong

1. Network Partition

Tags: network, dag-merge

Two groups of nodes lose connectivity. Each partition continues applying local ops independently. On reconnection the system automatically detects divergence and converges.

[Diagram: Partition A (Nodes 1–2, ops A1 → A2 → A3) and Partition B (Nodes 3–4, ops B1 → B2) diverge during the partition. On reconnect, heartbeats reveal divergent dag_heads, SyncManager runs catchup, and a merge Noop op M with parents A3 and B2 converges both partitions.]

Trigger

Network connectivity lost between two subsets of group members. Each subset can still communicate internally.

During Partition

Each partition continues signing and applying SignedGroupOps locally. Because the two histories diverged from a shared ancestor, ops on either side carry different parent hashes.

Detection

On reconnection, periodic GroupStateHeartbeat messages reveal divergent dag_heads and/or differing root_hash. SyncManager triggers a catchup stream.
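
As a minimal sketch of the detection predicate (the Heartbeat shape and field names here are assumptions, not the actual wire format), divergence reduces to comparing the advertised heads and root hash:

    use std::collections::BTreeSet;

    type OpHash = [u8; 32];

    struct Heartbeat {
        dag_heads: BTreeSet<OpHash>,
        root_hash: [u8; 32],
    }

    /// True when the remote node advertises state we have not seen (or
    /// vice versa), i.e. a catchup stream should be triggered.
    fn is_divergent(local: &Heartbeat, remote: &Heartbeat) -> bool {
        local.root_hash != remote.root_hash || local.dag_heads != remote.dag_heads
    }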

Recovery

Missing ops are exchanged via GroupDeltaRequest/GroupDeltaResponse. A merge Noop op with multiple parents reconciles the DAG heads. All nodes converge to identical state.
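
A hedged sketch of what the merge op might look like. Only the Noop-with-multiple-parents shape and the all-zeros bypass come from this document; UnsignedGroupOp and the field layout are illustrative stand-ins:

    type OpHash = [u8; 32];

    enum GroupOp {
        Noop,
        // ...other variants elided
    }

    struct UnsignedGroupOp {
        op: GroupOp,
        parent_op_hashes: Vec<OpHash>, // one entry per divergent DAG head
        state_hash: [u8; 32],
    }

    /// Build the merge op that reconciles divergent heads, e.g. [A3, B2].
    fn merge_op(heads: Vec<OpHash>) -> UnsignedGroupOp {
        UnsignedGroupOp {
            op: GroupOp::Noop,
            parent_op_hashes: heads,
            state_hash: [0u8; 32], // all-zeros: explicit bypass (see flow 3)
        }
    }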

Config & Metrics

  • heartbeat_interval — 30s default, controls detection latency
  • max_dag_heads — capped at 64, triggers merge when exceeded
  • sync_catchup_ops — counter metric for ops transferred during catchup

2. Invalid Signature

Tags: crypto, gossip

A remote node receives a SignedGroupOpV1 whose Ed25519 signature does not verify. The message is silently discarded with no state change on the receiver.

[Diagram: a malicious or buggy sender emits a SignedGroupOpV1 with an invalid signature; the receiving node's verify_signature() fails and the op is silently rejected with no apply and no forward. Sender reputation is handled by gossipsub's scoring, with no explicit penalty in the governance layer.]

Trigger

A SignedGroupOpV1 arrives via gossipsub and verify_signature() returns Err. Causes include a tampered payload, a wrong signing key, or serialization corruption.

What Happens

Op is rejected immediately. Not applied to GroupStore, not appended to DagStore, not forwarded to other peers. No state mutation on the receiving node whatsoever.
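
A sketch of the first-gate pattern, assuming ed25519-dalek for verification; the SignedGroupOpV1 field layout shown here is a guess for illustration, not the real definition:

    use ed25519_dalek::{Signature, SignatureError, Verifier, VerifyingKey};

    struct SignedGroupOpV1 {
        payload: Vec<u8>, // canonical encoded op bytes (assumed layout)
        signature: Signature,
        signer: VerifyingKey,
    }

    /// First gate: runs before any other validation or state change.
    fn verify_signature(op: &SignedGroupOpV1) -> Result<(), SignatureError> {
        op.signer.verify(&op.payload, &op.signature)
    }

    /// Returns None when the op is silently dropped: no apply, no DAG
    /// append, no forward, no error response to the sender.
    fn handle_incoming(op: SignedGroupOpV1) -> Option<SignedGroupOpV1> {
        if verify_signature(&op).is_err() {
            return None; // gossipsub peer scoring is the only consequence
        }
        Some(op) // nonce / state_hash / auth checks run after this point
    }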

Recovery

No recovery needed — the system is already in a correct state. gossipsub's peer scoring mechanism may lower the sender's reputation if this happens repeatedly.

Design Rationale

Silent rejection is deliberate: no error response leaks information to potential attackers. The signature check is the first gate — all other validation (nonce, state_hash, auth) happens after.

3. Stale state_hash (Optimistic Lock)

Tags: state, concurrency

Two admins sign operations against the same state_hash concurrently. The first op to be applied mutates the hash, causing the second to fail validation. This prevents silent divergence.

[Diagram: with state_hash = 0xABC…, Admins A and B both sign against 0xABC. A's op applies and advances state_hash to 0xDEF…; B's op fails with a state_hash mismatch (0xABC ≠ 0xDEF) and must re-read state, re-sign, and retry. An all-zeros state_hash skips validation entirely (explicit bypass).]

Trigger

Two admins concurrently read the same group state, both sign ops with state_hash = current_hash, and submit them roughly simultaneously.

First Op Wins

The first op to reach apply_local_signed_group_op passes validation, mutates the group state, and updates state_hash to a new value.

Second Op Rejected

The second op's state_hash no longer matches. apply_local_signed_group_op returns an error. The op is never applied or gossiped.

Recovery

The rejected admin re-reads current state (including the other admin's changes), re-signs with the updated state_hash, and resubmits. If state_hash is set to all-zeros, the check is explicitly bypassed.
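
A minimal sketch of the check, written as a standalone helper for illustration (in reality the logic lives inside apply_local_signed_group_op):

    type StateHash = [u8; 32];

    /// Optimistic-lock gate. An all-zeros hash ([u8; 32]::default()) is
    /// the documented bypass used by Noop and bootstrap ops.
    fn check_state_hash(op_hash: StateHash, current: StateHash) -> Result<(), &'static str> {
        if op_hash == StateHash::default() {
            return Ok(()); // explicit bypass
        }
        if op_hash != current {
            // Stale read: the caller must re-read, re-sign, resubmit.
            return Err("state_hash mismatch");
        }
        Ok(())
    }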

Config & Metrics

  • state_hash_conflicts — counter metric for rejected ops due to stale state_hash
  • [u8; 32]::default() — all-zeros hash signals explicit bypass (used for Noop, bootstrap)

4. Out-of-Order Op Delivery

Tags: dag, gossip

A node receives a SignedGroupOp whose parent_op_hashes reference operations not yet seen locally. The op enters a pending queue until parents arrive.

[Diagram: Op C arrives first with unknown parent B and is parked in the DagStore pending queue. Op A (parent: genesis) and Op B (parent: A) arrive and apply; B's arrival unblocks C, yielding topological application order A → B → C.]

Trigger

gossipsub delivers ops non-deterministically. A child op arrives before its parents. The node cannot apply it because the parent hashes are not in the local DagStore.

Pending Queue

The op enters DagStore's pending queue, indexed by its missing parent hashes. The op is held but not discarded.

Parent Arrival

When missing parents arrive (via gossip or catchup stream), the DagStore checks the pending queue for any ops now unblocked. These are applied in topological order.

Abuse Prevention

parent_op_hashes is capped at 256 entries per op. Ops exceeding this are rejected. Stale pending ops are cleaned up periodically.
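
A sketch of a pending queue indexed by missing parents, with the parent cap enforced on entry. Only MAX_PARENT_OP_HASHES and the overall behavior come from this document; the data layout is an assumption:

    use std::collections::{HashMap, HashSet};

    type OpHash = [u8; 32];

    const MAX_PARENT_OP_HASHES: usize = 256; // hard cap per SignedGroupOp

    struct Parked {
        encoded_op: Vec<u8>,      // opaque op bytes for the sketch
        missing: HashSet<OpHash>, // parents still absent from the DagStore
    }

    #[derive(Default)]
    struct PendingQueue {
        parked: HashMap<OpHash, Parked>,       // op hash -> parked op
        waiters: HashMap<OpHash, Vec<OpHash>>, // missing parent -> children
    }

    impl PendingQueue {
        /// Park an op until its missing parents arrive; ops over the
        /// parent cap are rejected outright (abuse prevention).
        fn park(
            &mut self,
            op_hash: OpHash,
            encoded_op: Vec<u8>,
            parents: &[OpHash],
            missing: HashSet<OpHash>,
        ) -> Result<(), &'static str> {
            if parents.len() > MAX_PARENT_OP_HASHES {
                return Err("too many parent_op_hashes");
            }
            for parent in &missing {
                self.waiters.entry(*parent).or_default().push(op_hash);
            }
            self.parked.insert(op_hash, Parked { encoded_op, missing });
            Ok(())
        }

        /// Called after `parent` is applied; returns ops now fully
        /// unblocked, to be applied in topological order.
        fn on_applied(&mut self, parent: OpHash) -> Vec<Vec<u8>> {
            let mut ready = Vec::new();
            for child in self.waiters.remove(&parent).unwrap_or_default() {
                if let Some(p) = self.parked.get_mut(&child) {
                    p.missing.remove(&parent);
                    if p.missing.is_empty() {
                        ready.push(self.parked.remove(&child).unwrap().encoded_op);
                    }
                }
            }
            ready
        }
    }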

Config & Metrics

  • MAX_PARENT_OP_HASHES — 256, hard cap per SignedGroupOp
  • pending_ops_count — gauge metric for ops awaiting parents
  • pending_ops_resolved — counter for ops successfully unblocked

5. WASM Out-of-Memory

wasm runtime

The WASM runtime enforces VMLimits (max memory pages, stack size). If a WASM module exceeds these limits, execution is trapped and no state delta is produced.

[Diagram: a WASM module's memory.grow() exceeds VMLimits (max_memory_pages / max_stack_size); Wasmer traps, VMLogic catches the panic and returns Outcome { error }. No state delta is produced, no gossip is broadcast, and the error is returned to the caller. The storage layer is never touched, so there are no partial writes and nothing to roll back; the WASM sandbox provides clean isolation.]

Trigger

WASM module calls memory.grow() beyond VMLimits.max_memory_pages, or stack depth exceeds max_stack_size, or fuel runs out.

Trap & Catch

Wasmer's runtime traps with a panic. VMLogic catches the panic at the host boundary and converts it into a structured Outcome with an error variant.

No Side Effects

No state delta is produced. No changeset is committed to storage. No gossip message is broadcast. The execution is reported as failed in the Outcome returned to the caller.
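
A sketch of the trap-to-Outcome conversion at the host boundary, using std::panic::catch_unwind as a stand-in for VMLogic's actual handling; the Outcome shape is assumed:

    use std::panic::{self, AssertUnwindSafe};

    /// Illustrative stand-in for the runtime's Outcome type.
    struct Outcome {
        returns: Option<Vec<u8>>,
        error: Option<String>,
    }

    /// Converts a trap that surfaces as a panic at the host boundary into
    /// a structured Outcome. No delta or changeset exists at this point,
    /// so a trap leaves nothing to roll back.
    fn run_guest<F: FnOnce() -> Vec<u8>>(call: F) -> Outcome {
        match panic::catch_unwind(AssertUnwindSafe(call)) {
            Ok(bytes) => Outcome { returns: Some(bytes), error: None },
            Err(_) => Outcome {
                returns: None,
                error: Some("wasm trap: VM limit exceeded".into()),
            },
        }
    }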

Recovery

The caller receives the error and can retry with smaller input, or the WASM application can be upgraded to use less memory. Node state is completely unaffected.

Config & Metrics

  • VMLimits.max_memory_pages — configurable per-context memory ceiling
  • VMLimits.max_stack_size — stack depth limit
  • wasm_oom_traps — counter metric for OOM trap events
  • wasm_execution_failures — counter for all execution failures

6. Missing DAG Parents (State Delta)

Tags: dag, sync

A node receives a StateDelta whose parent_ids reference deltas not yet in the local DeltaStore. The delta enters a pending queue until parents are fetched.

[Diagram: a StateDelta with parent_ids [X, Y] arrives; the DeltaStore lookup finds parent X unknown, so the delta is parked in pending while NodeManager sends DeltaRequest { delta_id: X }. When X arrives, pending deltas are applied in topological order.]

Trigger

A StateDelta arrives via gossip with parent_ids referencing deltas the node has not yet received or applied.

Pending Queue

The delta enters the DeltaStore pending queue, indexed by its missing parent IDs. The delta is stored but not applied.

Fetch Missing

NodeManager sends a DeltaRequest stream to peers that likely have the missing parent. Peers respond with the missing delta payload.

Resolution

When all parents arrive, pending deltas are drained in topological order. Stale pending deltas that are never resolved are cleaned up periodically.
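
A sketch of the triage step for an incoming delta; StateDelta, parent_ids, and DeltaRequest are named in this document, while the surrounding types are illustrative:

    use std::collections::HashSet;

    type DeltaId = [u8; 32];

    struct StateDelta {
        id: DeltaId,
        parent_ids: Vec<DeltaId>,
        payload: Vec<u8>,
    }

    enum Disposition {
        /// All parents known: apply immediately.
        Apply,
        /// Hold in pending and emit one DeltaRequest per missing parent.
        Park { missing: Vec<DeltaId> },
    }

    fn triage(delta: &StateDelta, known: &HashSet<DeltaId>) -> Disposition {
        let missing: Vec<DeltaId> = delta
            .parent_ids
            .iter()
            .copied()
            .filter(|p| !known.contains(p))
            .collect();
        if missing.is_empty() {
            Disposition::Apply
        } else {
            Disposition::Park { missing }
        }
    }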

Config & Metrics

  • pending_deltas_count — gauge for deltas awaiting parents
  • delta_request_sent — counter for DeltaRequest messages emitted
  • stale_pending_cleanup_interval — periodic cleanup timer

7. Snapshot Sync Failure

Tags: sync, stream

During a snapshot transfer the connection drops. The DeltaBuffer may have accumulated deltas during the transfer. On retry, the system determines whether partial state is reusable or a fresh snapshot is needed.

[Diagram: a SnapshotStream from the source node to the joining node drops mid-transfer, leaving partial snapshot state while the DeltaBuffer keeps accumulating deltas arriving during the transfer. On retry, a new handshake compares the partial state's root_hash against the source's current state: if compatible, the transfer resumes from the partial state and buffered deltas are applied; otherwise the partial state is discarded and a fresh snapshot is requested.]

Trigger

TCP/QUIC connection drops during a SnapshotStreamRequest transfer. The joining node has received partial snapshot data.

DeltaBuffer

While the snapshot was transferring, new deltas continued arriving via gossip. These are held in a DeltaBuffer with finite capacity; if the buffer fills, the oldest deltas are dropped and counted in the sync_buffer_drops metric.

Retry Handshake

On reconnection, a new handshake compares the partial state's root_hash against the source's current state. If compatible, the transfer can resume from where it left off.

Recovery

If the partial state is usable, buffered deltas are applied and the transfer continues. If not, the partial state is discarded and a fresh snapshot is requested from scratch. Invariant I6 applies: the buffer has finite capacity and drops the oldest deltas when full.
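
A sketch of a finite-capacity buffer with drop-oldest semantics, matching invariant I6 and the sync_buffer_drops metric; the implementation details are assumed:

    use std::collections::VecDeque;

    /// Holds deltas that arrive while a snapshot transfer is in flight.
    struct DeltaBuffer<T> {
        inner: VecDeque<T>,
        capacity: usize, // sync_buffer_capacity
        drops: u64,      // feeds the sync_buffer_drops counter
    }

    impl<T> DeltaBuffer<T> {
        fn new(capacity: usize) -> Self {
            Self { inner: VecDeque::with_capacity(capacity), capacity, drops: 0 }
        }

        /// Drop the oldest delta when full, then append the new one.
        fn push(&mut self, delta: T) {
            if self.inner.len() == self.capacity {
                self.inner.pop_front();
                self.drops += 1;
            }
            self.inner.push_back(delta);
        }

        /// Drained and applied once the snapshot (full or resumed) lands.
        fn drain(&mut self) -> impl Iterator<Item = T> + '_ {
            self.inner.drain(..)
        }
    }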

Config & Metrics

  • sync_buffer_capacity — max deltas held during snapshot transfer
  • sync_buffer_drops — counter for deltas dropped due to buffer overflow
  • snapshot_retry_count — counter for snapshot transfer retries
  • snapshot_transfer_bytes — histogram of snapshot sizes

8. Cascade on Member Removal

Tags: cascade, governance

When a member is removed from a group, all their context identities and join tracking across every context in that group are cleaned up in one deterministic cascade operation.

[Diagram: a MemberRemoved GroupOp triggers a GroupStore scan of the GroupMemberContext index; the member's identity is removed from every context (A through N) and the join-tracking rows (GroupMemberContext entries) are deleted for each. The cascade is deterministic: the same op on any node yields identical deletions, verified by convergence tests.]

Trigger

A MemberRemoved GroupOp is applied to the GroupStore. The member is removed from the group's member list.

Index Scan

GroupStore scans the GroupMemberContext index to find all contexts the removed member had joined via this group. This index is maintained automatically when members join contexts within the group.

Cascade Removal

For each such context, the member's context identity is removed and their join-tracking row is deleted. If the member was in 100 contexts, all 100 are cleaned up atomically within the single op apply.
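
A sketch of the cascade against a toy in-memory model of the GroupMemberContext index; the real GroupStore layout is not described here:

    use std::collections::HashMap;

    type MemberId = u64;
    type ContextId = u64;

    /// member -> contexts joined via this group (illustrative layout).
    #[derive(Default)]
    struct GroupMemberContextIndex {
        by_member: HashMap<MemberId, Vec<ContextId>>,
    }

    /// Applies the cascade for a MemberRemoved op. Every context identity
    /// and join-tracking row for `member` is deleted inside the single op
    /// apply, so the result is deterministic across nodes.
    fn cascade_member_removed(
        index: &mut GroupMemberContextIndex,
        identities: &mut HashMap<(ContextId, MemberId), Vec<u8>>,
        member: MemberId,
    ) -> usize {
        let contexts = index.by_member.remove(&member).unwrap_or_default();
        for ctx in &contexts {
            identities.remove(&(*ctx, member)); // context identity removed
            // the join-tracking row for (ctx, member) is deleted here too
        }
        contexts.len() // feeds the cascade_contexts_removed counter
    }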

Convergence Guarantee

The cascade is fully deterministic — the same MemberRemoved op on any node produces identical deletions. This is verified by the convergence test suite (21 two-node tests).

Config & Metrics

  • cascade_contexts_removed — counter for context memberships cleaned up per MemberRemoved op
  • cascade_duration_ms — histogram for time spent in cascade cleanup
  • No upper limit on contexts per member — the cascade is O(n) in the number of contexts the member joined