Metrics Reference

All Prometheus metrics organized by subsystem

Sync Metrics

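Metrics emitted by the sync engine covering protocol selection, data transfer, phase timing, and invariant monitoring. Counter names in the tables below appear without a suffix; the example queries under Dashboard Queries reference the same counters with a _total suffix (for example, sync_successes_total).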

crates/node/src/sync/prometheus_metrics.rs

Transfer & Volume

Metric	Type	Labels	Description
sync_messages_sent	Family<Counter>	protocol	Total sync protocol messages sent
sync_bytes_sent	Family<Counter>	protocol	Total sync protocol bytes sent
sync_round_trips	Family<Counter>	protocol	Total sync round trips
sync_entities_transferred	Counter	—	Total entities transferred during sync

Transfer metric usage notes

sync_messages_sent — High values per protocol indicate frequent sync activity. Compare across protocols to understand which sync strategy dominates. Alert if any single protocol accounts for >90% of messages (may indicate stuck fallback).

sync_bytes_sent — Use with sync_messages_sent to compute average message size. Snapshot protocol will naturally produce larger values. A sudden spike may indicate a node falling behind and triggering full snapshots.

sync_round_trips — A higher round-trip count for hash comparison is normal (binary search). For snapshot sync, the count should be very low (1-2). A high count on level-wise sync suggests deep DAG divergence.

sync_entities_transferred — Track the rate of this counter. Sustained high rates indicate active state divergence across the cluster. Should be near-zero during steady state.
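
The two derived figures mentioned in these notes, average message size and each protocol's share of messages, can be sketched offline as follows. This is a minimal illustration, assuming you already have per-protocol counter deltas over some window. The function names, the dictionary shape, and the protocol label values are hypothetical and do not come from the codebase.

```python
def average_message_size(bytes_delta, messages_delta):
    """Mean bytes per message over a window, or None if nothing was sent."""
    if messages_delta <= 0:
        return None
    return bytes_delta / messages_delta


def dominant_protocol(message_deltas, threshold=0.9):
    """Return the protocol whose share of messages exceeds `threshold`, else None."""
    total = sum(message_deltas.values())
    if total <= 0:
        return None
    for protocol, count in message_deltas.items():
        if count / total > threshold:
            return protocol
    return None


# Hypothetical deltas of sync_messages_sent over one window, keyed by an
# assumed protocol label value (the real label values may differ).
window = {"hash_comparison": 40, "level_wise": 10, "snapshot": 950}

print(average_message_size(2_000_000, 1_000))  # bytes per message in the window
print(dominant_protocol(window))               # protocol above the 90% share, if any
```

Returning None for an empty window keeps a quiet cluster from being reported as a division error or as a misleading zero.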

CRDT Operations

Metric	Type	Labels	Description
sync_merges	Family<Counter>	protocol	Total CRDT merge operations
sync_comparisons	Counter	—	Total entity hash comparisons

CRDT metric usage notes

sync_merges — Each merge represents a conflict resolution. High merge rates relative to entity transfers indicate heavy concurrent writes. Useful for tuning write concurrency.

sync_comparisons — Hash comparisons are cheap, but a high volume indicates the hash-compare protocol is doing a lot of work. The ratio of comparisons to entities transferred measures sync efficiency: a lower ratio means a larger share of the compared entities had to be transferred, that is, more divergence.

Phase Timing

Metric	Type	Labels	Description
sync_phase_duration_seconds	Family<Histogram>	phase	Duration of sync phases (buckets: 1ms, 2ms, 4ms, 8ms, 16ms, 32ms, 64ms, 128ms, 256ms, 512ms, 1s, 2s, 4s, 8s, 16s)
sync_duration_seconds	Family<Histogram>	protocol	Full sync session duration (buckets: 10ms, 20ms, 40ms, 80ms, 160ms, 320ms, 640ms, 1.28s, 2.56s, 5.12s, 10.24s, 20.48s, 40.96s, 81.92s, 160s)

Timing metric usage notes

sync_phase_duration_seconds — Break down which phase is the bottleneck. Phases include: discovery, comparison, transfer, and apply. If the "apply" phase is slow, storage writes may be the bottleneck. If "transfer" is slow, check network bandwidth.

sync_duration_seconds — The end-to-end duration of a complete sync session. Hash comparison should complete in <100ms; level-wise in <1s; snapshot may take 10s+. Alert on p99 > 30s for any protocol.
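
The bucket lists in the Phase Timing table read as doubling sequences. The sketch below assumes the boundaries are produced by repeated doubling from the first value; the source does not show the generator, and exponential_buckets here is a local helper rather than a library call. Under that assumption some table entries are rounded: the boundary listed as 1s would be 1.024s, and the one listed as 160s would be 163.84s.

```python
def exponential_buckets(start, factor, count):
    """Local helper: `count` upper bounds, each `factor` times the previous one."""
    buckets = []
    bound = start
    for _ in range(count):
        buckets.append(bound)
        bound *= factor
    return buckets


# Assumed parameters, inferred from the bucket lists in the table.
phase_buckets = exponential_buckets(0.001, 2.0, 15)    # sync_phase_duration_seconds
session_buckets = exponential_buckets(0.010, 2.0, 15)  # sync_duration_seconds

print([round(b, 3) for b in phase_buckets[-5:]])
print([round(b, 2) for b in session_buckets[-2:]])
```

If exact thresholds matter, for example when selecting a specific le value in a query, check the boundaries your node actually exports instead of relying on the rounded figures.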

Invariant Monitoring

Metric	Type	Labels	Description
sync_snapshot_blocked	Counter	—	Snapshot blocked on initialized nodes (invariant I5)
sync_verification_failures	Counter	—	Snapshot verification failures (invariant I7)
sync_lww_fallback	Counter	—	LWW fallback events
sync_buffer_drops	Counter	—	Delta buffer drop events (invariant I6 risk)

Invariant metric usage notes

sync_snapshot_blocked — Should always be 0 in normal operation. Any increment means a snapshot sync was attempted on an already-initialized node, which is blocked by safety invariant I5. Investigate immediately if non-zero.

sync_verification_failures — Indicates post-snapshot state hash mismatch (invariant I7). Non-zero values suggest data corruption or non-deterministic execution. Requires urgent investigation.

sync_lww_fallback — LWW (last-writer-wins) fallback is a safety net when normal merge fails. Occasional occurrences are acceptable; sustained increases may indicate a bug in CRDT merge logic.

sync_buffer_drops — Delta buffer is bounded; drops mean incoming deltas are arriving faster than they can be processed (invariant I6 risk). May lead to data loss requiring catch-up sync. Alert on any increment.

Protocol Selection & Outcomes

Metric	Type	Labels	Description
sync_attempts	Family<Counter>	protocol	Total sync attempts by protocol
sync_successes	Family<Counter>	protocol	Successful syncs by protocol
sync_failures	Family<Counter>	protocol	Failed syncs by protocol
sync_protocol_selections	Family<Counter>	protocol	Protocol selection decisions

Protocol outcome usage notes

sync_attempts / sync_successes / sync_failures — Compute success rate per protocol: sync_successes / sync_attempts. Hash comparison should have >95% success rate. If snapshot failures are non-zero, check network stability and storage health.
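
As a rough illustration of the success-rate calculation, the sketch below flags protocols whose rate falls under a target. It assumes per-protocol (successes, attempts) deltas collected over one window. The protocol label values and function names are hypothetical, and applying the 95% target to every protocol is a simplification, since the text states it only for hash comparison.

```python
def success_rate(successes, attempts):
    """Fraction of attempts that succeeded, or None when there were no attempts."""
    if attempts <= 0:
        return None
    return successes / attempts


def protocols_below_target(counters, target=0.95):
    """List protocols whose success rate is defined and below `target`."""
    flagged = []
    for protocol, (successes, attempts) in sorted(counters.items()):
        rate = success_rate(successes, attempts)
        if rate is not None and rate < target:
            flagged.append(protocol)
    return flagged


# Hypothetical (successes, attempts) deltas per assumed protocol label value.
sample = {"hash_comparison": (97, 100), "level_wise": (8, 10), "snapshot": (0, 0)}

print(protocols_below_target(sample))
```

Protocols with no attempts in the window are skipped rather than treated as failing, which avoids false alarms for rarely used protocols such as snapshot sync.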

sync_protocol_selections — Shows which protocol the adaptive selector chose. Healthy clusters should see mostly hash comparison. Increasing snapshot selections indicates growing state divergence across nodes.

Network Event Channel Metrics

Metrics from the bounded event channel between the network layer and node manager. All metrics are registered under the network_event_channel sub-registry.

crates/node/src/network_event_channel.rs

Metric	Type	Description
depth	Gauge	Current events waiting in channel
received_total	Counter	Total events sent to channel
processed_total	Counter	Total events received from channel
dropped_total	Counter	Events dropped (channel full)
processing_latency_seconds	Histogram	Event send-to-processing latency

Channel metric usage notes

depth — Instantaneous queue depth. Sustained high values (>80% of capacity) indicate the NodeManager can't keep up with incoming network events. May cause drops.

received_total / processed_total — The difference received_total - processed_total should equal depth + dropped_total at any point. Growing divergence indicates processing lag.

dropped_total — Any increment means the channel was full and events were lost. Dropped events may cause missed gossip messages or delayed sync triggers. Alert on any non-zero rate.

processing_latency_seconds — Measures end-to-end event latency from when NetworkManager enqueues to when NodeManager dequeues. p99 > 500ms indicates backpressure. p50 should be < 10ms in healthy systems.
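
The accounting identity and the 80% depth guideline from these notes can be checked against a single snapshot of the channel metrics. The sketch below is illustrative only: channel_report is a hypothetical helper, and the capacity of 1000 is an assumption, not a documented value.

```python
def channel_report(depth, received_total, processed_total, dropped_total, capacity):
    """Summarize one snapshot of the channel gauge and counters."""
    backlog = received_total - processed_total
    return {
        "utilization": depth / capacity,
        # Identity stated in the notes: received - processed == depth + dropped.
        "accounting_gap": backlog - (depth + dropped_total),
        "high_depth": depth > 0.8 * capacity,
        "dropping": dropped_total > 0,
    }


# Hypothetical snapshot; the capacity value is assumed for illustration.
report = channel_report(
    depth=850,
    received_total=10_900,
    processed_total=10_000,
    dropped_total=50,
    capacity=1000,
)
print(report)
```

A small non-zero accounting_gap in one snapshot may simply reflect values sampled at slightly different moments; a gap that keeps growing is the signal worth investigating.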

Context Metrics

Per-context execution metrics for monitoring WASM runtime performance.

crates/context/src/metrics.rs

Metric	Type	Labels	Description
execution_count	Family<Gauge>	context_id	Context runtime execution counter
execution_duration_seconds	Family<Histogram>	context_id	Execution duration (buckets: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s, 1024s)

Context metric usage notes

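execution_count — A gauge tracking cumulative executions per context. Use rate() to get executions per second; because the metric is exported as a gauge rather than a counter, keep in mind that rate() treats any decrease in the value as a counter reset. Contexts with unusually high execution rates may be under heavy client load or in a retry loop.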

execution_duration_seconds — Measures WASM method execution time including host function calls and storage I/O. Bucket boundaries start at 1s — anything in the first bucket is healthy. Executions > 64s may indicate infinite loops or expensive storage operations. Consider setting WASM gas limits.
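
Because the first bucket boundary is 1s, any quantile that lands in that bucket is an interpolation between 0 and 1s, not a measurement. The sketch below is a simplified version of the linear interpolation that PromQL's histogram_quantile applies to classic histograms. The local function and the sample counts are hypothetical and are not code from the node.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the first execution_duration_seconds buckets.
sample = [(1.0, 90), (2.0, 95), (4.0, 99), (8.0, 100), (math.inf, 100)]

print(histogram_quantile(0.50, sample))  # interpolated inside the 0-1s bucket
print(histogram_quantile(0.99, sample))
```

In this sample the median falls inside the first bucket, so all that is really known is that it is below 1s; the fractional figure comes from the even-spread assumption.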

Governance metrics (planned)

These Prometheus-style names are reserved for future governance observability. They are not necessarily registered in the codebase yet; treat this as a specification placeholder.

Metric	Type	Labels	Description
calimero_governance_ops_applied_total	Counter	—	Total governance operations applied successfully
calimero_governance_ops_rejected_total	Counter	reason (suggested)	Total governance operations rejected (authorization failures, stale state, validation, etc.)
calimero_governance_dag_heads	Gauge	group_id (suggested)	Current number of DAG heads per group
calimero_governance_heartbeat_mismatches_total	Counter	—	Heartbeat head comparison failures

Planned — Implementations may use different label sets or histograms; align exported names with this table when adding instrumentation.

TEE Self-Purge Metrics

Observability for the self-purge cleanup that runs when a hardware-attested fleet node is evicted from its ReadOnlyTee role. Self-purge hard-deletes a node's own local rows and keys (signing-key material and, as of #2776, the per-group encryption keys) and is recovered across restarts by a marker plus a startup reconcile sweep. See Membership & Leave and TEE Fleet HA.

crates/governance-store/src/metrics.rs

Naming — These families register under the context → group_store sub-registries, so the exported Prometheus names carry the context_group_store_ prefix (e.g. context_group_store_self_purge_failures_total). The registered names are shown below.

Metric	Type	Labels	Description
self_purge_failures_total	Family<Counter>	branch, class	Self-purge local-state cleanup failures. branch ∈ {namespace, subgroup}; class ∈ {signing_key, context_cleanup}. The class="signing_key" series is the security-relevant one (forward-secrecy residue on disk); class="context_cleanup" is a best-effort dead-pointer leak and is informational.
self_purge_reconcile_total	Family<Counter>	outcome	Startup reconcile-sweep outcomes, one increment per marked namespace processed. outcome ∈ {reconciled, retained, cleared_stale, stale_clear_failed, skipped}. retained stuck across restarts means a signing-key purge keeps failing; read-uncertainty lands in skipped (not in self_purge_failures_total).
self_purge_events_dropped_total	Counter	—	Self-purge op-events dropped by the broadcast Lagged arm (subscriber fell behind). An upper bound on dropped eviction events, each of which writes no reconcile marker — un-recoverable on-disk residue (bounded; not a forward-secrecy hole, which is held by key rotation).

Self-purge metric usage notes

self_purge_failures_total — Alert on any non-zero rate of class="signing_key": it means a node's own private signing-key material may linger on disk after eviction, pending the reconcile sweep. class="context_cleanup" is informational (orphaned dead-pointer rows); namespace deletion and unsubscribe still proceed.

self_purge_reconcile_total — In steady state the startup sweep should be quiet. A namespace persistently in outcome="retained" across restarts indicates a signing-key purge that keeps failing — investigate. stale_clear_failed is benign (re-evaluated next restart); skipped means the sweep declined to purge under read uncertainty (never purge on uncertainty).

self_purge_events_dropped_total — Any non-zero rate means the self-purge subscriber fell behind by more than the broadcast channel's capacity, so some evictions may have left un-reconcilable on-disk residue. The residue is bounded and is not a forward-secrecy hole, but the metric is worth alerting on as a correctness signal.

Dashboard Queries

Example PromQL queries for common monitoring scenarios. Adapt the label selectors to match your deployment.

Sync Health Overview

# Sync success rate by protocol (5m window)
sum(rate(sync_successes_total{protocol=~".*"}[5m])) by (protocol)
/ sum(rate(sync_attempts_total{protocol=~".*"}[5m])) by (protocol)

# Sync failure rate — alert when > 0.1/s
sum(rate(sync_failures_total[5m])) by (protocol) > 0.1

# P99 sync duration by protocol
histogram_quantile(0.99, sum(rate(sync_duration_seconds_bucket[5m])) by (le, protocol))

Invariant Alerts

# CRITICAL: Any snapshot verification failure
increase(sync_verification_failures_total[5m]) > 0

# WARNING: Delta buffer drops (I6 risk)
rate(sync_buffer_drops_total[5m]) > 0

# WARNING: Snapshot blocked on initialized node (I5)
increase(sync_snapshot_blocked_total[5m]) > 0

Network Channel Backpressure

# Channel utilization (assuming capacity of 1000)
network_event_channel_depth / 1000

# Event drop rate — alert on any drops
rate(network_event_channel_dropped_total[1m]) > 0

# P99 event processing latency
histogram_quantile(0.99, sum(rate(network_event_channel_processing_latency_seconds_bucket[5m])) by (le))

Context Execution Performance

# Top 5 contexts by execution rate
topk(5, rate(execution_count[5m]))

# P95 execution duration per context
histogram_quantile(0.95, sum(rate(execution_duration_seconds_bucket[5m])) by (le, context_id))

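# Slow executions (> 32s) per context: all observations minus those at or below the 32s bucket
# The le value must match the exported label text exactly (some exporters render it as "32.0")
sum(rate(execution_duration_seconds_bucket{le="+Inf"}[5m])) by (context_id)
- sum(rate(execution_duration_seconds_bucket{le="32"}[5m])) by (context_id)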

Sync Bandwidth & Efficiency

# Bytes per message (average message size)
rate(sync_bytes_sent_total[5m]) / rate(sync_messages_sent_total[5m])

# Sync efficiency: entities per round trip (both sides aggregated so the label sets match)
sum(rate(sync_entities_transferred_total[5m])) / sum(rate(sync_round_trips_total[5m]))

# Protocol selection distribution
sum(rate(sync_protocol_selections_total[5m])) by (protocol)
/ ignoring(protocol) group_left sum(rate(sync_protocol_selections_total[5m]))

TEE Self-Purge Health

# CRITICAL: signing-key residue left on disk after eviction
increase(context_group_store_self_purge_failures_total{class="signing_key"}[15m]) > 0

# WARNING: a namespace stuck "retained" — repeated purge failure
increase(context_group_store_self_purge_reconcile_total{outcome="retained"}[1h]) > 0

# WARNING: dropped self-purge events — un-recoverable residue risk
rate(context_group_store_self_purge_events_dropped_total[5m]) > 0