Metrics Reference

All Prometheus metrics organized by subsystem

16 sync metrics, 5 channel metrics, 2 context metrics, 3 self-purge metrics

Sync Metrics

Metrics emitted by the sync engine covering protocol selection, data transfer, phase timing, and invariant monitoring.

crates/node/src/sync/prometheus_metrics.rs

Transfer & Volume

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sync_messages_sent | Family<Counter> | protocol | Total sync protocol messages sent |
| sync_bytes_sent | Family<Counter> | protocol | Total sync protocol bytes sent |
| sync_round_trips | Family<Counter> | protocol | Total sync round trips |
| sync_entities_transferred | Counter | | Total entities transferred during sync |

Transfer metric usage notes
sync_messages_sent — High values per protocol indicate frequent sync activity. Compare across protocols to understand which sync strategy dominates. Alert if any single protocol accounts for >90% of messages (may indicate stuck fallback).
sync_bytes_sent — Use with sync_messages_sent to compute average message size. Snapshot protocol will naturally produce larger values. A sudden spike may indicate a node falling behind and triggering full snapshots.
sync_round_trips — Higher round trips for hash comparison is normal (binary search). For snapshot sync, this should be very low (1-2). High round trips on level-wise sync suggests deep DAG divergence.
sync_entities_transferred — Track the rate of this counter. Sustained high rates indicate active state divergence across the cluster. Should be near-zero during steady state.
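The two derived values mentioned in these notes, average message size and the share of messages held by a single protocol, can be sketched in Python. This is an illustration only: the function names are hypothetical, the protocol label values (`hash_comparison`, `level_wise`, `snapshot`) are assumed rather than taken from the codebase, and the inputs stand for counter increases over one scrape window.

```python
from typing import Dict, Optional, Tuple


def average_message_size(bytes_delta: float, messages_delta: float) -> Optional[float]:
    """Average bytes per message over a window, or None if nothing was sent."""
    if messages_delta <= 0:
        return None
    return bytes_delta / messages_delta


def dominant_protocol(
    messages_by_protocol: Dict[str, float], threshold: float = 0.9
) -> Optional[Tuple[str, float]]:
    """Return (protocol, share) when one protocol exceeds `threshold` of all messages."""
    total = sum(messages_by_protocol.values())
    if total <= 0:
        return None
    name, count = max(messages_by_protocol.items(), key=lambda item: item[1])
    share = count / total
    return (name, share) if share > threshold else None


# Hypothetical per-protocol increases in sync_messages_sent over one window.
window = {"hash_comparison": 5.0, "level_wise": 0.0, "snapshot": 95.0}
print(dominant_protocol(window))
print(average_message_size(2048.0, 4.0))
```

A non-None result from `dominant_protocol` corresponds to the ">90% of messages" condition described for sync_messages_sent.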

CRDT Operations

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sync_merges | Family<Counter> | protocol | Total CRDT merge operations |
| sync_comparisons | Counter | | Total entity hash comparisons |

CRDT metric usage notes
sync_merges — Each merge represents a conflict resolution. High merge rates relative to entity transfers indicate heavy concurrent writes. Useful for tuning write concurrency.
sync_comparisons — Hash comparisons are cheap but high volume indicates the hash-compare protocol is doing a lot of work. The ratio of comparisons to entities transferred measures sync efficiency (lower ratio = more divergence).

Phase Timing

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sync_phase_duration_seconds | Family<Histogram> | phase | Duration of sync phases (buckets: 1ms, 2ms, 4ms, 8ms, 16ms, 32ms, 64ms, 128ms, 256ms, 512ms, 1s, 2s, 4s, 8s, 16s) |
| sync_duration_seconds | Family<Histogram> | protocol | Full sync session duration (buckets: 10ms, 20ms, 40ms, 80ms, 160ms, 320ms, 640ms, 1.28s, 2.56s, 5.12s, 10.24s, 20.48s, 40.96s, 81.92s, 160s) |

Timing metric usage notes
sync_phase_duration_seconds — Break down which phase is the bottleneck. Phases include: discovery, comparison, transfer, and apply. If the "apply" phase is slow, storage writes may be the bottleneck. If "transfer" is slow, check network bandwidth.
sync_duration_seconds — The end-to-end duration of a complete sync session. Hash comparison should complete in <100ms; level-wise in <1s; snapshot may take 10s+. Alert on p99 > 30s for any protocol.
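Both bucket lists contain 15 boundaries that double from the first one. A minimal Python sketch, assuming the boundaries come from a start-times-factor series (the helper below is illustrative and not taken from the codebase), reproduces them. Under that assumption the boundaries listed as "1s" and "160s" would actually be 1.024 s and 163.84 s.

```python
from typing import List


def exponential_buckets(start: float, factor: float, count: int) -> List[float]:
    """Upper bounds start, start*factor, ..., start*factor**(count-1)."""
    return [start * factor ** i for i in range(count)]


# Assumed parameters, inferred from the bucket lists in the table.
phase_buckets = exponential_buckets(0.001, 2.0, 15)    # sync_phase_duration_seconds
session_buckets = exponential_buckets(0.010, 2.0, 15)  # sync_duration_seconds

for name, buckets in (("phase", phase_buckets), ("session", session_buckets)):
    print(name, [round(b, 3) for b in buckets])
```

Note that the suggested alert threshold of p99 > 30s falls between the 20.48s and 40.96s boundaries, so any quantile reported in that range is an interpolated estimate rather than an exact measurement.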

Invariant Monitoring

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sync_snapshot_blocked | Counter | | Snapshot blocked on initialized nodes (invariant I5) |
| sync_verification_failures | Counter | | Snapshot verification failures (invariant I7) |
| sync_lww_fallback | Counter | | LWW fallback events |
| sync_buffer_drops | Counter | | Delta buffer drop events (invariant I6 risk) |

Invariant metric usage notes
sync_snapshot_blocked — Should always be 0 in normal operation. Any increment means a snapshot sync was attempted on an already-initialized node, which is blocked by safety invariant I5. Investigate immediately if non-zero.
sync_verification_failures — Indicates post-snapshot state hash mismatch (invariant I7). Non-zero values suggest data corruption or non-deterministic execution. Requires urgent investigation.
sync_lww_fallback — LWW (last-writer-wins) fallback is a safety net when normal merge fails. Occasional occurrences are acceptable; sustained increases may indicate a bug in CRDT merge logic.
sync_buffer_drops — Delta buffer is bounded; drops mean incoming deltas are arriving faster than they can be processed (invariant I6 risk). May lead to data loss requiring catch-up sync. Alert on any increment.

Protocol Selection & Outcomes

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sync_attempts | Family<Counter> | protocol | Total sync attempts by protocol |
| sync_successes | Family<Counter> | protocol | Successful syncs by protocol |
| sync_failures | Family<Counter> | protocol | Failed syncs by protocol |
| sync_protocol_selections | Family<Counter> | protocol | Protocol selection decisions |

Protocol outcome usage notes
sync_attempts / sync_successes / sync_failures — Compute success rate per protocol: sync_successes / sync_attempts. Hash comparison should have >95% success rate. If snapshot failures are non-zero, check network stability and storage health.
sync_protocol_selections — Shows which protocol the adaptive selector chose. Healthy clusters should see mostly hash comparison. Increasing snapshot selections indicates growing state divergence across nodes.
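The success-rate calculation described above can be sketched as follows. The function names, the protocol label values, and the sample numbers are hypothetical. The only figure taken from the notes is the 95% target for hash comparison.

```python
from typing import Dict, List


def success_rates(attempts: Dict[str, float], successes: Dict[str, float]) -> Dict[str, float]:
    """successes / attempts per protocol, skipping protocols with no attempts."""
    return {
        protocol: successes.get(protocol, 0.0) / attempted
        for protocol, attempted in attempts.items()
        if attempted > 0
    }


def below_target(rates: Dict[str, float], targets: Dict[str, float]) -> List[str]:
    """Protocols whose success rate is under their configured target."""
    return sorted(p for p, target in targets.items() if p in rates and rates[p] < target)


# Hypothetical counter increases over one window.
attempts = {"hash_comparison": 200.0, "snapshot": 4.0, "level_wise": 0.0}
successes = {"hash_comparison": 180.0, "snapshot": 4.0}
rates = success_rates(attempts, successes)
print(rates)
print(below_target(rates, {"hash_comparison": 0.95}))
```

Skipping protocols with zero attempts avoids a division by zero and keeps idle protocols from looking like failures.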

Network Event Channel Metrics

Metrics from the bounded event channel between the network layer and node manager. All metrics are registered under the network_event_channel sub-registry.

crates/node/src/network_event_channel.rs

| Metric | Type | Description |
| --- | --- | --- |
| depth | Gauge | Current events waiting in channel |
| received_total | Counter | Total events sent to channel |
| processed_total | Counter | Total events received from channel |
| dropped_total | Counter | Events dropped (channel full) |
| processing_latency_seconds | Histogram | Event send-to-processing latency |

Channel metric usage notes
depth — Instantaneous queue depth. Sustained high values (>80% of capacity) indicate the NodeManager can't keep up with incoming network events. May cause drops.
received_total / processed_total — The difference received_total - processed_total should equal depth + dropped_total at any point. Growing divergence indicates processing lag.
dropped_total — Any increment means the channel was full and events were lost. Dropped events may cause missed gossip messages or delayed sync triggers. Alert on any non-zero rate.
processing_latency_seconds — Measures end-to-end event latency from when NetworkManager enqueues to when NodeManager dequeues. p99 > 500ms indicates backpressure. p50 should be < 10ms in healthy systems.
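The accounting identity in the received_total / processed_total note can be checked with a toy model of a bounded channel. This is a sketch, not the channel's actual implementation: the class is hypothetical, and it assumes that received_total is incremented for every send attempt, including attempts that are dropped because the channel is full.

```python
from collections import deque


class MeteredChannel:
    """Toy bounded queue that tracks the same four quantities as the metrics."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue = deque()
        self.received_total = 0
        self.processed_total = 0
        self.dropped_total = 0

    @property
    def depth(self) -> int:
        return len(self.queue)

    def send(self, event) -> bool:
        self.received_total += 1  # assumption: counted even when the event is dropped
        if len(self.queue) >= self.capacity:
            self.dropped_total += 1
            return False
        self.queue.append(event)
        return True

    def recv(self):
        if not self.queue:
            return None
        self.processed_total += 1
        return self.queue.popleft()

    def accounting_holds(self) -> bool:
        return self.received_total - self.processed_total == self.depth + self.dropped_total


channel = MeteredChannel(capacity=2)
for event in ("a", "b", "c"):  # the third send finds the channel full
    channel.send(event)
channel.recv()
print(channel.depth, channel.dropped_total, channel.accounting_holds())
```

If received_total were instead incremented only for accepted events, the identity would reduce to received_total - processed_total = depth, so it is worth confirming which convention the exporter follows before alerting on it.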

Context Metrics

Per-context execution metrics for monitoring WASM runtime performance.

crates/context/src/metrics.rs

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| execution_count | Family<Gauge> | context_id | Context runtime execution counter |
| execution_duration_seconds | Family<Histogram> | context_id | Execution duration (buckets: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s, 1024s) |

Context metric usage notes
execution_count — A gauge tracking cumulative executions per context. Use rate() to get executions per second. Contexts with unusually high execution rates may be under heavy client load or in a retry loop.
execution_duration_seconds — Measures WASM method execution time including host function calls and storage I/O. Bucket boundaries start at 1s — anything in the first bucket is healthy. Executions > 64s may indicate infinite loops or expensive storage operations. Consider setting WASM gas limits.

Governance metrics (planned)

These Prometheus-style names are reserved for future governance observability. They are not necessarily registered in the codebase yet; treat this as a specification placeholder.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| calimero_governance_ops_applied_total | Counter | | Total governance operations applied successfully |
| calimero_governance_ops_rejected_total | Counter | reason (suggested) | Total governance operations rejected (authorization failures, stale state, validation, etc.) |
| calimero_governance_dag_heads | Gauge | group_id (suggested) | Current number of DAG heads per group |
| calimero_governance_heartbeat_mismatches_total | Counter | | Heartbeat head comparison failures |
Planned — Implementations may use different label sets or histograms; align exported names with this table when adding instrumentation.

TEE Self-Purge Metrics

Observability for the self-purge cleanup that runs when a hardware-attested fleet node is evicted from its ReadOnlyTee role. Self-purge hard-deletes a node's own local rows and keys (signing-key material and, as of #2776, the per-group encryption keys) and is recovered across restarts by a marker plus a startup reconcile sweep. See Membership & Leave and TEE Fleet HA.

crates/governance-store/src/metrics.rs
Naming — These families register under the contextgroup_store sub-registries, so the exported Prometheus names carry the context_group_store_ prefix (e.g. context_group_store_self_purge_failures_total). The registered names are shown below.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| self_purge_failures_total | Family<Counter> | branch, class | Self-purge local-state cleanup failures. branch ∈ {namespace, subgroup}; class ∈ {signing_key, context_cleanup}. The class="signing_key" series is the security-relevant one (forward-secrecy residue on disk); class="context_cleanup" is a best-effort dead-pointer leak and is informational. |
| self_purge_reconcile_total | Family<Counter> | outcome | Startup reconcile-sweep outcomes, one increment per marked namespace processed. outcome ∈ {reconciled, retained, cleared_stale, stale_clear_failed, skipped}. An outcome of retained that persists across restarts means a signing-key purge keeps failing; read-uncertainty lands in skipped (not in self_purge_failures_total). |
| self_purge_events_dropped_total | Counter | | Self-purge op-events dropped by the broadcast Lagged arm (subscriber fell behind). An upper bound on dropped eviction events, each of which writes no reconcile marker and so leaves un-recoverable on-disk residue (bounded; not a forward-secrecy hole, since forward secrecy is held by key rotation). |

Self-purge metric usage notes
self_purge_failures_total — Alert on any non-zero rate of class="signing_key": it means a node's own private signing-key material may linger on disk after eviction, pending the reconcile sweep. class="context_cleanup" is informational (orphaned dead-pointer rows); namespace deletion and unsubscribe still proceed.
self_purge_reconcile_total — In steady state the startup sweep should be quiet. A namespace persistently in outcome="retained" across restarts indicates a signing-key purge that keeps failing — investigate. stale_clear_failed is benign (re-evaluated next restart); skipped means the sweep declined to purge under read uncertainty (never purge on uncertainty).
self_purge_events_dropped_total — Any non-zero rate means the self-purge subscriber fell more than the broadcast capacity behind and some evictions may have left un-reconcilable on-disk residue. Bounded and not a forward-secrecy hole, but worth alerting as a correctness signal.
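The triage rules in these notes can be summarized as a small lookup. The Python sketch below is illustrative only: the function name and the severity strings are hypothetical, and the inputs are the registered metric name, its labels, and the counter increase over the evaluation window.

```python
from typing import Dict


def self_purge_severity(metric: str, labels: Dict[str, str], increase: float) -> str:
    """Map a self-purge counter increase to a severity, following the usage notes."""
    if increase <= 0:
        return "ok"
    if metric == "self_purge_failures_total":
        # Signing-key residue is the security-relevant case; context cleanup is informational.
        return "critical" if labels.get("class") == "signing_key" else "info"
    if metric == "self_purge_reconcile_total":
        # "retained" points at a purge that keeps failing; other outcomes are informational here.
        return "warning" if labels.get("outcome") == "retained" else "info"
    if metric == "self_purge_events_dropped_total":
        return "warning"
    return "unknown"


print(self_purge_severity("self_purge_failures_total", {"class": "signing_key"}, 1))
print(self_purge_severity("self_purge_reconcile_total", {"outcome": "skipped"}, 2))
```

A single "retained" increment is treated as a warning here for simplicity; the notes describe the concerning case as one that persists across restarts, so a real rule would likely look at a longer window.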

Dashboard Queries

Example PromQL queries for common monitoring scenarios. Adapt the label selectors to match your deployment.

Sync Health Overview

# Sync success rate by protocol (5m window)
sum(rate(sync_successes_total{protocol=~".*"}[5m])) by (protocol)
/ sum(rate(sync_attempts_total{protocol=~".*"}[5m])) by (protocol)
# Sync failure rate — alert when > 0.1/s
sum(rate(sync_failures_total[5m])) by (protocol) > 0.1
# P99 sync duration by protocol
histogram_quantile(0.99, sum(rate(sync_duration_seconds_bucket[5m])) by (le, protocol))
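For readers unfamiliar with how histogram_quantile derives a percentile from bucket counters, the Python sketch below mirrors the linear interpolation that Prometheus documents for classic histograms. It is a simplified illustration, not Prometheus's code: the function is local to this example, the sample bucket counts are made up, and it assumes the lowest bucket starts at 0.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for four buckets.
sample = [(0.1, 50.0), (1.0, 90.0), (10.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.75, sample))
```

Because the result is interpolated inside one bucket, its precision is limited by the bucket width, which matters most for the wide upper buckets.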

Invariant Alerts

# CRITICAL: Any snapshot verification failure
increase(sync_verification_failures_total[5m]) > 0
# WARNING: Delta buffer drops (I6 risk)
rate(sync_buffer_drops_total[5m]) > 0
# WARNING: Snapshot blocked on initialized node (I5)
increase(sync_snapshot_blocked_total[5m]) > 0

Network Channel Backpressure

# Channel utilization (assuming capacity of 1000)
network_event_channel_depth / 1000
# Event drop rate — alert on any drops
rate(network_event_channel_dropped_total[1m]) > 0
# P99 event processing latency
histogram_quantile(0.99, sum(rate(network_event_channel_processing_latency_seconds_bucket[5m])) by (le))

Context Execution Performance

# Top 5 contexts by execution rate
topk(5, rate(execution_count[5m]))
# P95 execution duration per context
histogram_quantile(0.95, sum(rate(execution_duration_seconds_bucket[5m])) by (le, context_id))
# Slow executions (> 32s) per context: all observations minus those at or below the 32s bucket
sum(rate(execution_duration_seconds_count[5m])) by (context_id)
- sum(rate(execution_duration_seconds_bucket{le="32"}[5m])) by (context_id)
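Histogram bucket counters are cumulative, so the number of observations above a boundary is the total minus the bucket at that boundary. Subtracting in the other direction gives a value that is never positive. A small Python sketch with made-up counts (the helper name is hypothetical) shows the difference:

```python
from typing import Dict


def observations_above(cumulative: Dict[str, float], boundary: str) -> float:
    """Observations strictly above `boundary`, given cumulative bucket counts keyed by le."""
    return cumulative["+Inf"] - cumulative[boundary]


# Made-up cumulative counts for a few execution_duration_seconds buckets.
sample_buckets = {"1": 10.0, "32": 97.0, "1024": 100.0, "+Inf": 100.0}

slow = observations_above(sample_buckets, "32")
reversed_difference = sample_buckets["32"] - sample_buckets["1024"]
print(slow, reversed_difference)
```

The threshold has to be one of the exported bucket boundaries; with the buckets listed for this histogram, 32s is the closest boundary to a 30s target.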

Sync Bandwidth & Efficiency

# Bytes per message (average message size)
rate(sync_bytes_sent_total[5m]) / rate(sync_messages_sent_total[5m])
# Sync efficiency: entities per round trip
sum(rate(sync_entities_transferred_total[5m])) / sum(rate(sync_round_trips_total[5m]))
# Protocol selection distribution
sum(rate(sync_protocol_selections_total[5m])) by (protocol)
/ ignoring(protocol) group_left sum(rate(sync_protocol_selections_total[5m]))

TEE Self-Purge Health

# CRITICAL: signing-key residue left on disk after eviction
increase(context_group_store_self_purge_failures_total{class="signing_key"}[15m]) > 0
# WARNING: a namespace stuck "retained" — repeated purge failure
increase(context_group_store_self_purge_reconcile_total{outcome="retained"}[1h]) > 0
# WARNING: dropped self-purge events — un-recoverable residue risk
rate(context_group_store_self_purge_events_dropped_total[5m]) > 0