Metrics Reference
All Prometheus metrics organized by subsystem
Sync Metrics
Metrics emitted by the sync engine covering protocol selection, data transfer, phase timing, and invariant monitoring.
crates/node/src/sync/prometheus_metrics.rs
Transfer & Volume
| Metric | Type | Labels | Description |
|---|---|---|---|
| sync_messages_sent | Family<Counter> | protocol | Total sync protocol messages sent |
| sync_bytes_sent | Family<Counter> | protocol | Total sync protocol bytes sent |
| sync_round_trips | Family<Counter> | protocol | Total sync round trips |
| sync_entities_transferred | Counter | — | Total entities transferred during sync |
Transfer metric usage notes
CRDT Operations
| Metric | Type | Labels | Description |
|---|---|---|---|
| sync_merges | Family<Counter> | protocol | Total CRDT merge operations |
| sync_comparisons | Counter | — | Total entity hash comparisons |
CRDT metric usage notes
Phase Timing
| Metric | Type | Labels | Description |
|---|---|---|---|
| sync_phase_duration_seconds | Family<Histogram> | phase | Duration of sync phases (buckets: 1ms, 2ms, 4ms, 8ms, 16ms, 32ms, 64ms, 128ms, 256ms, 512ms, 1s, 2s, 4s, 8s, 16s) |
| sync_duration_seconds | Family<Histogram> | protocol | Full sync session duration (buckets: 10ms, 20ms, 40ms, 80ms, 160ms, 320ms, 640ms, 1.28s, 2.56s, 5.12s, 10.24s, 20.48s, 40.96s, 81.92s, 160s) |
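The bucket lists above look like doubling series whose labels have been rounded for display (for example "1s" where a strict doubling from 1ms gives 1.024s). The source does not confirm how the buckets are generated, so the following is only a sketch under that assumption. The helper `exponential_buckets` is a hypothetical stand-in written against the standard library, not the API of any metrics crate.

```rust
/// Build `count` bucket upper bounds: start, start*factor, start*factor^2, ...
/// Hypothetical helper; assumes the documented buckets are a doubling series.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    let mut bounds = Vec::with_capacity(count);
    let mut bound = start;
    for _ in 0..count {
        bounds.push(bound);
        bound *= factor;
    }
    bounds
}

/// Assumed layout for sync_phase_duration_seconds: 15 buckets from 1ms.
fn phase_buckets() -> Vec<f64> {
    exponential_buckets(0.001, 2.0, 15)
}

/// Assumed layout for sync_duration_seconds: 15 buckets from 10ms.
fn session_buckets() -> Vec<f64> {
    exponential_buckets(0.01, 2.0, 15)
}
```

If this assumption holds, the largest finite bounds are 16.384s and 163.84s rather than the rounded 16s and 160s, which matters when a query selects a bucket by an exact `le` value.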
Timing metric usage notes
Invariant Monitoring
| Metric | Type | Labels | Description |
|---|---|---|---|
| sync_snapshot_blocked | Counter | — | Snapshot blocked on initialized nodes (invariant I5) |
| sync_verification_failures | Counter | — | Snapshot verification failures (invariant I7) |
| sync_lww_fallback | Counter | — | LWW fallback events |
| sync_buffer_drops | Counter | — | Delta buffer drop events (invariant I6 risk) |
Invariant metric usage notes
Protocol Selection & Outcomes
| Metric | Type | Labels | Description |
|---|---|---|---|
| sync_attempts | Family<Counter> | protocol | Total sync attempts by protocol |
| sync_successes | Family<Counter> | protocol | Successful syncs by protocol |
| sync_failures | Family<Counter> | protocol | Failed syncs by protocol |
| sync_protocol_selections | Family<Counter> | protocol | Protocol selection decisions |
Protocol outcome usage notes
Network Event Channel Metrics
Metrics from the bounded event channel between the network layer and node manager. All metrics are registered under the network_event_channel sub-registry.
crates/node/src/network_event_channel.rs
| Metric | Type | Description |
|---|---|---|
| depth | Gauge | Current events waiting in channel |
| received_total | Counter | Total events sent to channel |
| processed_total | Counter | Total events received from channel |
| dropped_total | Counter | Events dropped (channel full) |
| processing_latency_seconds | Histogram | Event send-to-processing latency |
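To make the relationship between these counters concrete, here is a minimal sketch of a metered bounded channel built on `std::sync::mpsc::sync_channel`. It is not the implementation in `network_event_channel.rs`. The types `ChannelMetrics`, `MeteredSender`, and `MeteredReceiver` are hypothetical, and the sketch assumes that `received_total` counts only events the channel accepted and that a full channel drops the event rather than blocking.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Hypothetical counters mirroring the metric names in the table.
#[derive(Default)]
struct ChannelMetrics {
    depth: AtomicI64, // signed so a racing decrement cannot wrap around
    received_total: AtomicU64,
    processed_total: AtomicU64,
    dropped_total: AtomicU64,
}

struct MeteredSender<T> {
    inner: SyncSender<T>,
    metrics: Arc<ChannelMetrics>,
}

impl<T> MeteredSender<T> {
    /// Enqueue without blocking; count a drop when the event is not accepted.
    fn send(&self, event: T) -> bool {
        match self.inner.try_send(event) {
            Ok(()) => {
                self.metrics.received_total.fetch_add(1, Ordering::Relaxed);
                self.metrics.depth.fetch_add(1, Ordering::Relaxed);
                true
            }
            // Assumption: a closed receiver is also counted as a drop.
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.metrics.dropped_total.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }
}

struct MeteredReceiver<T> {
    inner: Receiver<T>,
    metrics: Arc<ChannelMetrics>,
}

impl<T> MeteredReceiver<T> {
    /// Dequeue one event if available and update the counters.
    fn try_recv(&self) -> Option<T> {
        let event = self.inner.try_recv().ok()?;
        self.metrics.processed_total.fetch_add(1, Ordering::Relaxed);
        self.metrics.depth.fetch_sub(1, Ordering::Relaxed);
        Some(event)
    }
}

/// Create a bounded channel (capacity >= 1) with shared metrics.
fn metered_channel<T>(
    capacity: usize,
) -> (MeteredSender<T>, MeteredReceiver<T>, Arc<ChannelMetrics>) {
    let (tx, rx) = sync_channel(capacity);
    let metrics = Arc::new(ChannelMetrics::default());
    (
        MeteredSender { inner: tx, metrics: Arc::clone(&metrics) },
        MeteredReceiver { inner: rx, metrics: Arc::clone(&metrics) },
        metrics,
    )
}
```

Under these assumptions, once the channel is idle `depth` equals `received_total` minus `processed_total`, and a steadily rising `dropped_total` indicates that the consumer cannot keep up with the producer.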
Channel metric usage notes
Context Metrics
Per-context execution metrics for monitoring WASM runtime performance.
crates/context/src/metrics.rs
| Metric | Type | Labels | Description |
|---|---|---|---|
| execution_count | Family<Gauge> | context_id | Context runtime execution counter |
| execution_duration_seconds | Family<Histogram> | context_id | Execution duration (buckets: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s, 1024s) |
Context metric usage notes
Governance Metrics (Planned)
These Prometheus-style names are reserved for future governance observability. They may not be registered in the codebase yet; treat this table as a specification placeholder.
| Metric | Type | Labels | Description |
|---|---|---|---|
| calimero_governance_ops_applied_total | Counter | — | Total governance operations applied successfully |
| calimero_governance_ops_rejected_total | Counter | reason (suggested) | Total governance operations rejected (authorization failures, stale state, validation, etc.) |
| calimero_governance_dag_heads | Gauge | group_id (suggested) | Current number of DAG heads per group |
| calimero_governance_heartbeat_mismatches_total | Counter | — | Heartbeat head comparison failures |
TEE Self-Purge Metrics
Observability for the self-purge cleanup that runs when a hardware-attested fleet node is evicted from its ReadOnlyTee role. Self-purge hard-deletes a node's own local rows and keys (signing-key material and, as of #2776, the per-group encryption keys) and is recovered across restarts by a marker plus a startup reconcile sweep. See Membership & Leave and TEE Fleet HA.
crates/governance-store/src/metrics.rs
| Metric | Type | Labels | Description |
|---|---|---|---|
| self_purge_failures_total | Family<Counter> | branch, class | Self-purge local-state cleanup failures. branch ∈ {namespace, subgroup}; class ∈ {signing_key, context_cleanup}. The class="signing_key" series is the security-relevant one (forward-secrecy residue on disk); class="context_cleanup" is a best-effort dead-pointer leak and is informational. |
| self_purge_reconcile_total | Family<Counter> | outcome | Startup reconcile-sweep outcomes, one increment per marked namespace processed. outcome ∈ {reconciled, retained, cleared_stale, stale_clear_failed, skipped}. A retained outcome that recurs across restarts means a signing-key purge keeps failing. Read uncertainty is counted under skipped, not in self_purge_failures_total. |
| self_purge_events_dropped_total | Counter | — | Self-purge op-events dropped by the broadcast Lagged arm (the subscriber fell behind). This is an upper bound on dropped eviction events. A dropped eviction event writes no reconcile marker, so its on-disk residue is unrecoverable. The residue is bounded and is not a forward-secrecy hole, because forward secrecy is held by key rotation. |
Self-purge metric usage notes
Dashboard Queries
Example PromQL queries for common monitoring scenarios. Adapt the label selectors to match your deployment.
Sync Health Overview
sum(rate(sync_successes_total{protocol=~".*"}[5m])) by (protocol)
/ sum(rate(sync_attempts_total{protocol=~".*"}[5m])) by (protocol)
sum(rate(sync_failures_total[5m])) by (protocol) > 0.1
histogram_quantile(0.99, sum(rate(sync_duration_seconds_bucket[5m])) by (le, protocol))
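For readers unfamiliar with how `histogram_quantile` turns cumulative bucket counts into a latency estimate, the following sketch shows the core idea: find the bucket containing the target rank and interpolate linearly inside it. This is a simplified illustration, not Prometheus's implementation. It assumes the buckets are sorted, end with a `+Inf` bucket, and that the lowest bucket starts at zero.

```rust
/// Estimate the q-quantile from cumulative histogram buckets.
/// `buckets` holds (upper_bound, cumulative_count), sorted by bound,
/// with the last bound equal to +Inf. Simplified sketch only.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || buckets.len() < 2 {
        return None;
    }
    let total = buckets.last()?.1;
    if total <= 0.0 {
        return None; // no observations, nothing to estimate
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    let (upper, count) = buckets[idx];
    if upper.is_infinite() {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets.get(idx.checked_sub(1)?).map(|b| b.0);
    }
    let (lower, below) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count <= below {
        return Some(lower); // empty bucket, avoid dividing by zero
    }
    // Linear interpolation within the bucket.
    Some(lower + (upper - lower) * (rank - below) / (count - below))
}
```

Because the estimate is interpolated inside a bucket, its resolution is limited by the bucket width. With doubling buckets, a p99 that lands in a wide upper bucket can be off by up to that bucket's width.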
Invariant Alerts
increase(sync_verification_failures_total[5m]) > 0
rate(sync_buffer_drops_total[5m]) > 0
increase(sync_snapshot_blocked_total[5m]) > 0
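The alerts above rely on `increase()` and `rate()` tolerating counter resets, such as a node restart that sets the counter back to zero. The sketch below illustrates the reset handling only: a sample lower than its predecessor is treated as a restart from zero. It deliberately omits the extrapolation to window boundaries that Prometheus also applies, and `counter_increase` is a hypothetical name.

```rust
/// Sum the growth of a counter over consecutive samples.
/// A drop between samples is treated as a reset to zero, so the new
/// value counts in full. Simplified: no window extrapolation.
fn counter_increase(samples: &[f64]) -> f64 {
    samples
        .windows(2)
        .map(|w| if w[1] >= w[0] { w[1] - w[0] } else { w[1] })
        .sum()
}

/// Mirror of an `increase(...) > 0` alert condition on raw samples.
fn should_alert(samples: &[f64]) -> bool {
    counter_increase(samples) > 0.0
}
```

For the samples 5, 7, 2, 3 this yields 5: two before the reset, two counted after the restart, and one more. A flat series yields zero and does not fire.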
Network Channel Backpressure
network_event_channel_depth / 1000
rate(network_event_channel_dropped_total[1m]) > 0
histogram_quantile(0.99, sum(rate(network_event_channel_processing_latency_seconds_bucket[5m])) by (le))
Context Execution Performance
topk(5, rate(execution_count[5m]))
histogram_quantile(0.95, sum(rate(execution_duration_seconds_bucket[5m])) by (le, context_id))
sum(rate(execution_duration_seconds_bucket{le="1024"}[5m])) by (context_id)
- sum(rate(execution_duration_seconds_bucket{le="32"}[5m])) by (context_id)
Sync Bandwidth & Efficiency
rate(sync_bytes_sent_total[5m]) / rate(sync_messages_sent_total[5m])
sum(rate(sync_entities_transferred_total[5m])) / sum(rate(sync_round_trips_total[5m]))
sum(rate(sync_protocol_selections_total[5m])) by (protocol)
/ ignoring(protocol) group_left sum(rate(sync_protocol_selections_total[5m]))
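The protocol-selection query divides each protocol's rate by the total across all protocols, giving every protocol's share of selections. The same arithmetic is sketched below on plain numbers. The function name `selection_shares` and the protocol names used in the usage note are hypothetical, since the source does not list the actual `protocol` label values.

```rust
use std::collections::BTreeMap;

/// Convert per-protocol selection rates into fractions of the total.
/// Returns an empty map when there were no selections at all.
fn selection_shares(rates: &BTreeMap<&str, f64>) -> BTreeMap<String, f64> {
    let total: f64 = rates.values().sum();
    if total <= 0.0 {
        return BTreeMap::new();
    }
    rates
        .iter()
        .map(|(protocol, rate)| (protocol.to_string(), rate / total))
        .collect()
}
```

With illustrative rates of 3.0 for a protocol called "delta" and 1.0 for one called "snapshot", the shares are 0.75 and 0.25. The shares always sum to 1 when any selections occurred, which makes them suitable for a stacked or pie panel.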
TEE Self-Purge Health
increase(context_group_store_self_purge_failures_total{class="signing_key"}[15m]) > 0
increase(context_group_store_self_purge_reconcile_total{outcome="retained"}[1h]) > 0
rate(context_group_store_self_purge_events_dropped_total[5m]) > 0