TEE Fleet High Availability
Hardware-attested read replicas that hold group keys and serve a namespace's data
What fleet HA is
A TEE fleet is a pool of hardware-attested replica nodes that hold a namespace's group keys and serve reads for its data. Each replica joins as a ReadOnlyTee member — it can decrypt and answer sync requests, but it can never write. The fleet exists so that a namespace's data stays available and queryable even when the human-operated owner node is offline, NAT'd, or asleep.
Two planes cooperate. The control plane (the mdma service) is the source of truth for entitlement: a namespace is configured or paid for once, and mdma decides which fleet nodes should serve it. A fleet sidecar (shipped by the mero-tee repo) runs alongside each merod replica and drives the per-node lifecycle by calling meroctl. Core itself owns the on-chain governance: attestation verification, admission, key delivery, and purge.
The distinction to keep straight: entitlement is per-namespace (paid/configured once), but admission is per-subgroup (each Restricted subgroup the replica should read needs its own membership row and key). The sections below walk the full lifecycle: fleet-join, admission, key delivery, transparent subgroup follow-on, and disable/leave/purge.
Roles & boundaries
The ReadOnlyTee role
- Directly-rowed, never inherited. A ReadOnlyTee membership is always a stored (member, group) row written by admission. It is never conferred by the inherited-membership parent-walk — a replica is only ever a member of the scopes it was explicitly admitted to.
- Read-only. The role cannot author governance ops or state deltas. It exists to decrypt and serve, not to mutate.
- Auto-follow on. Admission sets both auto-follow flags (contexts and subgroups) to true, so a replica tracks new contexts in the scopes it holds. See Auto-Follow for the propagation machinery.
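The first two properties can be sketched as follows. This is a minimal illustration, not the codebase's types: `Role`, `DirectRows`, and their methods are hypothetical names, and only the `ReadOnlyTee` role name comes from this page.

```rust
use std::collections::HashSet;

/// Hypothetical role enum; only `ReadOnlyTee` is named on this page.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Member,
    ReadOnlyTee,
}

impl Role {
    /// A ReadOnlyTee member decrypts and serves; it never authors ops or deltas.
    pub fn can_write(self) -> bool {
        !matches!(self, Role::ReadOnlyTee)
    }
}

/// Stored (member, group) rows written by admission. Illustrative only.
#[derive(Default)]
pub struct DirectRows {
    rows: HashSet<(String, String)>,
}

impl DirectRows {
    pub fn admit(&mut self, member: &str, group: &str) {
        self.rows.insert((member.to_string(), group.to_string()));
    }

    /// Answered from stored rows only: no parent walk, so a row at the
    /// root does not imply membership in any child group.
    pub fn is_tee_member(&self, member: &str, group: &str) -> bool {
        self.rows.contains(&(member.to_string(), group.to_string()))
    }
}
```

In this model a replica admitted at `root` is not a member of `restricted-child` until a separate row is written for that group.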
The namespace as the HA boundary
A namespace is its own root group — there is no separate Namespace type; the namespace is the no-parent ContextGroup at the top of a strict tree (see Membership & Leave for the hierarchy recap). Fleet HA is scoped to a namespace: entitlement, the admission policy, and the disable/purge boundary all sit at the namespace root. A replica admitted at the root serves the namespace; per-subgroup rows extend its reach into Restricted children.
Two distinct key families
Do not conflate them:
- Per-group encryption keys (GroupKeyring) — the AES keys that protect a group's replicated data. These are delivered to a replica after admission and are the subject of most of this page.
- The node's own storage / disk key — the AES-256 key merod fetches from the KMS at startup to encrypt its datastore and blobstore on disk. That is the subject of TEE Mode, a separate concern from group keys.
Fleet-join & attestation announce
The fleet sidecar starts a replica's join by calling meroctl tee fleet-join, which hits POST /admin-api/tee/fleet-join. The handler (crates/server/src/admin/handlers/tee/fleet_join.rs):
- Resolves the node's namespace identity (the keypair the replica joins under).
- Builds report_data = nonce(32) || Sha256(pubkey), binding a fresh random nonce to the node's public key.
- Calls generate_attestation to produce a real TDX quote (crates/tee-attestation/src/generate.rs). Mock quotes are rejected on this path — fleet-join is production hardware only.
- Broadcasts BroadcastMessage::TeeAttestationAnnounce { quote_bytes, public_key, nonce, node_type: ReadOnly } on the gossip topic ns/<hex(namespace_id)>.
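The report_data layout described above can be sketched as a plain byte concatenation. The function names are hypothetical, and the SHA-256 digest of the public key is assumed to be computed by the caller so that the sketch stays dependency-free. The 64-byte length matches the size of the TDX report data field.

```rust
/// Size of the TDX report_data field in bytes.
pub const REPORT_DATA_LEN: usize = 64;

/// Concatenates nonce(32) || Sha256(pubkey)(32).
/// `pubkey_sha256` is assumed to be the digest, computed elsewhere.
pub fn build_report_data(nonce: [u8; 32], pubkey_sha256: [u8; 32]) -> [u8; REPORT_DATA_LEN] {
    let mut report_data = [0u8; REPORT_DATA_LEN];
    report_data[..32].copy_from_slice(&nonce);
    report_data[32..].copy_from_slice(&pubkey_sha256);
    report_data
}

/// Splits report_data back into (nonce, pubkey digest), as a verifier
/// would before comparing the digest against the announced public key.
pub fn split_report_data(report_data: &[u8; REPORT_DATA_LEN]) -> ([u8; 32], [u8; 32]) {
    let mut nonce = [0u8; 32];
    let mut digest = [0u8; 32];
    nonce.copy_from_slice(&report_data[..32]);
    digest.copy_from_slice(&report_data[32..]);
    (nonce, digest)
}
```

Because the quote signs over report_data, a verifier that recomputes the digest from the announced public_key can confirm that the quote was produced for that key and that nonce.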
A single publish into an empty mesh is silently dropped: gossipsub has no replay, and at boot the replica may have no peers yet. The handler defends against this by re-announcing. It republishes the announce roughly every 2 s for up to ~30 s per call, issues a namespace bootstrap pull each cycle to seed the mesh, and polls for its own admission between announces. If the ~30 s budget elapses without admission, the call returns and the sidecar retries.
Announce loop, step by step
- Generate nonce, build report_data, generate the TDX quote bound to the namespace-identity pubkey.
- Publish TeeAttestationAnnounce on ns/<hex(namespace_id)>.
- Sleep ~2 s; run a namespace bootstrap pull to populate the mesh.
- Poll: has a verifier written a ReadOnlyTee row for this node yet?
- If not and the ~30 s budget remains, re-announce and loop. (No replay — every publish is fresh into whatever mesh now exists.)
- On admission, fall through to joining contexts and publishing the self-signed auto-follow op (see Auto-Follow).
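The loop above can be sketched with injected hooks, which keeps it testable without a network or a real clock. Everything here is an assumption for illustration: `announce_until_admitted`, `JoinOutcome`, and the four closures are hypothetical stand-ins for the handler's internals, and the exact ordering inside the real handler may differ.

```rust
/// Outcome of one fleet-join call in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum JoinOutcome {
    Admitted { announces: u32 },
    TimedOut { announces: u32 },
}

/// Re-announces until admitted or until the time budget is spent.
/// `publish`, `bootstrap_pull`, `is_admitted`, and `sleep` are hypothetical hooks.
pub fn announce_until_admitted(
    budget_secs: u64,
    interval_secs: u64,
    mut publish: impl FnMut(),
    mut bootstrap_pull: impl FnMut(),
    mut is_admitted: impl FnMut() -> bool,
    mut sleep: impl FnMut(u64),
) -> JoinOutcome {
    // Guard against a zero interval, which would never consume the budget.
    let step = interval_secs.max(1);
    let mut elapsed = 0;
    let mut announces = 0;
    loop {
        // Every publish is fresh: gossipsub does not replay earlier messages.
        publish();
        announces += 1;
        sleep(step);
        elapsed += step;
        // Seed the mesh so the next publish has peers to reach.
        bootstrap_pull();
        if is_admitted() {
            return JoinOutcome::Admitted { announces };
        }
        if elapsed >= budget_secs {
            // The caller (the sidecar) is expected to retry.
            return JoinOutcome::TimedOut { announces };
        }
    }
}
```

With a 30 s budget and a 2 s interval, this sketch publishes at most 15 announces per call before returning.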
Verifier admission & policy
A verifier is any existing member of the namespace that receives the announce. The inbound handler is crates/node/src/handlers/tee_attestation_admission.rs. It verifies the quote before trusting anything: verify_attestation (crates/tee-attestation/src/verify.rs) checks the DCAP signature, the freshness of the nonce in report_data, and the measurements. Only then does it call admit_tee_node (crates/context/src/handlers/admit_tee_node.rs).
Policy is namespace-scoped
The admission policy lives on the namespace root, never on a subgroup. read_tee_admission_policy (crates/governance-store/src/tee.rs) resolves whatever scope it is given up to the namespace root before reading. Any policy bytes attached to a subgroup are inert — see the Auto-Follow page's policy-scope note for why subgroup policies were deliberately disabled.
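Resolving a scope to the namespace root amounts to walking parent links until a group with no parent is reached. The sketch below assumes a hypothetical child-to-parent map in place of the real group tree. `resolve_namespace_root` is an illustrative name, not the signature of read_tee_admission_policy.

```rust
use std::collections::HashMap;

/// Walks parent links until it reaches the group with no parent (the namespace root).
/// `parents` maps child -> parent and is a hypothetical stand-in for the group tree.
pub fn resolve_namespace_root<'a>(
    scope: &'a str,
    parents: &'a HashMap<&'a str, &'a str>,
) -> &'a str {
    let mut current = scope;
    // Bound the walk by the number of links so a corrupt (cyclic) map cannot loop forever.
    for _ in 0..=parents.len() {
        match parents.get(current) {
            Some(parent) => current = *parent,
            None => return current,
        }
    }
    current
}
```

Whatever scope the caller passes in, the policy read then happens against the returned root, which is why policy bytes on a subgroup are never consulted.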
Admission checks
- Measurement allowlist. allowed_mrtd must be non-empty (an empty allowlist fails closed). The quote's mrtd, rtmr0..3, and tcb_status are checked against their allowlists.
- Mock gate. Mock quotes are admitted only when accept_mock is set — a development-only escape hatch.
- Per-group quote-hash replay. is_quote_hash_used rejects a quote whose hash was already consumed in this scope, blocking replay of a captured announce.
- Idempotent. If a direct row for this member already exists, admission is a no-op rather than an error — safe under the re-announce loop.
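The checks above can be sketched as a single decision function. This is a simplified model under stated assumptions: `Policy`, `Quote`, `Admission`, and `check_admission` are hypothetical, only the mrtd allowlist is modelled (rtmr0..3 and tcb_status are omitted), and the order of the checks is a guess rather than the order used by admit_tee_node.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down policy: only the fields named on this page.
pub struct Policy {
    pub allowed_mrtd: HashSet<String>,
    pub accept_mock: bool,
}

/// Hypothetical view of an already signature-verified quote.
pub struct Quote {
    pub mrtd: String,
    pub hash: String,
    pub is_mock: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Admitted,
    AlreadyMember,
    Rejected(&'static str),
}

pub fn check_admission(
    policy: &Policy,
    quote: &Quote,
    used_quote_hashes: &mut HashSet<String>,
    has_direct_row: bool,
) -> Admission {
    if has_direct_row {
        // Idempotent: a repeated announce for an existing member is a no-op.
        return Admission::AlreadyMember;
    }
    if policy.allowed_mrtd.is_empty() {
        return Admission::Rejected("empty allowed_mrtd fails closed");
    }
    if quote.is_mock && !policy.accept_mock {
        return Admission::Rejected("mock quote without accept_mock");
    }
    if !policy.allowed_mrtd.contains(&quote.mrtd) {
        return Admission::Rejected("mrtd not in allowlist");
    }
    // `insert` returns false when the hash was already recorded for this scope.
    if !used_quote_hashes.insert(quote.hash.clone()) {
        return Admission::Rejected("quote hash already used in this scope");
    }
    Admission::Admitted
}
```

Putting the idempotency check first is what keeps the re-announce loop harmless in this model: once a row exists, later announces never reach the replay check.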
What admission writes
On success the verifier writes a ReadOnlyTee direct row plus a signing key for the member, and publishes GroupOp::MemberJoinedViaTeeAttestation. That op propagating is exactly what the joiner's poll in fleet_join.rs is waiting on.
Key delivery & recovery
A row alone is not enough — the replica needs the group's encryption key to decrypt anything. After admission, a key-holder (the admitting verifier, or any member that holds the key) delivers it.
The one-shot delivery
The key-holder calls deliver_group_key_to_member, which publishes NamespaceOp::Root(RootOp::KeyDelivery { group_id, envelope }) on the namespace DAG. The envelope is ECDH-wrapped for the recipient's public key. The recipient applies it via apply_received_group_key and stores the key.
The durable recovery pull
A one-shot delivery can be missed (offline at delivery time, dropped broadcast). The joiner side therefore has a durable fallback: recover_missing_group_keys (crates/node/src/sync/manager/namespace_sync.rs) sends a GroupKeyRequest; a member-peer that holds the key answers with GroupKeyResponse. The responder's authorization is role-agnostic, so it will serve a ReadOnlyTee replica just as it would any member. The recovery pull runs:
- at the end of every namespace sync,
- on an interval, and
- on receipt of the relevant gossip event.
This turns key acquisition from a fragile single broadcast into a self-healing loop.
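One pass of such a loop can be sketched as "find the groups with a row but no key, then ask a peer for each". The names below are hypothetical: `missing_group_keys` and `recover_once` are illustrative, and the `request_key` closure stands in for the GroupKeyRequest / GroupKeyResponse exchange.

```rust
use std::collections::{BTreeSet, HashMap};

/// Groups this node has a membership row in but no stored key for.
pub fn missing_group_keys(
    member_of: &BTreeSet<String>,
    keyring: &HashMap<String, Vec<u8>>,
) -> Vec<String> {
    member_of
        .iter()
        .filter(|group| !keyring.contains_key(*group))
        .cloned()
        .collect()
}

/// One recovery pass: request each missing key and store any response.
/// Returns how many keys were recovered in this pass.
pub fn recover_once(
    member_of: &BTreeSet<String>,
    keyring: &mut HashMap<String, Vec<u8>>,
    mut request_key: impl FnMut(&str) -> Option<Vec<u8>>,
) -> usize {
    let mut recovered = 0;
    for group in missing_group_keys(member_of, keyring) {
        // A peer that does not hold the key (or is unreachable) yields None;
        // the group simply stays missing until a later pass.
        if let Some(key) = request_key(&group) {
            keyring.insert(group, key);
            recovered += 1;
        }
    }
    recovered
}
```

Because a pass with no answers changes nothing, it is safe to run from all three triggers listed above. The loop converges once any key-holding peer responds.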
Open subgroups need no per-subgroup key
Data in an Open-chain subgroup is encrypted under the namespace key, not a distinct per-subgroup key. A replica admitted at the namespace root already holds that key, so it can read Open subgroups with no additional delivery. Per-subgroup keys only matter for Restricted subgroups, which is what the next section is about.
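The rule can be sketched as a lookup from a subgroup to the group whose key protects it. This is a simplification that assumes a flat model where each subgroup sits directly under the namespace root. `Visibility`, `Subgroup`, and `key_scope` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Visibility {
    Open,
    Restricted,
}

/// Hypothetical subgroup descriptor for this sketch.
pub struct Subgroup<'a> {
    pub id: &'a str,
    pub visibility: Visibility,
}

/// Returns the id of the group whose key protects this subgroup's data.
pub fn key_scope<'a>(namespace_root: &'a str, subgroup: &Subgroup<'a>) -> &'a str {
    match subgroup.visibility {
        // Open data is encrypted under the namespace key.
        Visibility::Open => namespace_root,
        // Restricted data needs the subgroup's own key.
        Visibility::Restricted => subgroup.id,
    }
}
```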
Transparent per-subgroup admission (PR #2772)
Entitlement is per-namespace, but a Restricted subgroup is a private island: a root-admitted replica has no row in it and no key for it. Manually re-attesting into every Restricted subgroup would be brittle. PR #2772 makes this transparent — the replica is folded into the Restricted subgroups it should read without a fresh quote.
Scope: Restricted subgroups only. Open subgroups are already readable via inherited membership and the namespace key, so they need nothing here.
How it works
A key-holder-side subscriber (tee_subgroup_admit) reacts to two op events:
- OpEvent::SubgroupCreated — a new Restricted subgroup appears. The subscriber admits the namespace's existing root ReadOnlyTee members into it.
- OpEvent::TeeMemberAdmitted at the root — a new TEE replica joins the namespace. The subscriber admits it into the Restricted subgroups this node holds keys for.
Crucially, no new attestation is requested. The member's already-verified verdict is read back from the root op log and reused. Both branches call the same admit_tee_node used at the root — and because the actor holds the key, that call writes the row and delivers the per-subgroup key in one step.
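The two branches can be sketched as a pure planning function that maps an event to the admissions it should trigger. The shapes here are assumptions: the `restricted` and `at_root` fields, `KeyHolderState`, and `plan_admissions` are hypothetical, and the real subscriber calls admit_tee_node rather than returning a list.

```rust
use std::collections::BTreeSet;

/// Hypothetical event shapes; the field names are assumptions.
pub enum OpEvent {
    SubgroupCreated { subgroup: String, restricted: bool },
    TeeMemberAdmitted { member: String, at_root: bool },
}

/// What this key-holder knows locally. Illustrative only.
#[derive(Default)]
pub struct KeyHolderState {
    pub root_tee_members: BTreeSet<String>,
    pub restricted_subgroups_with_key: BTreeSet<String>,
}

/// Returns the (member, subgroup) admissions to perform for one event.
pub fn plan_admissions(state: &KeyHolderState, event: &OpEvent) -> Vec<(String, String)> {
    match event {
        // New Restricted subgroup: fold in every existing root TEE member.
        OpEvent::SubgroupCreated { subgroup, restricted: true } => state
            .root_tee_members
            .iter()
            .map(|member| (member.clone(), subgroup.clone()))
            .collect(),
        // New TEE member at the root: fold it into each Restricted subgroup we hold keys for.
        OpEvent::TeeMemberAdmitted { member, at_root: true } => state
            .restricted_subgroups_with_key
            .iter()
            .map(|subgroup| (member.clone(), subgroup.clone()))
            .collect(),
        // Open subgroups and non-root admissions trigger nothing.
        _ => Vec::new(),
    }
}
```

Ignoring admissions that are not at the root means the subgroup admissions this function plans cannot feed back into it.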
Two safety details
- Root-only loop guard. The subscriber acts only at the root, preventing fan-out echo (an admission triggering an event that triggers another admission).
- Bounded wake-then-reread. The apply path can emit the op event before the op log has durably persisted it. A bounded retry absorbs that timing window so the verdict read-back does not race the persist.
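The bounded re-read can be sketched as a small retry helper. The name `read_with_bounded_retry` and its closures are hypothetical. The actual attempt count and wait strategy are not stated on this page, so they are left as parameters.

```rust
/// Re-reads up to `max_attempts` times and returns the first value found.
/// `read` stands in for the verdict read-back from the op log;
/// `wait` is called between attempts with the zero-based attempt index.
pub fn read_with_bounded_retry<T>(
    max_attempts: u32,
    mut read: impl FnMut() -> Option<T>,
    mut wait: impl FnMut(u32),
) -> Option<T> {
    for attempt in 0..max_attempts {
        if let Some(value) = read() {
            return Some(value);
        }
        // Do not wait after the final attempt: there is nothing left to retry.
        if attempt + 1 < max_attempts {
            wait(attempt);
        }
    }
    None
}
```

The bound matters as much as the retry: if the verdict never appears, the subscriber gives up instead of blocking the event path indefinitely.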
Disable → leave → purge
Decommissioning a replica from a namespace is a coordinated, cross-repo flow. The control plane (mdma) is the source of truth and never reaches into the node directly — it uses soft disable.
Soft disable (control plane)
mdma drops the namespace from the fleet node's should-join assignments. That is the entire control-plane action: there is no kill switch sent to core. The mero-tee fleet sidecar notices the namespace has disappeared from its assignments — but only on a confirmed-good poll, so a transient assignment-fetch failure does not trigger an accidental teardown.
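The confirmed-good rule can be sketched as a set difference that only runs when the poll succeeded. This is an illustration of the idea, not the mero-tee sidecar's code: `namespaces_to_leave` is a hypothetical name, and the error type is a plain string for brevity.

```rust
use std::collections::BTreeSet;

/// Namespaces to leave after one assignment poll.
/// A failed poll yields nothing: absence is trusted only when the fetch succeeded.
pub fn namespaces_to_leave(
    joined: &BTreeSet<String>,
    poll: Result<BTreeSet<String>, String>,
) -> Vec<String> {
    match poll {
        // Joined locally but no longer assigned: a confirmed drop.
        Ok(should_join) => joined.difference(&should_join).cloned().collect(),
        // Transient fetch failure: keep serving everything.
        Err(_) => Vec::new(),
    }
}
```

Treating an error as "no change" rather than "empty assignments" is what prevents a control-plane outage from tearing down every replica at once.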
Leave (node, via sidecar)
On a confirmed drop, the sidecar runs meroctl namespace leave <namespace_id>. That publishes GroupOp::MemberLeft (crates/governance-store/src/ops/group/member_left.rs), which cascades: it removes the node's direct row at the root and in every subgroup it held a row in, emitting TeeMemberRemoved per subgroup. (General leave/eviction semantics are covered in Membership & Leave.)
Purge (node, role-scoped)
self_purge (crates/context/src/self_purge.rs) reacts to those removals. It is role-scoped: only ReadOnlyTee removals hard-purge; a non-TEE removal stays a soft-leave. For a TEE removal it runs PurgeAction::Namespace, which cascades the whole subtree and:
- deletes local replicated data,
- deletes the signing keys,
- deletes the AES group encryption keys (PR #2776), and
- unsubscribes from the namespace gossip topic.
A purge that is interrupted (crash mid-cascade) is not lost: a durable pending-self-purge marker plus a startup reconcile sweep complete it on the next restart (crates/governance-store/src/local_state.rs).
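The role scoping and the marker-plus-reconcile pattern can be sketched together. The names are hypothetical apart from ReadOnlyTee and PurgeAction::Namespace, and the marker is modelled as an in-memory bool, whereas the real marker is persisted so that it survives a restart.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Member,
    ReadOnlyTee,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PurgeAction {
    Namespace,
}

/// Only a ReadOnlyTee removal hard-purges; any other removal stays a soft-leave.
pub fn purge_action_for(removed_role: Role) -> Option<PurgeAction> {
    match removed_role {
        Role::ReadOnlyTee => Some(PurgeAction::Namespace),
        Role::Member => None,
    }
}

/// Runs a purge with a pending marker around it.
/// `pending_marker` stands in for the durable pending-self-purge marker.
pub fn purge_with_marker(
    pending_marker: &mut bool,
    mut purge: impl FnMut() -> Result<(), String>,
) -> Result<(), String> {
    *pending_marker = true; // set before any deletion starts
    purge()?; // a failure here leaves the marker set
    *pending_marker = false; // cleared only after the cascade completes
    Ok(())
}

/// Startup reconcile: finish any purge whose marker is still set.
/// Returns true when a pending purge was found and completed.
pub fn reconcile_on_startup(
    pending_marker: &mut bool,
    purge: impl FnMut() -> Result<(), String>,
) -> bool {
    if !*pending_marker {
        return false;
    }
    purge_with_marker(pending_marker, purge).is_ok()
}
```

This pattern relies on the purge steps being safe to repeat, since the reconcile sweep may re-run deletions that already succeeded before the interruption.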
Disable → purge sequence
- mdma drops the namespace from the node's should-join assignments (soft disable).
- Sidecar observes the drop on a confirmed-good poll and runs meroctl namespace leave.
- Core publishes GroupOp::MemberLeft; it cascades, emitting TeeMemberRemoved for the root and each subgroup row.
- self_purge sees the ReadOnlyTee removals and runs PurgeAction::Namespace.
- Local data, signing keys, and the AES group encryption keys (PR #2776) are deleted; the node unsubscribes from the gossip topic.
- If interrupted, the pending-self-purge marker + startup reconcile sweep finish the purge after restart.
Cross-repo contract. The disable decision and the assignment feed are owned by mdma; the leave trigger lives in the mero-tee sidecar; core owns the leave op, the cascade, and the purge. Changing the shape of any of these requires coordinating across the three repos.
Operator notes
- Purge observability. Watch self_purge_failures_total, self_purge_reconcile_total, and self_purge_events_dropped_total. A rising failures or dropped-events counter means purges are not completing cleanly; reconcile counts confirm the startup sweep is doing its job. (See Metrics Reference.)
- MRTD alignment. The node image's measurements and the allowed_mrtd allowlist in the namespace's admission policy must agree, or every fleet-join fails admission. After a node-image bump, update the namespace's admission policy before expecting replicas to join. Compare against the published measurements as in TEE Mode.
- Two key families. A replica that can fetch its KMS disk key but cannot decrypt group data has a group-key problem (delivery/recovery), not a KMS problem — and vice versa. Diagnose the right family.
- Restricted vs Open reads. If a replica reads some subgroups but not others, the gap is almost always missing per-subgroup admission/keys for Restricted children — the transparent subgroup admission (PR #2772) is what closes it.
Related
- TEE Mode — the node's own storage/disk KMS key and the startup attestation flow (distinct from group keys).
- Membership & Leave — general leave and eviction semantics that the fleet leave path builds on.
- Auto-Follow — how an admitted replica tracks new contexts and subgroups, and the namespace-scoped policy rationale.
- Glossary — definitions for namespace, group, role, and quote terms used above.
- Producing repos: mero-tee (the fleet sidecar and node image) and mdma (the control plane that owns entitlement and disable). Their internals are out of scope for core; only the contract surfaces named on this page are relied upon.