Operational Runbooks
Step-by-step procedures for deploying, verifying, and rolling out KMS and node images
Phala KMS Runbook
Deploy and verify mero-kms-phala on Phala Cloud CVMs. Covers release verification, digest-pinned deployment, merod attestation configuration, and runtime health checks.
Prerequisites
- GitHub release access to calimero-network/mero-tee
- Phala Cloud account with CVM provisioning permissions
- CLI tools: jq, curl, sha256sum, docker
- Local clone of the mero-tee repository with verify scripts
- Target KMS version tag (e.g. v0.3.0)
- Target profile: debug, debug-read-only, or locked-read-only
Step 1 — Verify Release Assets
Before deploying anything, verify the integrity and completeness of the release.
Run the verification script
./scripts/release/verify-release-assets.sh v0.3.0
This script downloads the release artifacts, checks SHA-256 checksums, verifies cosign signatures, validates the SBOM, and confirms that the attestation policy JSON is well-formed.
Confirm expected artifacts
The release must include:
- kms-phala-attestation-policy.{profile}.json — attestation policy for the target profile
- checksums.txt — SHA-256 hashes of all artifacts
- checksums.txt.sig — cosign signature over checksums
- sbom.spdx.json — software bill of materials
- Docker image digest in the release notes
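As a rough local cross-check of the list above, the sketch below confirms that the expected files exist in a download directory and that checksums.txt validates. The function name check_release_dir and the single-directory layout are assumptions for illustration. It does not verify the cosign signature or the SBOM; the repository's verification script remains the authoritative check.

```bash
# Hypothetical helper: check that a downloaded release directory is complete.
# Usage: check_release_dir <dir> <profile>
check_release_dir() {
  local dir="$1" profile="$2" f
  for f in \
    "kms-phala-attestation-policy.${profile}.json" \
    checksums.txt \
    checksums.txt.sig \
    sbom.spdx.json
  do
    # -s: the file must exist and be non-empty.
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${f}" >&2
      return 1
    fi
  done
  # Recompute the SHA-256 hash of every file listed in checksums.txt.
  (cd "$dir" && sha256sum -c checksums.txt >/dev/null 2>&1) || {
    echo "checksum mismatch in ${dir}" >&2
    return 1
  }
}
```

For example, check_release_dir ./release-v0.3.0 locked-read-only returns non-zero if any listed file is missing or any hash differs.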
Record the image digest
Extract the digest-pinned image reference from the release notes. You will use this in the next step. Format: ghcr.io/calimero-network/mero-kms-phala@sha256:abc123...
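A small guard can reject tag-based references before they reach a Compose file. This is a minimal sketch; is_digest_pinned is a hypothetical helper name, and it only checks the shape of the reference (a 64-character lowercase hex SHA-256 digest), not that the digest exists in the registry.

```bash
# Hypothetical helper: succeed only for digest-pinned image references.
is_digest_pinned() {
  # A pinned reference ends in "@sha256:" followed by exactly 64 hex characters.
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}
```

A deployment script could call is_digest_pinned "$IMAGE_REF" || exit 1 before rendering the Compose file, where IMAGE_REF is whatever variable holds the reference copied from the release notes.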
Step 2 — Deploy Digest-Pinned Image
Deploy using the exact digest from the verified release. Never use mutable tags in production.
Docker Compose configuration
version: "3.8"
services:
  kms:
    image: ghcr.io/calimero-network/mero-kms-phala@sha256:<DIGEST>
    environment:
      - MERO_KMS_VERSION=v0.3.0
      - MERO_KMS_PROFILE=locked-read-only
      - DSTACK_SOCKET_PATH=/var/run/dstack.sock
      - ENFORCE_MEASUREMENT_POLICY=true
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock
    ports:
      - "8080:8080"
Required environment variables
Deploy to Phala Cloud
Upload the Compose file to the Phala Cloud CVM dashboard or deploy via CLI. Wait for the container to reach Running state.
Step 3 — Configure merod Attestation
After KMS is running, configure merod nodes to use this KMS for attestation and key release.
Generate merod KMS configuration
./scripts/generate-merod-kms-config.sh \
  --kms-url https://<kms-host>:8080 \
  --profile locked-read-only \
  --output merod-kms.json
Apply configuration to merod
./scripts/apply-merod-kms-config.sh \
  --config merod-kms.json \
  --node <node-id>
Profile compatibility
The KMS enforces profile-scoped policy. A KMS instance configured with MERO_KMS_PROFILE=locked-read-only will only release keys to nodes whose RTMR3 measurement matches the locked-read-only profile. This ensures cohort separation: debug nodes cannot obtain keys from a production KMS.
Step 4 — Runtime Checks
Verify the deployed KMS is healthy and producing valid attestation quotes.
Health check
curl -fsS https://<kms-host>:8080/health
# Expected: {"status":"ok","version":"v0.3.0","profile":"locked-read-only"}
Attestation endpoint
# Returns TDX quote from the KMS enclave for mutual verification
Verify the returned quote against expected measurements. The quote should contain MRTD and RTMR values matching the published policy for this version.
Quote verification
./scripts/verify-kms-quote.sh --url https://<kms-host>:8080
The script sends a POST /attest request, extracts the TDX quote from the response, and submits it to Intel Trust Authority (ITA) for appraisal. A successful appraisal confirms that the KMS is running the expected code inside a genuine TEE.
Common Mistakes
- Setting MERO_KMS_PROFILE to a profile that doesn’t match the node cohort. Nodes will fail key release with a policy mismatch error. Always confirm that the KMS and the nodes it serves use the same profile.
- Deploying a new KMS binary but keeping an old MERO_KMS_VERSION tag. The fetched policy won’t include measurements for the new binary. Always bump the version tag alongside the image.
- Omitting MERO_KMS_VERSION or MERO_KMS_PROFILE. The KMS fails at startup because both variables are required and have no defaults.
- Using :latest or :v0.3.0 instead of a digest-pinned reference. Tags can be overwritten; digests are immutable. Always use @sha256:....
- Forgetting to volume-mount /var/run/dstack.sock into the container. Without it, key derivation and quote generation fail with a connection error.
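Several of these mistakes can be caught before deployment with a small pre-flight check. The sketch below is an assumption-laden illustration: validate_kms_env is a hypothetical helper, and it only inspects the variables named in this runbook. The socket check is a warning rather than a failure because the socket exists only on the CVM host.

```bash
# Hypothetical pre-deploy check for the variables this runbook marks as required.
validate_kms_env() {
  if [ -z "${MERO_KMS_VERSION:-}" ]; then
    echo "MERO_KMS_VERSION is not set" >&2
    return 1
  fi
  case "${MERO_KMS_PROFILE:-}" in
    debug|debug-read-only|locked-read-only) ;;
    "") echo "MERO_KMS_PROFILE is not set" >&2; return 1 ;;
    *)  echo "unknown profile: ${MERO_KMS_PROFILE}" >&2; return 1 ;;
  esac
  # Warn (do not fail) if the dstack socket is absent on this machine.
  local sock="${DSTACK_SOCKET_PATH:-/var/run/dstack.sock}"
  if [ ! -S "$sock" ]; then
    echo "warning: no socket at ${sock} on this host" >&2
  fi
  return 0
}
```

Running it in the same shell that exports the Compose variables gives an early, readable error instead of a container that fails at startup.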
GCP Node Image Runbook
Deploy and verify Calimero node images on GCP TDX Confidential VMs. Covers release verification, image pinning, runtime measurement validation, and MDMA comparison.
What Is Released
Each mero-tee-vX.Y.Z release includes the following artifacts:
Step 1 — Verify Release Assets
Run the verification script
The script validates checksums, signatures, and SBOM structure, and confirms that published-mrtds.json is well-formed and contains entries for all three profiles.
Inspect the MRTD file
# Expect: per-profile MRTD + RTMR0-3 values
Each profile entry contains the expected mrtd, rtmr0, rtmr1, rtmr2, and rtmr3 hex values. These are the measurements your deployed VMs must match.
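The completeness check described above could be approximated with jq. The file layout used here is an assumption, not something the release documents confirm: a top-level object keyed by profile name, each value holding the five lowercase field names. Adjust the jq path if the real file nests entries differently. check_mrtd_file is a hypothetical helper name.

```bash
# Assumed layout (hypothetical):
# { "<profile>": { "mrtd": "...", "rtmr0": "...", "rtmr1": "...",
#                  "rtmr2": "...", "rtmr3": "..." }, ... }
check_mrtd_file() {
  local file="$1" profile
  for profile in debug debug-read-only locked-read-only; do
    # jq -e exits non-zero when the final expression is false or null.
    jq -e --arg p "$profile" '
      .[$p]
      | [.mrtd, .rtmr0, .rtmr1, .rtmr2, .rtmr3]
      | all(type == "string" and length > 0)
    ' "$file" >/dev/null || {
      echo "incomplete entry for profile: ${profile}" >&2
      return 1
    }
  done
}
```

The function returns non-zero as soon as a profile is missing or has an empty measurement, which makes it usable as a gate in a script.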
Step 2 — Deploy Pinned Image
GCP instance with image family/version pin
gcloud compute instances create merod-prod-01 \
  --zone=us-central1-a \
  --machine-type=c3-standard-4 \
  --confidential-compute-type=TDX \
  --image-family=mero-tee-locked-read-only \
  --image-project=calimero-prod \
  --metadata=MERO_TEE_VERSION=v0.3.0
For maximum reproducibility, pin to a specific image version instead of family:
  --image=<image-name> \
  --image-project=calimero-prod
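One way to obtain a concrete image name to pin is to ask GCP what the family resolves to at planning time and record the answer. The sketch below wraps gcloud compute images describe-from-family; resolve_family_image is a hypothetical helper name, and the family and project values come from the example above.

```bash
# Print the concrete image name that an image family currently resolves to.
resolve_family_image() {
  local family="$1" project="$2"
  gcloud compute images describe-from-family "$family" \
    --project="$project" \
    --format='value(name)'
}
```

For instance, IMAGE=$(resolve_family_image mero-tee-locked-read-only calimero-prod) captures the name once, and the same value can then be passed as --image="$IMAGE" at deployment time so the planned and deployed images are identical.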
Baked merod binary
The GCP image includes the merod binary baked at build time by the Packer + Ansible pipeline. The binary version is fixed and cannot be changed without rebuilding the image. This ensures the binary hash is part of the measured MRTD.
MERO_TEE_VERSION metadata
The MERO_TEE_VERSION GCP instance metadata key is optional but recommended. It allows operators and automation to identify which release an instance was created from without inspecting the image itself. It has no effect on measurements or attestation.
Step 3 — Verify Runtime Measurements
Extract runtime measurements
gcloud compute ssh merod-prod-01 -- \
  "cat /sys/kernel/security/tdx/measurements"
Compare against published-mrtds.json
./scripts/verify-runtime-mrtds.sh \
  --instance merod-prod-01 \
  --expected published-mrtds.json \
  --profile locked-read-only
The script compares MRTD and all four RTMR registers. Any mismatch indicates the VM is not running the expected code.
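The comparison itself is simple to sketch. The example below assumes both the expected and the runtime measurements have already been written as flat JSON objects with the keys mrtd and rtmr0 through rtmr3. That format is an assumption for illustration; the real script's input handling is not shown in this runbook. compare_measurements is a hypothetical helper name.

```bash
# Compare five measurement registers between two flat JSON files.
# Returns 0 on full match, 1 on any mismatch, 2 if jq cannot read a file.
compare_measurements() {
  local expected="$1" actual="$2" reg want got status=0
  for reg in mrtd rtmr0 rtmr1 rtmr2 rtmr3; do
    # Hex strings are compared case-insensitively.
    want=$(jq -r --arg r "$reg" '.[$r] // "" | ascii_downcase' "$expected") || return 2
    got=$(jq -r --arg r "$reg" '.[$r] // "" | ascii_downcase' "$actual") || return 2
    if [ -z "$want" ] || [ "$want" != "$got" ]; then
      echo "MISMATCH ${reg}: expected=${want:-<none>} actual=${got:-<none>}" >&2
      status=1
    fi
  done
  return "$status"
}
```

Reporting every mismatched register, rather than stopping at the first, helps distinguish a wrong image (MRTD differs) from a wrong profile (typically only RTMR3 differs, per the profile notes in this document).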
MDMA vs release probe comparison
The MDMA (Measurement Data Management Agent) collects measurements from running instances. Compare MDMA-reported values against the release probe values stored in published-mrtds.json. Discrepancies indicate either a different image or unexpected runtime modification.
./scripts/compare-mdma-vs-release.sh \
  --instance merod-prod-01 \
  --release mero-tee-v0.3.0
Interaction with core tee-mode
The calimero/core merod binary reads the --tee-mode flag at startup. When set, merod performs a KMS challenge-response to obtain its storage encryption key. The node image profile must match the KMS profile for the key release to succeed. The core tee-mode is configured in the calimero/core repository and is independent of the image build.
Common Mistakes
- Deploying with --image-family=mero-tee-debug when you intended mero-tee-locked-read-only. The RTMR3 measurement will differ, and the production KMS will reject the node.
- Mismatching the node image profile and the KMS MERO_KMS_PROFILE. The two must agree: a debug node connecting to a locked-read-only KMS will fail key release with a policy mismatch.
- Verifying against a published-mrtds.json from an older release. Always download the MRTD file corresponding to the exact image version you deployed.
- Relying on an image family for reproducible deployments. Image families resolve to the latest version, so if a new image is published between planning and deployment, the resolved image may differ. Use an explicit --image= instead.
- Omitting --confidential-compute-type=TDX. This creates a standard VM without TDX; the node will not produce valid attestation quotes and will fail KMS challenges.
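The last mistake can be detected after the fact by inspecting the instance. This sketch reads the confidentialInstanceConfig.confidentialInstanceType field from the Compute Engine instance resource; instance_is_tdx is a hypothetical helper name, and the instance and zone values are the examples used earlier in this runbook.

```bash
# Succeed only if the instance was created as a TDX Confidential VM.
instance_is_tdx() {
  local instance="$1" zone="$2" ctype
  ctype=$(gcloud compute instances describe "$instance" \
    --zone="$zone" \
    --format='value(confidentialInstanceConfig.confidentialInstanceType)') || return 1
  [ "$ctype" = "TDX" ]
}
```

Calling instance_is_tdx merod-prod-01 us-central1-a before running the measurement checks avoids chasing attestation failures on a VM that was never confidential in the first place.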
Blue-Green KMS Rollout
Safe, gated rollout of a new KMS version paired with updated node images. Six decision gates check correctness at each stage, and each gate has a rollback branch.
Scope & Inputs
This runbook applies when rolling out a new KMS version that may also require new node images with updated measurements.
Required Inputs
Decision Tree (6 Gates)
Each gate is a go/no-go checkpoint. Failing a gate triggers the associated rollback procedure.
D1: Verify KMS Release Assets
Run verify-release-assets.sh on the new KMS tag. Confirm checksums, signatures, SBOM, and policy files are present and valid.
D2: Verify Node Image Release Assets
Run verify-release-assets.sh on the node-image tag. Confirm published-mrtds.json, checksums, and provenance are valid.
D3: Deploy Staging KMS
Deploy the new KMS image to the staging environment using the digest-pinned reference. Configure with MERO_KMS_VERSION set to the new tag and the target profile.
D4: Probe Staging KMS
Run the staging probe workflow (kms-phala-staging-probe.yaml). Verify attestation quotes, extract measurements, and compare against the expected policy. Confirm a staging node can complete the full challenge → get-key flow.
D5: Promote Policy
Open a PR promoting the staging probe’s measurement candidates to the release policy files. Require two approvals from the security team. Merge only after review.
D6: Deploy Production
Deploy the new KMS to production with the merged policy. Perform a production probe: health check, attestation verification, and a canary key-release from a production node. Monitor for 30 minutes.
Per-Gate Runbook Steps
D1 & D2 — Asset Verification
Download release artifacts
Use gh release download to fetch all assets for both the KMS and node-image tags into separate directories.
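A minimal sketch of that download step, using the gh CLI's --repo and --dir options. download_release is a hypothetical wrapper, and the directory names in the usage note are placeholders rather than a layout the verification script is known to require.

```bash
# Download every asset of one release tag into its own directory.
download_release() {
  local tag="$1" dir="$2"
  mkdir -p "$dir"
  gh release download "$tag" \
    --repo calimero-network/mero-tee \
    --dir "$dir"
}
```

Calling it once per tag, for example download_release "$KMS_TAG" release/kms and download_release "$NODE_IMAGE_TAG" release/node-image, keeps the two artifact sets separate so each can be verified independently.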
Verify checksums and signatures
Run verify-release-assets.sh in each directory. The script exits non-zero on any failure.
Spot-check policy contents
Open the attestation policy JSON and confirm the target profile entry exists with non-empty MRTD and RTMR values.
D3 — Staging Deploy
Deploy staging Compose
Apply the Docker Compose file from the Phala KMS runbook, pointed at the staging CVM. Use the new digest-pinned image.
Confirm health
Poll GET /health until it returns the expected version and profile. If it has not done so within 120 seconds, treat D3 as failed.
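The polling loop might look like the following. It is a sketch under stated assumptions: the response fields (status, version, profile) are taken from the expected health output shown in the Phala KMS runbook, while wait_for_health, the 5-second interval, and the per-request timeout are illustrative choices.

```bash
# Poll <url>/health until it reports the expected version and profile.
# Usage: wait_for_health <base-url> <version> <profile> [timeout_s] [interval_s]
wait_for_health() {
  local url="$1" version="$2" profile="$3"
  local timeout="${4:-120}" interval="${5:-5}"
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # jq -e exits non-zero unless all three fields match.
    if curl -fsS --max-time 5 "${url}/health" 2>/dev/null \
      | jq -e --arg v "$version" --arg p "$profile" \
          '.status == "ok" and .version == $v and .profile == $p' \
          >/dev/null 2>&1
    then
      return 0
    fi
    sleep "$interval"
  done
  echo "timed out waiting for ${url}/health" >&2
  return 1
}
```

Matching on version and profile, not just on an HTTP 200, guards against the case where the old container is still answering while the new one starts.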
D4 — Staging Probe
Trigger probe workflow
Run kms-phala-staging-probe.yaml via GitHub Actions or manually. The probe deploys a test node, performs a full challenge → get-key cycle, and extracts measurement candidates.
Validate probe output
Confirm the probe outputs match the published MRTDs. Any discrepancy blocks progression.
D5 — Policy Promotion
Create policy PR
The probe can auto-generate a PR updating kms-phala-attestation-policy.{profile}.json. If manual, copy measurement candidates into the policy file and open the PR.
Review and merge
Two security-team approvals required. Verify the diff contains only expected measurement changes.
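Reviewers can get a mechanical second opinion on "only measurement changes" by stripping measurement fields from both versions of the policy file and comparing what remains. The sketch below assumes measurements are stored under keys named mrtd and rtmr0 through rtmr3 at any depth; the actual policy schema is not shown in this runbook, so treat the key names as hypothetical. It requires jq 1.6 or later for walk.

```bash
# Succeed if two policy files are identical once measurement keys are removed.
only_measurements_changed() {
  local old="$1" new="$2" a b
  # Remove the assumed measurement keys from every object, at any depth.
  local strip='walk(if type == "object"
                    then del(.mrtd, .rtmr0, .rtmr1, .rtmr2, .rtmr3)
                    else . end)'
  a=$(jq -S "$strip" "$old") || return 1
  b=$(jq -S "$strip" "$new") || return 1
  [ "$a" = "$b" ]
}
```

Under these assumptions, a PR that adds or removes whole entries, or touches any non-measurement field, makes the function fail, which is a prompt for closer human review rather than a verdict.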
D6 — Production Deploy
Deploy to production
Update the production Compose file with the new digest and version. Perform a rolling restart of the KMS container.
Canary key-release
Verify at least one production node can complete a challenge → get-key cycle with the new KMS.
Monitor (30 min soak)
Watch health endpoint, error rates, and key-release latency. No elevated errors for 30 minutes confirms stable rollout.
Rollback Procedures
Each gate has a defined rollback path. Rollbacks are safe because the old KMS and policy remain in place until D6 completes.
D1/D2 Rollback: No deployment has occurred. Simply abort the rollout and investigate the release artifacts.
D3 Rollback: Tear down the staging KMS container. No production impact.
D4 Rollback: Tear down the staging KMS. Discard measurement candidates. Re-run the probe after fixing the issue.
D5 Rollback: Close the policy PR without merging. The existing production policy remains unchanged.
D6 Rollback: Revert the production Compose file to the previous digest-pinned image. Restart the KMS container. Revert the policy file PR if it was the cause. Verify production nodes can still obtain keys.
Guardrails
- Never skip verification: D1 and D2 must pass before any deployment. Skipping them risks deploying tampered artifacts.
- Always probe staging first: D4 catches measurement mismatches, configuration errors, and policy incompatibilities in a safe environment.
- Policy promotion requires review: D5 is a human gate. Automated promotion is blocked; the PR requires manual security review.
- Production is last: D6 only executes after all other gates pass. Production KMS stays on the old version until the final step.
- Rollback before retry: If a gate fails after partial deployment, complete the rollback before re-attempting.
- Version coupling: If the KMS and node image are released together, both must pass verification (D1 + D2) before either is deployed.