Successful enforcement checklist
This page is the self-serve runbook. If every item here is green, enforcement will be reliable.
What Voidburn changes in your cluster
- Creates an IRSA role + policy bound to the Sentinel service account.
- Deploys the Sentinel agent in the namespace you choose.
- Optionally creates a protected control nodegroup (recommended for GPU-only fleets).
- Targets only nodes labeled voidburn.com/target=true.
Success requirements
- Nodes Ready: all target nodes show Ready in the cluster.
- Core add-ons: vpc-cni, coredns, kube-proxy are running.
- OIDC enabled: IRSA works for the Sentinel service account.
- Agent heartbeat: cluster shows Online in Active Sectors.
- Allow-list: only nodes labeled voidburn.com/target=true are enforceable.
- Agent host: a CPU node is protected (or a control nodegroup is created).
- Protected label: agent runs on a node labeled voidburn.com/protected=true.
- ASG min/desired: terminating a node must not take the ASG below its minimum size (min < desired).
- IAM permissions: pricing:GetProducts, ec2:CreateSnapshots, ec2:CreateTags, autoscaling:TerminateInstanceInAutoScalingGroup.
- Checkpoint marker (paid tiers): strict mode blocks until marker is confirmed.
- Snapshots: EBS snapshot creation must succeed (no SCP/boundary deny).
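The ASG min/desired item can be checked from a terminal before enforcement runs. This is a minimal sketch, not a Voidburn command: asg_has_headroom and check_asg are hypothetical helper names, and you must supply the Auto Scaling group name and region yourself.

```shell
# Hypothetical helper (not part of Voidburn): succeeds only when terminating
# one instance would still leave the group at or above its minimum size.
asg_has_headroom() {
  min_size="$1"
  desired="$2"
  [ "$min_size" -lt "$desired" ]
}

# Fetch MinSize and DesiredCapacity for one group and print "ok" or "blocked".
# Returns 2 if the lookup fails or the group is not found.
check_asg() {
  asg_name="$1"
  region="$2"
  sizes="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" --region "$region" \
    --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity]' \
    --output text)" || return 2
  # --output text prints the two numbers separated by whitespace.
  set -- $sizes
  [ "$#" -eq 2 ] || return 2
  if asg_has_headroom "$1" "$2"; then
    echo "ok"
  else
    echo "blocked"
  fi
}

# Usage: check_asg <asg-name> <region>
```

A "blocked" result corresponds to the case where min equals desired, so termination would violate the ASG minimum.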
Preflight commands
# Agent can read cluster identity
kubectl auth can-i get namespaces --as=system:serviceaccount:<namespace>:sentinel-sentinel
# IRSA role annotation
kubectl -n <namespace> get sa sentinel-sentinel -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}'
# Cluster OIDC issuer
aws eks describe-cluster --name <cluster> --region <region> --query cluster.identity.oidc.issuer --output text
Allow-list targeting
Only labeled nodes are enforceable. Everything else is safe by default.
# Target a nodegroup
kubectl label nodes -l eks.amazonaws.com/nodegroup=<nodegroup> voidburn.com/target=true --overwrite
# Protect a node (agent host)
kubectl label node <node> voidburn.com/protected=true --overwrite
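After labeling, it helps to see how each node ends up classified. The sketch below is illustrative only: node_state and list_node_states are hypothetical helpers, and it assumes that a protected node is never enforced even if it also carries the target label (this precedence is inferred from the troubleshooting notes, not confirmed by Voidburn).

```shell
# Hypothetical helper: classify a node from its two Voidburn label values.
# Assumption: voidburn.com/protected=true wins over voidburn.com/target=true.
node_state() {
  target="$1"
  protected="$2"
  if [ "$protected" = "true" ]; then
    echo "protected"
  elif [ "$target" = "true" ]; then
    echo "enforceable"
  else
    echo "not-allow-listed"
  fi
}

# Print "<node> <state>" for every node in the cluster.
# kubectl prints <none> for a missing label, which falls through to
# "not-allow-listed" unless the node is protected.
list_node_states() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,TARGET:.metadata.labels.voidburn\.com/target,PROTECTED:.metadata.labels.voidburn\.com/protected' |
  while read -r name target protected; do
    echo "$name $(node_state "$target" "$protected")"
  done
}

# Usage: list_node_states
```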
Checkpoint (resumable)
Strict mode blocks termination until a fresh checkpoint marker is observed (timestamp within the checkpoint window). Your workload must checkpoint to disk (PVC/EFS/EBS) and publish the marker only after the checkpoint write succeeds.
- Workload writes checkpoint to persistent storage (on a schedule and on SIGTERM).
- After the write succeeds: update ConfigMap voidburn-checkpoint key last_checkpoint.
- Optional automation: Ops → Safety → Checkpoint confirmation → enable Checkpoint trigger and use the Voidburn receiver.
- Set Checkpoint command + secret, then Save. Sentinel auto-creates the in-cluster receiver.
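The ordering in the steps above (write first, publish the marker only on success) can be sketched as a small wrapper in the workload's entrypoint. Everything named here is an assumption for illustration: write_checkpoint is a placeholder for your application's real checkpoint logic, and CHECKPOINT_DIR and WORKLOAD_NAMESPACE are hypothetical variables. It also assumes the voidburn-checkpoint ConfigMap already exists.

```shell
# Hypothetical settings: adjust to your workload.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-/data/checkpoints}"
WORKLOAD_NAMESPACE="${WORKLOAD_NAMESPACE:-default}"

# Placeholder checkpoint writer: replace with your application's own logic.
# Writes to a temp file and renames it, so a partial write never looks complete.
write_checkpoint() {
  tmp="$CHECKPOINT_DIR/state.tmp"
  printf '%s\n' "step=${STEP:-0}" > "$tmp" && mv "$tmp" "$CHECKPOINT_DIR/state"
}

# Update the marker key with the current UTC timestamp.
publish_marker() {
  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  kubectl -n "$WORKLOAD_NAMESPACE" patch configmap voidburn-checkpoint --type merge \
    -p "{\"data\":{\"last_checkpoint\":\"$ts\"}}"
}

# Publish the marker only if the checkpoint write succeeded.
checkpoint_then_publish() {
  if write_checkpoint; then
    publish_marker
  else
    echo "checkpoint write failed; marker left unchanged" >&2
    return 1
  fi
}

# In the entrypoint, checkpoint on SIGTERM as well as on a schedule:
# trap 'checkpoint_then_publish; exit 0' TERM
```

Keeping the marker update behind the write means a failed checkpoint leaves the old timestamp in place, so strict mode keeps blocking instead of trusting a checkpoint that does not exist.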
# Marker (workload-owned)
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Ensure the ConfigMap exists (create rendered client-side, then applied)
kubectl -n <workload-namespace> create configmap voidburn-checkpoint \
  --from-literal=last_checkpoint="" \
  --dry-run=client -o yaml | kubectl apply -f -
# Publish the timestamp only after the checkpoint write has succeeded
kubectl -n <workload-namespace> patch configmap voidburn-checkpoint --type merge \
  -p "{\"data\":{\"last_checkpoint\":\"$ts\"}}"
# Receiver URL (auto-created by Sentinel when enabled)
http://voidburn-checkpoint.<agent-namespace>.svc.cluster.local:8080/voidburn
Details and RBAC: Checkpointing guide
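To estimate whether strict mode would currently see a fresh marker, you can compare last_checkpoint against a window yourself. This is a rough sketch and not Sentinel's actual logic: marker_is_fresh and check_marker are hypothetical helpers, the window length is whatever you pass in, GNU date is assumed for parsing, and treating empty or future-dated markers as stale is an assumption.

```shell
# Hypothetical helper: succeeds when the marker timestamp is within the window.
# Assumes GNU date (-d) and an ISO-8601 UTC marker like 2024-01-01T00:00:00Z.
# The optional third argument overrides "now" (epoch seconds).
marker_is_fresh() {
  marker_ts="$1"
  window_seconds="$2"
  now_epoch="${3:-$(date -u +%s)}"
  # An empty marker is never fresh.
  [ -n "$marker_ts" ] || return 1
  marker_epoch="$(date -u -d "$marker_ts" +%s 2>/dev/null)" || return 2
  age=$((now_epoch - marker_epoch))
  # Assumption: markers dated in the future are treated as not fresh.
  [ "$age" -ge 0 ] && [ "$age" -le "$window_seconds" ]
}

# Read the workload-owned marker and print "fresh" or "stale".
check_marker() {
  namespace="$1"
  window_seconds="$2"
  ts="$(kubectl -n "$namespace" get configmap voidburn-checkpoint \
    -o jsonpath='{.data.last_checkpoint}')" || return 2
  if marker_is_fresh "$ts" "$window_seconds"; then
    echo "fresh"
  else
    echo "stale"
  fi
}

# Usage: check_marker <workload-namespace> <window-seconds>
```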
Resume after termination
Voidburn stops compute. To resume, start your workload again. It resumes from its PVC/EFS checkpoint only if your app writes checkpoints to disk.
# Deployment
kubectl -n <namespace> rollout restart deploy/<name>
# StatefulSet
kubectl -n <namespace> rollout restart statefulset/<name>
# Job (re-run)
kubectl -n <namespace> delete job <name> --ignore-not-found
kubectl -n <namespace> apply -f <job.yaml>
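If you resume several workloads, the three cases can be wrapped in one function. A minimal sketch under stated assumptions: resume_workload is a hypothetical helper, and the 10-minute rollout timeout is an arbitrary value chosen for illustration.

```shell
# Hypothetical helper: restart a workload by kind.
# For Deployments and StatefulSets it also waits for the rollout to finish.
resume_workload() {
  namespace="$1"
  kind="$2"
  name="$3"
  case "$kind" in
    deploy|statefulset)
      kubectl -n "$namespace" rollout restart "$kind/$name" &&
        kubectl -n "$namespace" rollout status "$kind/$name" --timeout=10m
      ;;
    job)
      manifest="$4"  # path to the Job manifest to re-apply
      kubectl -n "$namespace" delete job "$name" --ignore-not-found &&
        kubectl -n "$namespace" apply -f "$manifest"
      ;;
    *)
      echo "unsupported kind: $kind" >&2
      return 2
      ;;
  esac
}

# Usage:
#   resume_workload <namespace> deploy <name>
#   resume_workload <namespace> job <name> <job.yaml>
```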
If enforcement stalls
- ASG min/desired blocked termination → lower min or allow decrement.
- Node is protected or not allow-listed → remove voidburn.com/protected or add voidburn.com/target=true, then retry.
- Snapshot failed → check IAM and SCP/boundary policies.
- Agent evicted → add a protected control nodegroup.