Config

Successful enforcement checklist

This page is the self-serve runbook. If every item here is green, enforcement will be reliable.

What Voidburn changes in your cluster

  • Creates an IRSA role + policy bound to the Sentinel service account.
  • Deploys the Sentinel agent in the namespace you choose.
  • Optionally creates a protected control nodegroup (recommended for GPU-only fleets).
  • Targets only nodes labeled voidburn.com/target=true.

Success requirements

  • Nodes Ready: all target nodes show Ready in the cluster.
  • Core add-ons: vpc-cni, coredns, kube-proxy are running.
  • OIDC enabled: IRSA works for the Sentinel service account.
  • Agent heartbeat: cluster shows Online in Active Sectors.
  • Allow-list: only nodes labeled voidburn.com/target=true are enforceable.
  • Agent host: a CPU node is protected (or a control nodegroup is created).
  • Protected label: agent runs on a node labeled voidburn.com/protected=true.
  • ASG min/desired: termination must not violate ASG minimum (min < desired).
  • IAM permissions: pricing:GetProducts, ec2:CreateSnapshots, ec2:CreateTags, autoscaling:TerminateInstanceInAutoScalingGroup.
  • Checkpoint marker (paid tiers): strict mode blocks until marker is confirmed.
  • Snapshots: EBS snapshot creation must succeed (no SCP/boundary deny).
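The ASG requirement above (min < desired) can be checked before enforcement runs. The following is a minimal sketch, not Voidburn's own check: the function names `asg_has_headroom` and `check_asg` are hypothetical, and it assumes the AWS CLI is configured with permission to call `autoscaling:DescribeAutoScalingGroups`.

```shell
# Sketch: confirm an ASG leaves room for one termination.
# Function names are illustrative and not part of Voidburn.

# Returns 0 when min < desired, i.e. one instance can be removed
# without dropping below the ASG minimum.
asg_has_headroom() {
  [ "$1" -lt "$2" ]
}

# Fetches MinSize and DesiredCapacity for one ASG and applies the check.
# Usage: check_asg <asg-name> <region>
check_asg() {
  asg_name="$1"
  asg_region="$2"
  asg_out="$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" --region "$asg_region" \
    --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity]' \
    --output text)" || return 2
  # Text output is "<min><TAB><desired>"; split it on whitespace.
  set -- $asg_out
  if asg_has_headroom "$1" "$2"; then
    echo "ok: $asg_name min=$1 desired=$2"
  else
    echo "blocked: $asg_name min=$1 desired=$2"
    return 1
  fi
}
```

Run `check_asg <asg-name> <region>` for each ASG behind a targeted nodegroup; a "blocked" result corresponds to the min/desired item in the list above.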

Preflight commands

# Agent can read cluster identity
kubectl auth can-i get namespaces --as=system:serviceaccount:<namespace>:sentinel-sentinel

# IRSA role annotation
kubectl -n <namespace> get sa sentinel-sentinel -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}'

# Cluster OIDC issuer
aws eks describe-cluster --name <cluster> --region <region> --query cluster.identity.oidc.issuer --output text
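IRSA generally also needs an IAM OIDC provider in the account that matches the cluster issuer. The sketch below shows one way to check for it, assuming the AWS CLI is configured; the function names `oidc_id_from_issuer` and `has_iam_oidc_provider` are hypothetical helpers, not Voidburn commands.

```shell
# Sketch: verify that an IAM OIDC provider exists for the cluster issuer.
# Function names are illustrative and not part of Voidburn.

# Prints the trailing ID segment of an issuer URL such as
# https://oidc.eks.<region>.amazonaws.com/id/<ID>.
oidc_id_from_issuer() {
  printf '%s\n' "${1##*/}"
}

# Returns 0 if some IAM OIDC provider ARN in the account contains that ID.
# Usage: has_iam_oidc_provider <cluster> <region>
has_iam_oidc_provider() {
  oidc_issuer="$(aws eks describe-cluster --name "$1" --region "$2" \
    --query cluster.identity.oidc.issuer --output text)" || return 2
  oidc_id="$(oidc_id_from_issuer "$oidc_issuer")"
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text \
    | grep -q "$oidc_id"
}
```

If `has_iam_oidc_provider <cluster> <region>` returns non-zero, the role annotation may be present while token exchange still fails, so it is worth checking both.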

Allow-list targeting

Only labeled nodes are enforceable. Everything else is safe by default.

# Target a nodegroup
kubectl label nodes -l eks.amazonaws.com/nodegroup=<nodegroup> voidburn.com/target=true --overwrite

# Protect a node (agent host)
kubectl label node <node> voidburn.com/protected=true --overwrite
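To review the result of labeling, the two labels can be combined in a selector. This is a minimal sketch with hypothetical function names; it assumes that a protected node is never enforced even when it also carries the target label, which is how the stall conditions on this page read.

```shell
# Sketch: inspect which nodes are enforceable under the allow-list.
# Function names are illustrative and not part of Voidburn.

# Decides enforceability from the two label values
# (pass an empty string when a label is absent).
# Assumption: protected wins over target.
is_enforceable() {
  [ "$1" = "true" ] && [ "$2" != "true" ]
}

# Lists nodes that carry the target label and are not protected.
list_enforceable_nodes() {
  kubectl get nodes \
    -l 'voidburn.com/target=true,voidburn.com/protected!=true' -o name
}

# Lists nodes that are both targeted and protected, which is worth
# reviewing because the two labels pull in opposite directions.
list_conflicting_nodes() {
  kubectl get nodes \
    -l 'voidburn.com/target=true,voidburn.com/protected=true' -o name
}
```

Running `list_conflicting_nodes` after labeling a nodegroup helps catch an agent host that was swept up by a broad nodegroup selector.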

Checkpoint (resumable)

Strict mode blocks termination until a fresh checkpoint marker is observed (timestamp within the checkpoint window). Your workload must checkpoint to disk (PVC/EFS/EBS) and publish the marker only after the checkpoint write succeeds.

  1. Workload writes its checkpoint to persistent storage (on a schedule and on SIGTERM).
  2. After the write succeeds: set the last_checkpoint key of ConfigMap voidburn-checkpoint to the current UTC timestamp.
  3. Optional automation: Ops → Safety → Checkpoint confirmation → enable Checkpoint trigger and use the Voidburn receiver.
  4. Set Checkpoint command + secret, then Save. Sentinel auto-creates the in-cluster receiver.

# Marker (workload-owned)
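# UTC timestamp in ISO 8601 format
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Ensure the ConfigMap exists (idempotent create via apply)
kubectl -n <workload-namespace> create configmap voidburn-checkpoint \
  --from-literal=last_checkpoint="" \
  --dry-run=client -o yaml | kubectl apply -f -
# Publish the marker; run this only after the checkpoint write has succeeded
kubectl -n <workload-namespace> patch configmap voidburn-checkpoint --type merge \
  -p "{\"data\":{\"last_checkpoint\":\"$ts\"}}"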
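To see whether a published marker would count as fresh, its timestamp can be compared against a window. The sketch below is an illustration only: the function names are hypothetical, the window length is a parameter because this page does not state the actual checkpoint window, and treating a future-dated or empty marker as stale is an assumption. It relies on GNU `date -d`; BSD/macOS `date` needs a different invocation.

```shell
# Sketch: check marker freshness against a window in seconds.
# Function names are illustrative and not part of Voidburn.

# Converts an ISO 8601 UTC timestamp (the format used for last_checkpoint)
# to epoch seconds. GNU date only.
iso_to_epoch() {
  date -u -d "$1" +%s
}

# Returns 0 if the marker is at most <window> seconds old.
# Usage: marker_is_fresh <timestamp> <window-seconds> [now-epoch]
marker_is_fresh() {
  [ -n "$1" ] || return 1                       # empty marker: not fresh
  fresh_now="${3:-$(date -u +%s)}"
  fresh_marker="$(iso_to_epoch "$1")" || return 2
  fresh_age=$(( fresh_now - fresh_marker ))
  # Assumption: markers dated in the future are rejected as well.
  [ "$fresh_age" -ge 0 ] && [ "$fresh_age" -le "$2" ]
}

# Reads last_checkpoint from the ConfigMap and applies the check.
# Usage: check_marker <workload-namespace> <window-seconds>
check_marker() {
  marker_value="$(kubectl -n "$1" get configmap voidburn-checkpoint \
    -o jsonpath='{.data.last_checkpoint}')" || return 2
  marker_is_fresh "$marker_value" "$2"
}
```

For example, `check_marker <workload-namespace> 900` tests the marker against a 15-minute window; substitute the window configured for your tier.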

# Receiver URL (auto-created by Sentinel when enabled)
http://voidburn-checkpoint.<agent-namespace>.svc.cluster.local:8080/voidburn

Details and RBAC: Checkpointing guide

Resume after termination

Voidburn only stops compute. To resume, start the workload again; it picks up from its PVC/EFS checkpoint, provided your app writes checkpoints to persistent storage and loads them on startup.

# Deployment
kubectl -n <namespace> rollout restart deploy/<name>

# StatefulSet
kubectl -n <namespace> rollout restart statefulset/<name>

# Job (re-run)
kubectl -n <namespace> delete job <name> --ignore-not-found
kubectl -n <namespace> apply -f <job.yaml>

If enforcement stalls

  • ASG min/desired blocked termination → lower min or allow decrement.
  • Node is protected or not allow-listed → remove the voidburn.com/protected label or add voidburn.com/target=true, then retry.
  • Snapshot failed → check IAM and SCP/boundary policies.
  • Agent evicted → add a protected control nodegroup.