
Successful enforcement checklist

This page is the self-serve runbook. When every requirement below is met, enforcement runs reliably.

What Voidburn changes in your cluster

Creates an IRSA role + policy bound to the Sentinel service account.
Deploys the Sentinel agent in the namespace you choose.
Optionally creates a protected control nodegroup (recommended for GPU-only fleets).
Targets only nodes labeled voidburn.com/target=true.

Success requirements

Nodes Ready: all target nodes show Ready in the cluster.
Core add-ons: vpc-cni, coredns, kube-proxy are running.
OIDC enabled: IRSA works for the Sentinel service account.
Agent heartbeat: cluster shows Online in Active Sectors.
Allow-list: only nodes labeled voidburn.com/target=true are enforceable.
Agent host: a CPU node is protected (or a control nodegroup is created).
Protected label: agent runs on a node labeled voidburn.com/protected=true.
ASG min/desired: termination must not take the ASG below its minimum size (requires min < desired).
IAM permissions: pricing:GetProducts, ec2:CreateSnapshots, ec2:CreateTags, autoscaling:TerminateInstanceInAutoScalingGroup.
Checkpoint marker (paid tiers): strict mode blocks until marker is confirmed.
Snapshots: EBS snapshot creation must succeed (no SCP/boundary deny).
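
The ASG min/desired requirement can be checked before enforcement runs. A minimal sketch, assuming you know the Auto Scaling group behind the nodegroup; the helper name asg_has_headroom is illustrative and not part of Voidburn.

```shell
# asg_has_headroom MIN DESIRED
# Succeeds when the group has room to lose one instance without dropping
# below its minimum size, i.e. when min < desired.
asg_has_headroom() {
  [ "$1" -lt "$2" ]
}

# Example wiring (not run here); <asg-name> and <region> are placeholders:
#   set -- $(aws autoscaling describe-auto-scaling-groups \
#       --auto-scaling-group-names <asg-name> --region <region> \
#       --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity]' --output text)
#   asg_has_headroom "$1" "$2" || echo "termination would violate the ASG minimum"
```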

Preflight commands

# Agent can read cluster identity
kubectl auth can-i get namespaces --as=system:serviceaccount:<namespace>:sentinel-sentinel

# IRSA role annotation
kubectl -n <namespace> get sa sentinel-sentinel -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}'

# Cluster OIDC issuer
aws eks describe-cluster --name <cluster> --region <region> --query cluster.identity.oidc.issuer --output text
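
IRSA only works if an IAM OIDC provider exists for that issuer. A small sketch that derives the provider ARN to look for, assuming the standard aws partition; the helper name oidc_provider_arn is illustrative, and <account-id> and <issuer-url> are placeholders.

```shell
# oidc_provider_arn ACCOUNT_ID ISSUER_URL
# Prints the IAM OIDC provider ARN that corresponds to an EKS issuer URL.
oidc_provider_arn() {
  account_id="$1"
  issuer="${2#https://}"   # the ARN uses the issuer without its scheme
  printf 'arn:aws:iam::%s:oidc-provider/%s\n' "$account_id" "$issuer"
}

# Example wiring (not run here):
#   aws iam list-open-id-connect-providers \
#       --query 'OpenIDConnectProviderList[].Arn' --output text \
#     | tr '\t' '\n' \
#     | grep -Fx "$(oidc_provider_arn <account-id> <issuer-url>)" \
#     || echo "no IAM OIDC provider registered for this cluster"
```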

Allow-list targeting

Only labeled nodes are enforceable. Everything else is safe by default.

# Target a nodegroup
kubectl label nodes -l eks.amazonaws.com/nodegroup=<nodegroup> voidburn.com/target=true --overwrite

# Protect a node (agent host)
kubectl label node <node> voidburn.com/protected=true --overwrite
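
To review the result of labeling, the two labels can be listed per node and classified. This is a sketch only: the helper name classify_nodes is illustrative, and treating a node that carries both labels as protected is an assumption based on the troubleshooting notes, not a documented rule.

```shell
# classify_nodes reads "NAME TARGET PROTECTED" rows on stdin, in the shape
# produced by the custom-columns command below (a missing label prints <none>).
classify_nodes() {
  awk '{
    if ($3 == "true")      state = "protected"    # assumption: protected wins
    else if ($2 == "true") state = "enforceable"
    else                   state = "ignored"
    print $1, state
  }'
}

# Example wiring (not run here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,TARGET:.metadata.labels.voidburn\.com/target,PROTECTED:.metadata.labels.voidburn\.com/protected' \
#     | classify_nodes
```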

Checkpoint (resumable)

Strict mode blocks termination until a fresh checkpoint marker is observed (timestamp within the checkpoint window). Your workload must checkpoint to disk (PVC/EFS/EBS) and publish the marker only after the checkpoint write succeeds.

Workload writes a checkpoint to persistent storage (on a schedule and on SIGTERM).
After the write succeeds: update ConfigMap voidburn-checkpoint key last_checkpoint.
Optional automation: Ops → Safety → Checkpoint confirmation → enable Checkpoint trigger and use the Voidburn receiver.
Set Checkpoint command + secret, then Save. Sentinel auto-creates the in-cluster receiver.
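
The freshness condition can be checked by hand. A rough sketch, assuming GNU date (for the -d option) and a window length you supply yourself; the helper name marker_is_fresh and the 900-second value in the example are hypothetical, and this is not Sentinel's own logic.

```shell
# marker_is_fresh TIMESTAMP WINDOW_SECONDS [NOW_EPOCH]
# Succeeds when TIMESTAMP (UTC, e.g. 2024-01-01T00:00:00Z) is at most
# WINDOW_SECONDS old. Returns 2 when the timestamp is empty or unparsable.
# Assumption: a marker dated in the future is treated as not fresh.
marker_is_fresh() {
  [ -n "$1" ] || return 2
  marker_epoch="$(date -u -d "$1" +%s)" || return 2
  now_epoch="${3:-$(date -u +%s)}"
  age=$(( now_epoch - marker_epoch ))
  [ "$age" -ge 0 ] && [ "$age" -le "$2" ]
}

# Example wiring (not run here):
#   ts="$(kubectl -n <workload-namespace> get configmap voidburn-checkpoint \
#           -o jsonpath='{.data.last_checkpoint}')"
#   marker_is_fresh "$ts" 900 || echo "marker is missing or stale"
```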

# Marker (workload-owned)
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
kubectl -n <workload-namespace> create configmap voidburn-checkpoint \
  --from-literal=last_checkpoint="" \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl -n <workload-namespace> patch configmap voidburn-checkpoint --type merge \
  -p "{\"data\":{\"last_checkpoint\":\"$ts\"}}"
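
The marker commands above should run only after the checkpoint write has succeeded. A minimal sketch of that ordering, where write_checkpoint and publish_marker are hypothetical hooks you define yourself (publish_marker could wrap the kubectl patch above); neither is part of Voidburn.

```shell
# checkpoint_then_publish
# Calls write_checkpoint, and only if it succeeds passes a fresh UTC
# timestamp to publish_marker. On failure the marker is left untouched.
# write_checkpoint and publish_marker are hypothetical, workload-supplied hooks.
checkpoint_then_publish() {
  if write_checkpoint; then
    publish_marker "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  else
    echo "checkpoint write failed; marker left unchanged" >&2
    return 1
  fi
}
```

Call checkpoint_then_publish from both the scheduled checkpoint path and the SIGTERM handler so that a failed write never produces a fresh marker.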

# Receiver URL (auto-created by Sentinel when enabled)
http://voidburn-checkpoint.<agent-namespace>.svc.cluster.local:8080/voidburn

Details and RBAC: Checkpointing guide

Resume after termination

Voidburn only stops compute. To resume, start your workload again. It resumes from its PVC/EFS checkpoint only if your app writes checkpoints to persistent storage and loads them on startup.

# Deployment
kubectl -n <namespace> rollout restart deploy/<name>

# StatefulSet
kubectl -n <namespace> rollout restart statefulset/<name>

# Job (re-run)
kubectl -n <namespace> delete job <name> --ignore-not-found
kubectl -n <namespace> apply -f <job.yaml>

If enforcement stalls

ASG min/desired blocked termination → lower min or allow decrement.
Node is protected or not allow-listed → remove the voidburn.com/protected label or add voidburn.com/target=true, then retry.
Snapshot failed → check IAM and SCP/boundary policies.
Agent evicted → add a protected control nodegroup.
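
For the IAM check, the required actions can be run through the IAM policy simulator. This is a sketch, not a definitive test: the helper name denied_actions is illustrative, <role-arn> is the IRSA role from the service account annotation, and a simulation may not reflect every organization-level control, so a real snapshot attempt remains the final check.

```shell
# denied_actions reads "ACTION<whitespace>DECISION" rows on stdin and prints
# every action whose decision is anything other than "allowed".
denied_actions() {
  awk '$2 != "allowed" { print $1 }'
}

# Example wiring (not run here):
#   aws iam simulate-principal-policy \
#       --policy-source-arn <role-arn> \
#       --action-names pricing:GetProducts ec2:CreateSnapshots ec2:CreateTags \
#                      autoscaling:TerminateInstanceInAutoScalingGroup \
#       --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text \
#     | denied_actions
```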