Backup & Restore¶
This guide covers backup strategies and restore procedures for the Talos Kubernetes cluster.
What to Backup¶
| Component | Method | Frequency |
|---|---|---|
| Etcd | Talos snapshot | Daily |
| Talos version pins | Git (talos/talenv.yaml) | On change |
| Talos client config | Backup ~/.talos/config | On change |
| Kubernetes manifests | Git (Flux) | On change |
| Secrets (SOPS) | Git (encrypted) | On change |
| PVCs | Velero (CSI snapshots) | Daily |
Etcd Backup¶
Etcd contains all Kubernetes state. This is the most critical backup.
Manual Snapshot¶
# Create snapshot from any control plane node
talosctl -n 172.16.1.50 etcd snapshot db.snapshot
# Confirm the snapshot file exists and is not empty
ls -la db.snapshot
Automated Backup Script¶
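Run the following script from a cron job or another scheduler:
#!/bin/bash
# Abort on errors, unset variables, and failed pipeline stages
set -euo pipefail
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
# Snapshot from a single control plane node
NODE="172.16.1.50"
mkdir -p "$BACKUP_DIR"
talosctl -n "$NODE" etcd snapshot "$BACKUP_DIR/etcd-$DATE.snapshot"
# Delete snapshots older than 7 days
find "$BACKUP_DIR" -name "*.snapshot" -mtime +7 -delete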
Restore from Etcd Snapshot¶
Destructive Operation
Restoring from a snapshot resets the cluster to the state captured in that snapshot. Use it only for disaster recovery.
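The restore step can be sketched as follows. This is a minimal sketch, not the repo's own procedure: `restore_etcd` is a hypothetical helper name, and it assumes etcd is not running on the control plane nodes when it is called (the Talos disaster recovery guide describes how to prepare the nodes). The node IP and snapshot path match the examples above.

```shell
# Hypothetical helper: recover etcd on one control plane node from a snapshot.
# Assumption: etcd is not running on the control plane nodes at this point.
restore_etcd() (
  node="$1"
  snapshot="$2"
  # Refuse to continue if the snapshot file is missing or empty
  if [ ! -s "$snapshot" ]; then
    echo "snapshot not found or empty: $snapshot" >&2
    exit 1
  fi
  talosctl -n "$node" bootstrap --recover-from="$snapshot"
)

# Usage: restore_etcd 172.16.1.50 ./db.snapshot
```

The function body runs in a subshell, so `exit 1` only ends the helper and its variables do not leak into the calling shell.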
Talos Configuration Backup¶
Talos configuration is managed with talhelper (see task talos:*). The repo tracks version pins and cluster definition; per-node machine configs are generated into talos/clusterconfig/ (gitignored).
Critical Files¶
talos/
├── talenv.yaml # Version pinning
├── talconfig.yaml # Cluster definition (talhelper)
├── patches/ # Reusable patch snippets
└── clusterconfig/ # Generated machine configs (gitignored)
Export machine configs (optional)¶
If you want an out-of-band backup of the live machine configs, export them from Talos and store them securely (do not commit them to Git):
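One way to do this, as a sketch: `export_machine_configs` is a hypothetical helper, the output directory is a placeholder, and it assumes `talosctl get machineconfig -o yaml` returns the live config on your Talos version. Check the output before relying on it.

```shell
# Hypothetical helper: dump the live machine config of each node into a
# directory outside the Git working tree.
export_machine_configs() (
  out_dir="$1"
  shift
  umask 077   # configs contain secrets, so keep files owner-only
  mkdir -p "$out_dir" || exit 1
  for node in "$@"; do
    # Assumption: the machineconfig resource is readable via "talosctl get"
    talosctl -n "$node" get machineconfig -o yaml > "$out_dir/$node.yaml" || exit 1
  done
)

# Usage: export_machine_configs /path/to/secure/dir 172.16.1.50
```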
Kubernetes State Backup¶
All Kubernetes manifests are in Git and managed by Flux.
Full Restore from Git¶
Bootstrap the base cluster components and Flux (operator and instance) with the repo's bootstrap tooling (see ./scripts/bootstrap-apps.sh and bootstrap/helmfile.d/).
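After the bootstrap script finishes, it can help to confirm that Flux is healthy and reconciling. A minimal sketch, assuming the `flux` CLI is installed and the commands run from the repository root; `restore_from_git` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run the repo bootstrap script, then check that Flux
# is healthy and list the state of its Kustomizations.
restore_from_git() {
  ./scripts/bootstrap-apps.sh &&
    flux check &&
    flux get kustomizations --all-namespaces
}

# Usage (from the repository root): restore_from_git
```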
Secrets Recovery¶
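Secrets are encrypted with SOPS. To decrypt them, point SOPS at the age private key:
# Set the age key
export SOPS_AGE_KEY_FILE=/path/to/age.key
# Decrypt secrets one file at a time (the glob can match several files)
for f in kubernetes/apps/*/secret.sops.yaml; do
  sops -d "$f"
done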
PVC Backup with Velero¶
Velero is deployed in this cluster. It backs up to an in-cluster S3-compatible store (MinIO) and takes CSI snapshots of PVCs.
Create Backup¶
# Back up a single namespace
velero backup create my-backup --include-namespaces default
# Back up the entire cluster
velero backup create full-backup
Restore¶
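A restore from one of the backups created above might look like the following sketch. `restore_backup` is a hypothetical helper name; the backup name `my-backup` comes from the Create Backup example.

```shell
# Hypothetical helper: restore from a named Velero backup and wait for it.
restore_backup() (
  backup="$1"
  # Fail early if the backup does not exist
  velero backup get "$backup" > /dev/null || exit 1
  velero restore create --from-backup "$backup" --wait
)

# Usage: restore_backup my-backup
```

Because the backup store runs inside the cluster, this only helps while that store is still reachable.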
Disaster Recovery Procedures¶
Scenario 1: Single Node Failure¶
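The steps for this scenario are not listed here. As a rough sketch under stated assumptions: the failed node is replaced by a fresh machine booted from the Talos ISO, and its generated machine config is available under talos/clusterconfig/. `replace_node` and its arguments are hypothetical. For a failed control plane node, the stale etcd member would likely also need removing (`talosctl etcd remove-member`), which this sketch does not cover.

```shell
# Hypothetical helper: drop the failed node from Kubernetes, then apply the
# generated machine config to its replacement (still in maintenance mode).
replace_node() (
  old_name="$1"   # Kubernetes node name of the failed node
  node_ip="$2"    # IP of the replacement node booted from the Talos ISO
  config="$3"     # generated machine config file for that node
  kubectl delete node "$old_name" --ignore-not-found || exit 1
  talosctl apply-config --insecure -n "$node_ip" --file "$config"
)

# Usage: replace_node <old-node-name> <node-ip> talos/clusterconfig/<node-config>.yaml
```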
Scenario 2: Complete Cluster Loss¶
- Provision new VMs with Talos ISO
- Apply Talos configs from backup
- Bootstrap etcd from snapshot (or fresh if no snapshot)
- Install Flux to restore workloads
# Fresh bootstrap (to restore from a snapshot instead, add --recover-from=./db.snapshot)
talosctl -n 172.16.1.50 bootstrap
# Get kubeconfig
talosctl -n 172.16.1.50 kubeconfig
# Bootstrap base apps + Flux (operator/instance)
./scripts/bootstrap-apps.sh
Scenario 3: Corrupted Etcd¶
# Stop etcd on all control planes
talosctl -n 172.16.1.50,172.16.1.51,172.16.1.52 service etcd stop
# Restore from snapshot on the first node
talosctl -n 172.16.1.50 bootstrap --recover-from=./db.snapshot
# The other control plane nodes rejoin automatically
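Once the recovery bootstrap has run, a quick check of etcd membership and node status shows whether the cluster came back. A minimal sketch; `verify_etcd_recovery` is a hypothetical helper name, and it assumes a working kubeconfig.

```shell
# Hypothetical helper: show etcd members and service state on one control
# plane node, then list the Kubernetes nodes.
verify_etcd_recovery() {
  talosctl -n "$1" etcd members &&
    talosctl -n "$1" service etcd &&
    kubectl get nodes
}

# Usage: verify_etcd_recovery 172.16.1.50
```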
Backup Checklist¶
- [ ] Age private key stored in password manager
- [ ] Etcd snapshots automated (daily)
- [ ] Git repository has all manifests
- [ ] SOPS-encrypted secrets committed
- [ ] Tested restore procedure
- [ ] Offsite backup copy (3-2-1 rule)
Testing Backups¶
Regularly test your backups:
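A cheap first check is that a recent, non-empty etcd snapshot exists at all. The sketch below assumes the `/backups/etcd` directory and `*.snapshot` naming from the automated backup script; `check_latest_snapshot` is a hypothetical helper. It confirms freshness only, so it does not replace an actual restore into a scratch cluster.

```shell
# Hypothetical check: succeed only if the backup directory holds a non-empty
# snapshot modified within the last 24 hours.
check_latest_snapshot() (
  backup_dir="${1:-/backups/etcd}"
  # -mtime -1: modified less than a day ago; -size +0: not empty
  recent=$(find "$backup_dir" -type f -name "*.snapshot" -mtime -1 -size +0 | head -n 1)
  if [ -z "$recent" ]; then
    echo "no fresh etcd snapshot in $backup_dir" >&2
    exit 1
  fi
  echo "fresh snapshot: $recent"
)

# Usage: check_latest_snapshot /backups/etcd
```

Running this from the same scheduler as the backup script, and alerting on a non-zero exit code, would catch a silently failing backup job.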