Backup & Restore
This guide covers backup and restore procedures for the Talos Kubernetes cluster.
What to Back Up
| Component | Method | Frequency |
|---|---|---|
| Etcd | Talos snapshot | Daily |
| Talos configs | Git (talhelper) | On change |
| Kubernetes manifests | Git (Flux) | On change |
| Secrets (SOPS) | Git (encrypted) | On change |
| PVCs | Velero/Restic | Daily |
Etcd Backup
Etcd holds all Kubernetes cluster state, which makes its snapshot the most critical backup.
Manual Snapshot
# Create snapshot from any control plane node
talosctl -n 172.16.1.50 etcd snapshot db.snapshot
# Verify snapshot
ls -la db.snapshot
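Listing the file only shows that it exists. A slightly stronger check, sketched below, confirms that the snapshot is non-empty and prints a checksum that can be stored next to it. The function name `verify_snapshot` is hypothetical and not part of talosctl, and the sketch assumes `sha256sum` is available, as it is on most Linux systems.

```shell
# Hypothetical helper (not part of talosctl): basic sanity check of a snapshot file.
verify_snapshot() {
  local file="$1"
  # -s is true only if the file exists and has a size greater than zero
  if [ ! -s "$file" ]; then
    echo "snapshot missing or empty: $file" >&2
    return 1
  fi
  # Print a checksum that can be stored alongside the snapshot
  sha256sum "$file"
}
```

For example, `verify_snapshot db.snapshot` can be run right after the snapshot command above.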
Automated Backup Script
Run a script like the following from a cron job or scheduled task:
#!/bin/bash
# Abort on any error so a failed snapshot never triggers the cleanup below
set -euo pipefail
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
NODES="172.16.1.50"
mkdir -p "$BACKUP_DIR"
talosctl -n "$NODES" etcd snapshot "$BACKUP_DIR/etcd-$DATE.snapshot"
# Delete snapshots older than 7 days
find "$BACKUP_DIR" -name "*.snapshot" -mtime +7 -delete
Restore from Etcd Snapshot
Destructive Operation
Restoring from a snapshot resets the cluster. Use it only for disaster recovery.
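The recovery command itself is the `talosctl bootstrap --recover-from` call shown under Scenario 3. Because the operation is destructive, it can help to wrap it in a guard. The following is a minimal sketch under stated assumptions: the function name `restore_etcd` and the `CONFIRM` variable are hypothetical, and etcd is assumed not to be running on the target node when the function is called.

```shell
# Hypothetical wrapper around the recovery command; not part of talosctl.
restore_etcd() {
  local node="$1" snapshot="$2"
  # Refuse to run unless the operator opts in explicitly
  if [ "${CONFIRM:-}" != "yes" ]; then
    echo "refusing to restore: set CONFIRM=yes to proceed" >&2
    return 1
  fi
  # Never attempt recovery without a non-empty snapshot file
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Bootstrap etcd on the node from the snapshot
  talosctl -n "$node" bootstrap --recover-from="$snapshot"
}
```

A call would look like `CONFIRM=yes restore_etcd 172.16.1.50 ./db.snapshot`.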
Talos Configuration Backup
Talos machine configs are generated by talhelper and stored in Git.
Critical Files
talos/
├── talconfig.yaml          # Cluster definition
├── talsecret.sops.yaml     # Encrypted secrets
└── clusterconfig/          # Generated configs (gitignored)
    └── .gitignore
Backup Secrets
The talsecret.sops.yaml file contains the encrypted cluster secrets. Ensure you have:
- The Age private key backed up securely (for example, in a password manager)
- The SOPS config (.sops.yaml) committed to Git
Regenerate Configs
If you lose the generated configs, regenerate them with talhelper from talconfig.yaml and talsecret.sops.yaml.
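Since talhelper generates the configs from the two tracked files, they can be rebuilt at any time. The sketch below assumes `talhelper` and `sops` are installed, that the repository layout matches the tree under Critical Files, and that `SOPS_AGE_KEY_FILE` points at the age key; the function name `regenerate_configs` is hypothetical.

```shell
# Hypothetical helper: rebuild clusterconfig/ from talconfig.yaml and talsecret.sops.yaml.
regenerate_configs() {
  local talos_dir="${1:-talos}" f
  # talhelper needs both inputs; fail early if either is missing
  for f in talconfig.yaml talsecret.sops.yaml; do
    if [ ! -f "$talos_dir/$f" ]; then
      echo "missing $talos_dir/$f" >&2
      return 1
    fi
  done
  # Run in a subshell so the caller's working directory is unchanged
  ( cd "$talos_dir" && talhelper genconfig )
}
```

Typical use would be `export SOPS_AGE_KEY_FILE=/path/to/age.key` followed by `regenerate_configs talos` from the repository root.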
Kubernetes State Backup
All Kubernetes manifests are stored in Git and managed by Flux.
Full Restore from Git
# Clone the repository
git clone https://github.com/sagaragas/k3s-homelab.git
# Bootstrap Flux
kubectl apply -f kubernetes/flux-system/
Secrets Recovery
Secrets are encrypted with SOPS. To decrypt:
# Point SOPS at the age private key
export SOPS_AGE_KEY_FILE=/path/to/age.key
# Decrypt each secret in turn (pass sops one file per invocation)
for f in kubernetes/apps/*/secret.sops.yaml; do
  sops -d "$f"
done
PVC Backup with Velero (Optional)
For persistent data in PVCs, consider Velero.
Install Velero
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
  namespace: velero
spec:
  chart:
    spec:
      chart: velero
      sourceRef:
        kind: HelmRepository
        name: velero
  values:
    configuration:
      backupStorageLocation:
        bucket: velero-backups
        provider: aws # or other provider
      volumeSnapshotLocation:
        provider: csi
    snapshotsEnabled: true
Create Backup
# Backup a namespace
velero backup create my-backup --include-namespaces default
# Backup entire cluster
velero backup create full-backup
Restore
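With Velero, a restore is created from an existing backup by name. The sketch below checks that the backup is known before restoring it. It assumes the `velero` CLI is installed and pointed at the cluster; the function name `restore_from_backup` is hypothetical, and the lookup relies on `velero backup get NAME` exiting non-zero for an unknown backup.

```shell
# Hypothetical helper: restore a named Velero backup after confirming it exists.
restore_from_backup() {
  local backup="$1"
  # Look the backup up first; this is expected to fail for an unknown name
  if ! velero backup get "$backup" >/dev/null 2>&1; then
    echo "backup not found: $backup" >&2
    return 1
  fi
  # Create a restore from the backup
  velero restore create --from-backup "$backup"
}
```

For the namespace backup created above, the call would be `restore_from_backup my-backup`.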
Disaster Recovery Procedures
Scenario 1: Single Node Failure
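One plausible approach for a single failed node in a talhelper-managed cluster is to boot a replacement from the Talos ISO and apply that node's generated config. The sketch below assumes the replacement is still in maintenance mode (hence `--insecure`); the function name `replace_node` is hypothetical, and the config path depends on how talhelper names the generated files. A failed control plane node may also need its stale etcd member removed, which this sketch does not cover.

```shell
# Hypothetical helper: apply a generated machine config to a replacement node.
replace_node() {
  local node_ip="$1" config_file="$2"
  # Fail early if the generated config is not where the caller says it is
  if [ ! -f "$config_file" ]; then
    echo "config not found: $config_file" >&2
    return 1
  fi
  # --insecure is needed while the node has no config applied yet (maintenance mode)
  talosctl apply-config --insecure -n "$node_ip" --file "$config_file"
}
```

The function takes the node's IP address and the path to its generated config inside `talos/clusterconfig/`.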
Scenario 2: Complete Cluster Loss
- Provision new VMs with the Talos ISO
- Apply the Talos configs from backup
- Bootstrap etcd from a snapshot (or fresh if no snapshot exists)
- Install Flux to restore workloads
# Fresh bootstrap
talosctl bootstrap -n 172.16.1.50
# Get kubeconfig
talosctl kubeconfig
# Apply Flux
kubectl apply -f kubernetes/flux-system/
Scenario 3: Corrupted Etcd
# Stop etcd on all control planes (each node needs its full IP address)
talosctl -n 172.16.1.50,172.16.1.51,172.16.1.52 service etcd stop
# Restore from snapshot on first node
talosctl -n 172.16.1.50 bootstrap --recover-from=./db.snapshot
# Other nodes will rejoin automatically
Backup Checklist
- [ ] Age private key stored in password manager
- [ ] Etcd snapshots automated (daily)
- [ ] Git repository has all manifests
- [ ] SOPS-encrypted secrets committed
- [ ] Tested restore procedure
- [ ] Offsite backup copy (3-2-1 rule)
Testing Backups
Test your backups regularly, including the restore procedure.
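One check that is cheap to automate is whether a recent snapshot exists at all. The sketch below is a freshness check only and does not replace an actual restore test. The function name `snapshot_is_fresh` and the 26-hour default are assumptions; the default directory matches `BACKUP_DIR` from the backup script.

```shell
# Hypothetical helper: succeed only if a non-empty snapshot newer than max_hours exists.
snapshot_is_fresh() {
  local dir="${1:-/backups/etcd}" max_hours="${2:-26}"
  local recent
  # -mmin -N matches files modified less than N minutes ago; -size +0 skips empty files
  recent=$(find "$dir" -name '*.snapshot' -size +0 -mmin "-$((max_hours * 60))" 2>/dev/null | head -n 1)
  [ -n "$recent" ]
}
```

A monitoring job could run `snapshot_is_fresh /backups/etcd 26 || echo "no recent etcd snapshot"` and alert on the message.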