
Backup & Restore

This guide covers backup strategies for the Talos Kubernetes cluster.

What to Backup

Component             Method             Frequency
Etcd                  Talos snapshot     Daily
Talos configs         Git (talhelper)    On change
Kubernetes manifests  Git (Flux)         On change
Secrets (SOPS)        Git (encrypted)    On change
PVCs                  Velero/Restic      Daily

Etcd Backup

Etcd contains all Kubernetes state. This is the most critical backup.

Manual Snapshot

# Create snapshot from any control plane node
talosctl -n 172.16.1.50 etcd snapshot db.snapshot

# Verify snapshot
ls -la db.snapshot
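
Listing the file only shows that it exists. As a minimal sketch of a slightly stronger check, the helper below rejects a missing or empty snapshot and records a checksum beside it. The function name `verify_snapshot` is hypothetical (it is not part of talosctl), and the sketch assumes GNU `sha256sum` is installed.

```shell
# Sketch: sanity-check a snapshot file and record a checksum next to it.
# verify_snapshot is a hypothetical helper, not part of talosctl.
verify_snapshot() {
  snapshot="$1"
  # -s is true only when the file exists and is larger than zero bytes
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  # Store the checksum so later copies can be compared against it
  sha256sum "$snapshot" > "$snapshot.sha256"
}
```

Usage: `verify_snapshot db.snapshot`, then later `sha256sum -c db.snapshot.sha256` to confirm the file has not changed. This only checks file integrity; it does not prove the snapshot can be restored.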

Automated Backup Script

Run a script like the following from a cron job or scheduled task:

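#!/bin/bash
# Exit on errors, unset variables, and failed pipeline stages
set -euo pipefail

BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
NODE="172.16.1.50"  # any one control plane node

mkdir -p "$BACKUP_DIR"
talosctl -n "$NODE" etcd snapshot "$BACKUP_DIR/etcd-$DATE.snapshot"

# Delete snapshots older than 7 days
find "$BACKUP_DIR" -name "*.snapshot" -mtime +7 -delete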
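
To match the daily frequency listed earlier, the script needs a schedule. The sketch below prints a crontab entry that would run it at 02:00 every day. The script path `/usr/local/bin/etcd-backup.sh`, the log path, and the function name `backup_cron_line` are all assumptions for illustration; none of them come from this guide.

```shell
# Sketch: print a crontab entry that runs the backup script daily at 02:00.
# The script path and log path are hypothetical; adjust to your install.
backup_cron_line() {
  script="${1:-/usr/local/bin/etcd-backup.sh}"
  log="${2:-/var/log/etcd-backup.log}"
  # Fields: minute hour day-of-month month day-of-week command
  printf '0 2 * * * %s >> %s 2>&1\n' "$script" "$log"
}
```

One way to install it is `( crontab -l 2>/dev/null; backup_cron_line ) | crontab -`. Cron runs jobs with a minimal environment, so the script may need an absolute path to talosctl and an explicit `TALOSCONFIG`.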

Restore from Etcd Snapshot

Destructive Operation

This will reset the cluster. Only use for disaster recovery.

# Run once, against a single control plane node (not on every node)
talosctl -n 172.16.1.50 bootstrap --recover-from=./db.snapshot

Talos Configuration Backup

Talos machine configs are generated by talhelper and stored in Git.

Critical Files

talos/
├── talconfig.yaml          # Cluster definition
├── talsecret.sops.yaml     # Encrypted secrets
└── clusterconfig/          # Generated configs (gitignored)
    └── .gitignore

Backup Secrets

The talsecret.sops.yaml contains encrypted cluster secrets. Ensure you have:

  1. Age private key backed up securely (password manager)
  2. SOPS config (.sops.yaml) in Git

Regenerate Configs

If you lose the generated configs:

# Regenerate from talconfig.yaml
talhelper genconfig

# Or use task
task talos:generate

Kubernetes State Backup

All Kubernetes manifests are in Git and managed by Flux.

Full Restore from Git

# Clone the repository
git clone https://github.com/sagaragas/k3s-homelab.git

# Bootstrap Flux
kubectl apply -f kubernetes/flux-system/

Secrets Recovery

Secrets are encrypted with SOPS. To decrypt:

# Set the age key
export SOPS_AGE_KEY_FILE=/path/to/age.key

# Decrypt secrets (sops -d takes one file at a time, so loop over the matches)
for f in kubernetes/apps/*/secret.sops.yaml; do sops -d "$f"; done
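
Decryption depends entirely on the age key file, so it can help to check the file before calling sops. A minimal sketch follows, assuming the key path comes from an argument or from `SOPS_AGE_KEY_FILE`; the function name `check_age_key` is hypothetical, and sops itself does not require this check.

```shell
# Sketch: refuse to continue unless the age key file is present and private.
# check_age_key is a hypothetical helper; sops itself does not require it.
check_age_key() {
  key="${1:-${SOPS_AGE_KEY_FILE:-}}"
  if [ -z "$key" ] || [ ! -r "$key" ]; then
    echo "age key not set or not readable" >&2
    return 1
  fi
  # Characters 5-10 of the ls mode string are the group and other bits
  perms=$(ls -ld "$key" | cut -c5-10)
  if [ "$perms" != "------" ]; then
    echo "age key is accessible to group or others: $key" >&2
    return 1
  fi
}
```

Usage: `check_age_key && sops -d path/to/secret.sops.yaml`. The permission test is a design choice for this sketch: a key readable by other users is treated as a failure, and `chmod 600` on the key file clears it.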

PVC Backup with Velero (Optional)

For persistent data in PVCs, consider Velero:

Install Velero

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: velero
  namespace: velero
spec:
  interval: 1h  # required by the HelmRelease API; 1h is an assumed value
  chart:
    spec:
      chart: velero
      sourceRef:
        kind: HelmRepository
        name: velero
  values:
    configuration:
      backupStorageLocation:
        bucket: velero-backups
        provider: aws  # or other provider
      volumeSnapshotLocation:
        provider: csi
    snapshotsEnabled: true

Create Backup

# Backup a namespace
velero backup create my-backup --include-namespaces default

# Backup entire cluster
velero backup create full-backup

Restore

velero restore create --from-backup my-backup

Disaster Recovery Procedures

Scenario 1: Single Node Failure

See Node Failure Runbook

Scenario 2: Complete Cluster Loss

  1. Provision new VMs with Talos ISO
  2. Apply Talos configs from backup
  3. Bootstrap etcd from snapshot (or fresh if no snapshot)
  4. Install Flux to restore workloads
# Fresh bootstrap
talosctl bootstrap -n 172.16.1.50

# Get kubeconfig
talosctl -n 172.16.1.50 kubeconfig

# Apply Flux
kubectl apply -f kubernetes/flux-system/
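
Step 3 branches on whether a snapshot is available. As a rough sketch of that decision, the helper below prints the bootstrap command to run: recovery when a non-empty snapshot file exists, a fresh bootstrap otherwise. The function name `bootstrap_command` is hypothetical, and it only prints the command so it can be reviewed before anything destructive runs.

```shell
# Sketch: print the bootstrap command for a node, choosing recovery when a
# snapshot file exists and a fresh bootstrap otherwise.
# bootstrap_command is a hypothetical helper; it only prints, it runs nothing.
bootstrap_command() {
  node="$1"
  snapshot="${2:-}"
  if [ -n "$snapshot" ] && [ -s "$snapshot" ]; then
    printf 'talosctl -n %s bootstrap --recover-from=%s\n' "$node" "$snapshot"
  else
    printf 'talosctl -n %s bootstrap\n' "$node"
  fi
}
```

Usage: `bootstrap_command 172.16.1.50 ./db.snapshot` prints the recovery form when `./db.snapshot` exists and is non-empty, and the plain bootstrap form otherwise.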

Scenario 3: Corrupted Etcd

# Stop etcd on all control planes
talosctl -n 172.16.1.50,172.16.1.51,172.16.1.52 service etcd stop

# Restore from snapshot on first node
talosctl -n 172.16.1.50 bootstrap --recover-from=./db.snapshot

# Other nodes will rejoin automatically

Backup Checklist

  • [ ] Age private key stored in password manager
  • [ ] Etcd snapshots automated (daily)
  • [ ] Git repository has all manifests
  • [ ] SOPS-encrypted secrets committed
  • [ ] Tested restore procedure
  • [ ] Offsite backup copy (3-2-1 rule)
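
For the offsite item, one possible approach is to copy the newest snapshot to a second location and confirm the copy is identical. The sketch below assumes the destination is a directory path, for example a mounted remote share; the function name `copy_latest_snapshot` and both directory arguments are hypothetical.

```shell
# Sketch: copy the newest snapshot to a second location and confirm the copy
# is byte-identical. copy_latest_snapshot is a hypothetical helper.
copy_latest_snapshot() {
  src_dir="$1"
  dest_dir="$2"
  # ls -t sorts newest first; assumes snapshot names contain no whitespace
  latest=$(ls -t "$src_dir"/*.snapshot 2>/dev/null | head -n 1) || true
  if [ -z "$latest" ]; then
    echo "no snapshots found in $src_dir" >&2
    return 1
  fi
  mkdir -p "$dest_dir"
  cp "$latest" "$dest_dir/"
  # cmp -s exits non-zero if the two files differ
  cmp -s "$latest" "$dest_dir/$(basename "$latest")"
}
```

Usage: `copy_latest_snapshot /backups/etcd /mnt/offsite/etcd`, where `/mnt/offsite/etcd` stands in for whatever second location you use. A copy on the same disk or host does not satisfy the 3-2-1 rule.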

Testing Backups

Regularly test your backups:

# 1. Create test namespace
kubectl create ns backup-test

# 2. Deploy test app
kubectl -n backup-test run nginx --image=nginx

# 3. Backup
talosctl -n 172.16.1.50 etcd snapshot test-backup.snapshot

# 4. Delete and restore (in test environment only!)