Storage Strategy

This document outlines the storage architecture for the Minnova K3s cluster, including current limitations and planned improvements.

Current State: Local-Path Provisioner

The cluster currently uses K3s's built-in local-path provisioner for persistent storage. All PVCs are stored at /var/lib/rancher/k3s/storage/ on the apps server.

/var/lib/rancher/k3s/storage/
├── pvc-xxx_authentik_authentik-pg-1/     # PostgreSQL data
├── pvc-xxx_zulip_zulip-pg-1/             # PostgreSQL data
├── pvc-xxx_nextcloud_nextcloud-nextcloud/ # Nextcloud files
├── pvc-xxx_monitoring_prometheus-*/       # Prometheus TSDB
└── ...
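
One way to see where each claim actually lives on disk, and how much it consumes (a minimal sketch, assuming shell access to the apps server):

    # Map PVCs to their volumes and storage class
    kubectl get pvc -A -o wide

    # On the apps server: per-PVC usage under the local-path root
    sudo du -sh /var/lib/rancher/k3s/storage/*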

Limitations

Issue                Description                            Impact
Single-node bound    PVCs are tied to the local disk        Pods can't move to other nodes
No replication       Data exists only on one disk           Disk failure = data loss
No size enforcement  PVCs can grow beyond requested size    Must monitor manually
Rolling updates      Require 2x resources on the same node  Memory spikes during deploys

Current Mitigations

  • CloudNative-PG backups: All PostgreSQL databases back up to Cloudflare R2 (WAL archiving + daily full backups)
  • Velero/Kopia: Application-level backups to R2
  • Prometheus retention: Limited to 5 days / 15GB to prevent disk exhaustion
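
The retention cap maps to Prometheus's storage flags. A hedged sketch of setting it via Helm, assuming a kube-prometheus-stack deployment (release and namespace names are placeholders):

    # retention/retentionSize become --storage.tsdb.retention.time / .size
    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      -n monitoring --reuse-values \
      --set prometheus.prometheusSpec.retention=5d \
      --set prometheus.prometheusSpec.retentionSize=15GB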

Planned: Hetzner Volumes (CSI Driver)

Hetzner Volumes are cloud block storage that can be attached to any server in the same datacenter. Using the Hetzner CSI Driver, Kubernetes can dynamically provision and attach volumes.
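
Once the driver is installed (see the migration plan below), requesting cloud block storage is an ordinary PVC against the hcloud-volumes StorageClass. A minimal sketch; name and size are placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data               # placeholder name
    spec:
      accessModes: ["ReadWriteOnce"]   # Hetzner Volumes are single-attach
      storageClassName: hcloud-volumes
      resources:
        requests:
          storage: 10Gi
    EOF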

Why Hetzner Volumes (Not Longhorn)

Consideration   Hetzner Volumes                     Longhorn
Complexity      Simple CSI driver                   Complex distributed system
Management      Hetzner manages durability          Self-managed replication
Multi-node      Volumes move with pods              Data replicated across nodes
Failover speed  ~30-60s (detach/attach)             Near-instant
Cost            €0.052/GB/month                     "Free" (uses local disk)
Our DB HA       CloudNative-PG handles replication  Redundant with CNPG

Decision: Hetzner Volumes + CloudNative-PG replication is simpler and sufficient for our scale. Longhorn adds complexity without significant benefit when databases already replicate at the application layer.

How It Works

flowchart TB
    subgraph Node1["Node 1 (apps)"]
        Pod1[PostgreSQL Pod]
    end

    subgraph Node2["Node 2 (worker)"]
        Pod2[PostgreSQL Standby]
    end

    subgraph Hetzner["Hetzner Cloud Storage"]
        Vol1[(Volume 1)]
        Vol2[(Volume 2)]
    end

    Pod1 --> Vol1
    Pod2 --> Vol2
    Pod1 <-->|Streaming Replication| Pod2

When a pod moves:

  1. CSI driver detaches volume from old node
  2. CSI driver attaches volume to new node
  3. Pod starts with data intact (~30-60s total)

For databases, the CloudNative-PG standby takes over almost immediately, so service continues while the former primary's volume is reattached.
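
One way to watch the detach/attach cycle during a pod move (a hedged sketch; node, pod, and namespace names are placeholders):

    # Force the pod off its current node
    kubectl cordon apps-node                        # placeholder node name
    kubectl delete pod authentik-pg-1 -n authentik  # placeholder pod/namespace

    # Watch the CSI driver move the VolumeAttachment to the new node
    kubectl get volumeattachments -w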

Volume Characteristics

  • Persistent: Survives server reboots, upgrades, and even server deletion
  • Portable: Can be detached and attached to different servers
  • Scalable: Can be resized without downtime (see the sketch after this list)
  • Durable: Hetzner manages redundancy at the storage layer
  • Limitation: Single-attach only (one node at a time)
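
Resizing is a one-line patch against the claim, provided the StorageClass has allowVolumeExpansion enabled (check with kubectl get sc). A sketch; claim name and size are placeholders:

    # Grow the claim; the CSI driver expands the volume and filesystem
    # (volumes can grow, never shrink)
    kubectl patch pvc example-data -p \
      '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'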

Planned: Hetzner Storage Box (NFS)

Hetzner Storage Box provides NFS-accessible network storage, separate from VPS instances. Useful for data that doesn't need block storage performance.
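
Kubernetes consumes it via a statically defined NFS PersistentVolume plus a matching claim. A hedged sketch, assuming an NFS export reachable from the nodes; the server host and path are hypothetical placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: storagebox-nfs
    spec:
      capacity:
        storage: 1Ti
      accessModes: ["ReadWriteMany"]   # shared access across pods/nodes
      storageClassName: nfs
      nfs:
        server: storagebox.example.com # hypothetical Storage Box endpoint
        path: /export                  # hypothetical export path
    EOF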

Use Cases

Use Case         Why NFS Works
Nextcloud files  Large files, shared access okay
Media storage    Read-heavy, write-light
Backup targets   Sequential writes, not latency-sensitive
Shared configs   Read by multiple pods

When NOT to Use NFS

  • Databases (PostgreSQL, MySQL) - Need block storage; NFS locking and sync semantics put ACID guarantees at risk
  • Redis/caching - Latency-sensitive
  • High-IOPS workloads - Network round-trips add up

Storage Classes After Migration

StorageClass    Backend              Use Case                  Notes
hcloud-volumes  Hetzner Volumes      Databases, stateful apps  Default
nfs             Hetzner Storage Box  Large files, Nextcloud    Shared access
local-path      Local disk           Ephemeral, non-critical   Keep for testing
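
Making hcloud-volumes the default is an annotation change (K3s ships with local-path as the default class); a sketch:

    # Demote local-path, promote hcloud-volumes
    kubectl patch storageclass local-path -p \
      '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    kubectl patch storageclass hcloud-volumes -p \
      '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'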

Migration Plan

Phase 1: Install Hetzner CSI Driver

  1. Create Hetzner API token with volume permissions
  2. Store the token in the secret the CSI driver reads, then deploy it via Helm:
    kubectl -n kube-system create secret generic hcloud --from-literal=token=<API_TOKEN>
    helm repo add hcloud https://charts.hetzner.cloud
    helm install hcloud-csi hcloud/hcloud-csi -n kube-system
  3. Verify the hcloud-volumes StorageClass is available
  4. Test with a non-critical workload (see the smoke-test sketch after this list)
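
A minimal smoke test for steps 3-4; claim and pod names are placeholders. Note that the PVC may stay Pending until the pod mounts it if the class uses WaitForFirstConsumer binding:

    kubectl get storageclass hcloud-volumes

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: csi-smoke-test             # placeholder
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: hcloud-volumes
      resources:
        requests:
          storage: 10Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: csi-smoke-test             # placeholder
    spec:
      containers:
        - name: shell
          image: busybox
          command: ["sh", "-c", "echo ok > /data/probe && sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: csi-smoke-test
    EOF

    kubectl get pvc csi-smoke-test                        # expect Bound
    kubectl delete pod/csi-smoke-test pvc/csi-smoke-test  # clean up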

Phase 2: Migrate Databases

For each CloudNative-PG cluster:

  1. Verify backup is current (kubectl cnpg status <cluster>)
  2. Update the cluster spec to use the hcloud-volumes StorageClass (see the sketch after this list)
  3. Let CNPG handle migration via backup/restore
  4. Verify replication is healthy
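
A minimal sketch of step 2's spec change (cluster name, instance count, and size are placeholders). The StorageClass of existing PVCs can't be changed in place, which is why step 3 goes through backup/restore:

    # Excerpt of a CloudNative-PG Cluster manifest; only storage changes
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: authentik-pg               # placeholder cluster name
    spec:
      instances: 2
      storage:
        storageClass: hcloud-volumes
        size: 10Gi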

Phase 3: Migrate Other PVCs

  • Prometheus: Delete and recreate (historical data not critical, 5-day retention)
  • Grafana: Export dashboards, recreate PVC
  • Nextcloud files: Migrate to NFS (Storage Box) - see the copy Job sketch after this list
  • Vaultwarden: Backup, recreate with Hetzner Volume
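
For the file moves, a throwaway Job that mounts both the old and new claims can copy data in-cluster. A hedged sketch; the image and the new claim name are placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nextcloud-copy              # placeholder
      namespace: nextcloud
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: copy
              image: instrumentisto/rsync-ssh   # any image with rsync works
              command: ["rsync", "-a", "/old/", "/new/"]
              volumeMounts:
                - { name: old, mountPath: /old }
                - { name: new, mountPath: /new }
          volumes:
            - name: old
              persistentVolumeClaim:
                claimName: nextcloud-nextcloud  # existing local-path claim
            - name: new
              persistentVolumeClaim:
                claimName: nextcloud-nfs        # placeholder NFS-backed claim
    EOF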

Phase 4: Decommission local-path (Optional)

  1. Verify all critical workloads on Hetzner Volumes/NFS
  2. Keep local-path for non-critical workloads (lower cost)
  3. Clean up old PVC directories

Backup Strategy

Backups remain multi-layered:

┌─────────────────────────────────────────────────┐
│  Application Layer                              │
│  - CloudNative-PG → R2 (WAL + daily backups)   │
│  - Velero/Kopia → R2 (app snapshots)           │
└─────────────────────────────────────────────────┘
                      +
┌─────────────────────────────────────────────────┐
│  Storage Layer                                  │
│  - Hetzner Volumes (Hetzner-managed durability)│
│  - Storage Box (separate from compute)         │
└─────────────────────────────────────────────────┘
                      +
┌─────────────────────────────────────────────────┐
│  Disaster Recovery                              │
│  - All backups in Cloudflare R2                │
│  - Cross-region by default                     │
└─────────────────────────────────────────────────┘
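
The Velero layer is driven by Schedule resources. A hedged sketch of a daily application snapshot (name, namespaces, and TTL are placeholders):

    kubectl apply -f - <<'EOF'
    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: daily-apps                 # placeholder
      namespace: velero
    spec:
      schedule: "0 3 * * *"            # daily at 03:00
      template:
        includedNamespaces: ["nextcloud", "zulip"]  # placeholders
        ttl: 720h                      # keep 30 days
    EOF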

Cost Estimate

Storage                   Size   Monthly Cost
Hetzner Volumes (DBs)     ~50GB  ~€2.60
Storage Box BX11 (files)  1TB    €3.81
Total                            ~€6.50