Storage Strategy

This document outlines the storage architecture for the Minnova K3s cluster, including current limitations and planned improvements.

Current State: Local-Path Provisioner

The cluster currently uses K3s's built-in local-path provisioner for persistent storage. All PVCs are stored at /var/lib/rancher/k3s/storage/ on the apps server.

/var/lib/rancher/k3s/storage/
├── pvc-xxx_authentik_authentik-pg-1/     # PostgreSQL data
├── pvc-xxx_zulip_zulip-pg-1/             # PostgreSQL data
├── pvc-xxx_nextcloud_nextcloud-nextcloud/ # Nextcloud files
├── pvc-xxx_monitoring_prometheus-*/       # Prometheus TSDB
└── ...
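
One way to see where each claim actually lives on disk, and how much it consumes (a minimal sketch, assuming shell access to the apps server):

    # Map PVCs to their volumes and storage class
    kubectl get pvc -A -o wide

    # On the apps server: per-PVC usage under the local-path root
    sudo du -sh /var/lib/rancher/k3s/storage/*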

Limitations

Issue                Description                            Impact
Single-node bound    PVCs are tied to the local disk        Pods can't move to other nodes
No replication       Data exists only on one disk           Disk failure = data loss
No size enforcement  PVCs can grow beyond requested size    Must monitor manually
Rolling updates      Require 2x resources on the same node  Memory spikes during deploys

Current Mitigations

  • CloudNative-PG backups: All PostgreSQL databases back up to Cloudflare R2 (WAL archiving + daily full backups)
  • Velero/Kopia: Application-level backups to R2
  • Prometheus retention: Limited to 5 days / 15GB to prevent disk exhaustion
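
The retention cap maps to Prometheus's storage flags. A hedged sketch of setting it via Helm, assuming a kube-prometheus-stack deployment (release and namespace names are placeholders):

    # retention/retentionSize become --storage.tsdb.retention.time / .size
    helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
      -n monitoring --reuse-values \
      --set prometheus.prometheusSpec.retention=5d \
      --set prometheus.prometheusSpec.retentionSize=15GB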

Planned: Hetzner Volumes (CSI Driver)

Hetzner Volumes are cloud block storage that can be attached to any server in the same datacenter. Using the Hetzner CSI Driver, Kubernetes can dynamically provision and attach volumes.
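
Once the driver is installed (see the migration plan below), requesting cloud block storage is an ordinary PVC against the hcloud-volumes StorageClass. A minimal sketch; name and size are placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data               # placeholder name
    spec:
      accessModes: ["ReadWriteOnce"]   # Hetzner Volumes are single-attach
      storageClassName: hcloud-volumes
      resources:
        requests:
          storage: 10Gi
    EOF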

Why Hetzner Volumes (Not Longhorn)

Consideration   Hetzner Volumes                     Longhorn
Complexity      Simple CSI driver                   Complex distributed system
Management      Hetzner manages durability          Self-managed replication
Multi-node      Volumes move with pods              Data replicated across nodes
Failover speed  ~30-60s (detach/attach)             Near-instant
Cost            €0.052/GB/month                     "Free" (uses local disk)
Our DB HA       CloudNative-PG handles replication  Redundant with CNPG

Decision: Hetzner Volumes + CloudNative-PG replication is simpler and sufficient for our scale. Longhorn adds complexity without significant benefit when databases already replicate at the application layer.

How It Works

flowchart TB
    subgraph Node1["Node 1 (apps)"]
        Pod1[PostgreSQL Pod]
    end

    subgraph Node2["Node 2 (worker)"]
        Pod2[PostgreSQL Standby]
    end

    subgraph Hetzner["Hetzner Cloud Storage"]
        Vol1[(Volume 1)]
        Vol2[(Volume 2)]
    end

    Pod1 --> Vol1
    Pod2 --> Vol2
    Pod1 <-->|Streaming Replication| Pod2

When a pod moves:

  1. CSI driver detaches volume from old node
  2. CSI driver attaches volume to new node
  3. Pod starts with data intact (~30-60s total)

For databases, the CloudNative-PG standby takes over almost immediately, so service continues while the former primary's volume is reattached.
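
One way to watch the detach/attach cycle during a pod move (a hedged sketch; node, pod, and namespace names are placeholders):

    # Force the pod off its current node
    kubectl cordon apps-node                        # placeholder node name
    kubectl delete pod authentik-pg-1 -n authentik  # placeholder pod/namespace

    # Watch the CSI driver move the VolumeAttachment to the new node
    kubectl get volumeattachments -w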

Volume Characteristics

  • Persistent: Survives server reboots, upgrades, and even server deletion
  • Portable: Can be detached and attached to different servers
  • Scalable: Can be resized without downtime (see the sketch after this list)
  • Durable: Hetzner manages redundancy at the storage layer
  • Limitation: Single-attach only (one node at a time)
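
Resizing is a one-line patch against the claim, provided the StorageClass has allowVolumeExpansion enabled (check with kubectl get sc). A sketch; claim name and size are placeholders:

    # Grow the claim; the CSI driver expands the volume and filesystem
    # (volumes can grow, never shrink)
    kubectl patch pvc example-data -p \
      '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'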

Planned: Hetzner Storage Box (NFS)

Hetzner Storage Box provides NFS-accessible network storage, separate from VPS instances. Useful for data that doesn't need block storage performance.
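
Kubernetes consumes it via a statically defined NFS PersistentVolume plus a matching claim. A hedged sketch, assuming an NFS export reachable from the nodes; the server host and path are hypothetical placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: storagebox-nfs
    spec:
      capacity:
        storage: 1Ti
      accessModes: ["ReadWriteMany"]   # shared access across pods/nodes
      storageClassName: nfs
      nfs:
        server: storagebox.example.com # hypothetical Storage Box endpoint
        path: /export                  # hypothetical export path
    EOF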

Use Cases

Use Case         Why NFS Works
Nextcloud files  Large files, shared access okay
Media storage    Read-heavy, write-light
Backup targets   Sequential writes, not latency-sensitive
Shared configs   Read by multiple pods

When NOT to Use NFS

  • Databases (PostgreSQL, MySQL) - Need block storage; NFS locking and sync semantics put ACID guarantees at risk
  • Redis/caching - Latency-sensitive
  • High-IOPS workloads - Network round-trips add up

Storage Classes After Migration

StorageClass    Backend              Use Case                  Notes
hcloud-volumes  Hetzner Volumes      Databases, stateful apps  Default
nfs             Hetzner Storage Box  Large files, Nextcloud    Shared access
local-path      Local disk           Ephemeral, non-critical   Keep for testing
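
Making hcloud-volumes the default is an annotation change (K3s ships with local-path as the default class); a sketch:

    # Demote local-path, promote hcloud-volumes
    kubectl patch storageclass local-path -p \
      '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    kubectl patch storageclass hcloud-volumes -p \
      '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'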

Migration Plan

Phase 1: Install Hetzner CSI Driver

  1. Create Hetzner API token with volume permissions
  2. Store the token in the secret the CSI driver reads, then deploy it via Helm:
    kubectl -n kube-system create secret generic hcloud --from-literal=token=<API_TOKEN>
    helm repo add hcloud https://charts.hetzner.cloud
    helm install hcloud-csi hcloud/hcloud-csi -n kube-system
  3. Verify the hcloud-volumes StorageClass is available
  4. Test with a non-critical workload (see the smoke-test sketch after this list)
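
A minimal smoke test for steps 3-4; claim and pod names are placeholders. Note that the PVC may stay Pending until the pod mounts it if the class uses WaitForFirstConsumer binding:

    kubectl get storageclass hcloud-volumes

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: csi-smoke-test             # placeholder
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: hcloud-volumes
      resources:
        requests:
          storage: 10Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: csi-smoke-test             # placeholder
    spec:
      containers:
        - name: shell
          image: busybox
          command: ["sh", "-c", "echo ok > /data/probe && sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: csi-smoke-test
    EOF

    kubectl get pvc csi-smoke-test                        # expect Bound
    kubectl delete pod/csi-smoke-test pvc/csi-smoke-test  # clean up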

Phase 2: Migrate Databases

For each CloudNative-PG cluster:

  1. Verify backup is current (kubectl cnpg status <cluster>)
  2. Update the cluster spec to use the hcloud-volumes StorageClass (see the sketch after this list)
  3. Let CNPG handle migration via backup/restore
  4. Verify replication is healthy
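
A minimal sketch of step 2's spec change (cluster name, instance count, and size are placeholders). The StorageClass of existing PVCs can't be changed in place, which is why step 3 goes through backup/restore:

    # Excerpt of a CloudNative-PG Cluster manifest; only storage changes
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: authentik-pg               # placeholder cluster name
    spec:
      instances: 2
      storage:
        storageClass: hcloud-volumes
        size: 10Gi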

Phase 3: Migrate Other PVCs

  • Prometheus: Delete and recreate (historical data not critical, 5-day retention)
  • Grafana: Export dashboards, recreate PVC
  • Nextcloud files: Migrate to NFS (Storage Box) - see the copy Job sketch after this list
  • Vaultwarden: Backup, recreate with Hetzner Volume
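
For the file moves, a throwaway Job that mounts both the old and new claims can copy data in-cluster. A hedged sketch; the image and the new claim name are placeholders:

    kubectl apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nextcloud-copy              # placeholder
      namespace: nextcloud
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: copy
              image: instrumentisto/rsync-ssh   # any image with rsync works
              command: ["rsync", "-a", "/old/", "/new/"]
              volumeMounts:
                - { name: old, mountPath: /old }
                - { name: new, mountPath: /new }
          volumes:
            - name: old
              persistentVolumeClaim:
                claimName: nextcloud-nextcloud  # existing local-path claim
            - name: new
              persistentVolumeClaim:
                claimName: nextcloud-nfs        # placeholder NFS-backed claim
    EOF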

Phase 4: Decommission local-path (Optional)

  1. Verify all critical workloads on Hetzner Volumes/NFS
  2. Keep local-path for non-critical workloads (lower cost)
  3. Clean up old PVC directories

Backup Strategy

Backups remain multi-layered:

┌─────────────────────────────────────────────────┐
│  Application Layer                              │
│  - CloudNative-PG → R2 (WAL + daily backups)   │
│  - Velero/Kopia → R2 (app snapshots)           │
└─────────────────────────────────────────────────┘
                      +
┌─────────────────────────────────────────────────┐
│  Storage Layer                                  │
│  - Hetzner Volumes (Hetzner-managed durability)│
│  - Storage Box (separate from compute)         │
└─────────────────────────────────────────────────┘
                      +
┌─────────────────────────────────────────────────┐
│  Disaster Recovery                              │
│  - All backups in Cloudflare R2                │
│  - Cross-region by default                     │
└─────────────────────────────────────────────────┘
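
The Velero layer is driven by Schedule resources. A hedged sketch of a daily application snapshot (name, namespaces, and TTL are placeholders):

    kubectl apply -f - <<'EOF'
    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: daily-apps                 # placeholder
      namespace: velero
    spec:
      schedule: "0 3 * * *"            # daily at 03:00
      template:
        includedNamespaces: ["nextcloud", "zulip"]  # placeholders
        ttl: 720h                      # keep 30 days
    EOF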

Cost Estimate

Storage                   Size   Monthly Cost
Hetzner Volumes (DBs)     ~50GB  ~€2.60
Storage Box BX11 (files)  1TB    €3.81
Total                            ~€6.50