Infrastructure Architecture

This document describes the Hetzner infrastructure where Minnova runs its internal tooling. The architecture prioritizes security (no inbound public access on application servers) and cost efficiency.

Servers

All servers run Debian. The setup uses three servers with distinct roles:

| Server | Specs | IPs / Access | Services |
|---|---|---|---|
| headscale | CX22 (2 vCPU, 4GB RAM) | Public IPv4 159.69.247.237 | Headscale VPN coordinator, Traefik reverse proxy (Podman) |
| bastion | CX22 (2 vCPU, 4GB RAM) | Private 10.0.1.1 + VPN 100.64.0.6 (public IPv4 outbound only) | Tailscale client, SSH jump host |
| apps | CX53 (16 vCPU, 32GB RAM) | Private 10.0.2.1 (public IPv4 outbound only) | K3s cluster (cloudflared tunnel, Traefik, Authentik, ArgoCD, Grafana/Prometheus/Loki, Portainer, Homepage, Gatus/status, Forgejo, Nextcloud, Umami, Zulip, Kimai, Hoop, Glance) |

The headscale server is the only one intended to accept inbound traffic from the public internet (ports 80/443). It runs Traefik as a reverse proxy to handle TLS termination for the Headscale service.
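
A minimal sketch of the Traefik dynamic configuration for this, using the file provider - the certificate resolver name and the Headscale listen address are assumptions, not the live config:

```yaml
# Traefik dynamic config (file provider) - sketch; resolver name and upstream address are assumptions
http:
  routers:
    headscale:
      rule: Host(`headscale.minnova.io`)
      entryPoints:
        - websecure
      service: headscale
      tls:
        certResolver: letsencrypt   # assumed ACME resolver defined in the static config
  services:
    headscale:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:8080   # assumed Headscale listen address
```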

The bastion server acts as an SSH jump host. It sits on the private network but is also connected to the Tailscale mesh network (100.64.0.6). Admins SSH to the bastion first, then hop to other servers. This keeps SSH access off the public internet entirely.

The apps server hosts internal services on a single-node K3s cluster. Public IPv4 is enabled for outbound-only traffic (updates, tunnel egress), but inbound access is blocked by firewall rules. Web traffic reaches apps through Cloudflare Tunnel → Traefik, and SSH access goes through the bastion.

Services

Web services are exposed through Cloudflare Tunnel with automatic HTTPS. The tunnel runs cloudflared on the apps server, which creates outbound-only connections to Cloudflare's edge. This means no inbound ports need to be opened - Cloudflare routes requests through the tunnel to the local services.
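
As an illustration, a cloudflared config along these lines maps the public hostnames to Traefik inside the cluster - the tunnel ID, credentials path, and upstream service address are placeholders, not the actual tunnel configuration:

```yaml
# cloudflared config (sketch) - tunnel ID, credentials path, and upstream address are placeholders
tunnel: <tunnel-uuid>
credentials-file: /etc/cloudflared/<tunnel-uuid>.json
ingress:
  - hostname: grafana.minnova.io
    service: http://traefik.kube-system.svc.cluster.local:80
  - hostname: auth.minnova.io
    service: http://traefik.kube-system.svc.cluster.local:80
  # one rule per public hostname
  - service: http_status:404   # catch-all for anything not listed
```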

| Service | URL | Purpose |
|---|---|---|
| Authentik | auth.minnova.io | Single Sign-On provider for internal services |
| Headscale | headscale.minnova.io | VPN coordinator - allows Tailscale clients to join |
| Grafana | grafana.minnova.io | Dashboards for metrics and logs |
| Prometheus | prometheus.minnova.io | Metrics collection (5-day retention, 15GB limit) |
| Loki | (internal) | Log aggregation (7-day retention) |
| ArgoCD | argocd.minnova.io | GitOps controller for cluster apps |
| Portainer | portainer.minnova.io | K3s management UI |
| Forgejo | forgejo.minnova.io | Self-hosted Git with container registry |
| Gatus | status.minnova.io | Status page / health checks |
| Traefik | traefik.minnova.io | Ingress controller dashboard |
| Nextcloud | nextcloud.minnova.io | File sharing and collaboration |
| Homepage | homepage.minnova.io | Internal landing / service directory |
| Glance | glance.minnova.io | Dashboard with feeds and widgets |
| Umami | analytics.minnova.io | Privacy-focused web analytics |
| Zulip | zulip.minnova.io | Team chat and communication |
| Kimai | kimai.minnova.io | Time tracking |
| Hoop | hoop.minnova.io | Secure database/server access gateway |

The Oracle knowledge base runs on Cloudflare Pages (with Access), not on the apps cluster.

Authentik is the identity provider for all services. When you log into Headscale or Grafana, you're actually authenticating against Authentik via OIDC. This centralizes user management - add someone to Authentik and they can access all integrated services.
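
For example, Headscale's OIDC section points at Authentik as the issuer. A minimal sketch - the Authentik application slug, client ID, and secret path are assumptions:

```yaml
# headscale config.yaml (excerpt, sketch) - issuer slug, client ID, and secret path are assumptions
oidc:
  issuer: https://auth.minnova.io/application/o/headscale/
  client_id: headscale
  client_secret_path: /etc/headscale/oidc_client_secret
```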

Headscale is the exception to the Cloudflare Tunnel pattern. It's the coordination server that Tailscale clients connect to when joining the VPN, and those clients hold a long-lived upgraded HTTP connection that doesn't pass reliably through Cloudflare's proxy, so Headscale needs a real public IP.

Grafana provides dashboards for monitoring. It pulls metrics from Prometheus and logs from Loki, giving visibility into server health and application behavior.
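
Both backends are wired in as Grafana datasources, which can be expressed with provisioning. A minimal sketch - the in-cluster service URLs are assumptions that depend on how the charts are deployed:

```yaml
# Grafana datasource provisioning (sketch) - service URLs are assumptions
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```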

Architecture Diagram

```mermaid
flowchart TB
    subgraph Internet
        Users([Users])
        Admin([Admin])
    end

    subgraph Cloudflare
        Tunnel[Tunnel]
    end

    subgraph Hetzner["Hetzner Cloud"]
        HS[Headscale<br/>159.69.247.237]

        subgraph Private["Private Network 10.0.0.0/16"]
            B[bastion<br/>10.0.1.1]
            Apps[apps<br/>10.0.2.1]
        end
    end

    Users --> Tunnel --> Apps
    Admin --> HS --> B -->|SSH| Apps
```

Traffic Paths

There are two distinct paths for accessing infrastructure:

| Path | Use Case | Flow |
|---|---|---|
| Public | Web services (Authentik, Grafana, etc.) | Internet → Cloudflare Tunnel → cloudflared on apps → Traefik → K3s service |
| Admin | SSH access for maintenance | Tailscale client → Headscale (OIDC auth) → Bastion (100.64.0.6) → SSH to 10.0.x.x |

The admin path requires Authentik authentication via OIDC before Headscale grants VPN access. This ensures only authorized team members can SSH into servers.

Stack

| Category | Tools |
|---|---|
| IaC | OpenTofu, Ansible, SOPS |
| Orchestration | K3s (apps), Podman + systemd (headscale) |
| Identity | Authentik (OIDC) |
| VPN | Headscale + Tailscale |
| Ingress | Cloudflare Tunnel → Traefik |
| Observability | Grafana, Prometheus, Loki |
| Analytics | Umami (self-hosted) |
| Security | CrowdSec (IDS/IPS) |

Infrastructure is defined in two layers. OpenTofu (an open-source Terraform fork) manages cloud resources - servers, networks, firewalls, DNS records, GitHub repositories. Ansible configures the servers themselves - installing packages, deploying containers, managing configuration files.
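
To illustrate the Ansible layer, a play for the headscale host might look like the following - the host group, package list, and file paths are hypothetical:

```yaml
# Ansible play (sketch) - host group, packages, and paths are hypothetical
- name: Configure headscale host
  hosts: headscale
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.apt:
        name:
          - podman
          - curl
        state: present
        update_cache: true

    - name: Deploy Headscale configuration
      ansible.builtin.template:
        src: templates/headscale-config.yaml.j2
        dest: /etc/headscale/config.yaml
        mode: "0640"
```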

Secrets are handled by SOPS with Age encryption. Secret files live in Git but are encrypted, so they can be version controlled without exposing credentials. For Kubernetes workloads, the SOPS Secrets Operator decrypts secrets in-cluster at deploy time.
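
The encryption rules live in a .sops.yaml at the repository root. A minimal sketch - the path pattern and Age recipient are placeholders:

```yaml
# .sops.yaml (sketch) - path pattern and Age recipient are placeholders
creation_rules:
  - path_regex: .*\.secret\.ya?ml$
    age: age1<recipient-public-key>
    encrypted_regex: ^(data|stringData)$   # only encrypt Secret payload fields
```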

Standalone containers (e.g., Traefik on headscale) run under Podman instead of Docker. The main advantage is rootless operation - containers run as a regular user rather than root, reducing the blast radius if a container is compromised. K3s workloads on apps are managed via Kubernetes manifests and Helm.

The observability stack (Grafana, Prometheus, Loki) runs on the apps server. Prometheus scrapes metrics from Node Exporter on each server, while Loki aggregates logs shipped by Alloy. Grafana provides a unified interface to query both. Retention settings (see the sketch after the list):

  • Prometheus: 5-day retention with 15GB size limit (auto-compacts when limit reached)
  • Loki: 7-day retention with 10GB limit
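
These limits correspond roughly to the following Helm values, assuming the kube-prometheus-stack and Loki charts - key paths vary between chart versions:

```yaml
# Retention settings as Helm values (sketch) - key paths vary by chart version
# values for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 5d          # time-based retention
    retentionSize: 15GB    # size-based cap; oldest blocks are dropped first
---
# values for loki (the 10GB cap is assumed to be enforced via the PVC size)
loki:
  limits_config:
    retention_period: 168h   # 7 days
  compactor:
    retention_enabled: true
```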

Alerting is handled by Alertmanager with Zulip integration for critical alerts (node health, pod crashes, disk space, PostgreSQL issues).
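
A minimal sketch of the corresponding Alertmanager route, assuming Zulip's Alertmanager incoming-webhook integration - the bot API key, stream, and severity label are placeholders:

```yaml
# alertmanager.yml (excerpt, sketch) - API key, stream, and severity label are placeholders
route:
  receiver: zulip
  routes:
    - matchers:
        - severity="critical"
      receiver: zulip
receivers:
  - name: zulip
    webhook_configs:
      - url: "https://zulip.minnova.io/api/v1/external/alertmanager?api_key=<bot-api-key>&stream=alerts"
        send_resolved: true
```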