Infrastructure Architecture

This document describes the Hetzner infrastructure where Minnova runs its internal tooling. The architecture prioritizes security (no inbound public access on application servers) and cost efficiency.

Servers

All servers run Debian. The setup uses three servers with distinct roles:

| Server | Specs | IPs / Access | Services |
|---|---|---|---|
| headscale | CX22 (2 vCPU, 4GB RAM) | Public IPv4 159.69.247.237 | Headscale VPN coordinator, Traefik reverse proxy (Podman) |
| bastion | CX22 (2 vCPU, 4GB RAM) | Private 10.0.1.1 + VPN 100.64.0.6 (public IPv4 outbound only) | Tailscale client, SSH jump host |
| apps | CX53 (16 vCPU, 32GB RAM) | Private 10.0.2.1 (public IPv4 outbound only) | K3s cluster (cloudflared tunnel, Traefik, Authentik, ArgoCD, Grafana/Prometheus/Loki, Portainer, Homepage, Gatus/status, Forgejo, Nextcloud, Umami, Zulip, Kimai, Hoop, Glance) |

The headscale server is the only one intended to accept inbound traffic from the public internet (ports 80/443). It runs Traefik as a reverse proxy to handle TLS termination for the Headscale service.
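
A minimal sketch of the Traefik dynamic configuration for this, using the file provider - the certificate resolver name and the Headscale listen address are assumptions, not the live config:

```yaml
# Traefik dynamic config (file provider) - sketch; resolver name and upstream address are assumptions
http:
  routers:
    headscale:
      rule: Host(`headscale.minnova.io`)
      entryPoints:
        - websecure
      service: headscale
      tls:
        certResolver: letsencrypt   # assumed ACME resolver defined in the static config
  services:
    headscale:
      loadBalancer:
        servers:
          - url: http://127.0.0.1:8080   # assumed Headscale listen address
```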

The bastion server acts as an SSH jump host. It sits on the private network but is also connected to the Tailscale mesh network (100.64.0.6). Admins SSH to the bastion first, then hop to other servers. This keeps SSH access off the public internet entirely.

The apps server hosts internal services on a single-node K3s cluster. Public IPv4 is enabled for outbound-only traffic (updates, tunnel egress), but inbound access is blocked by firewall rules. Web traffic reaches apps through Cloudflare Tunnel → Traefik, and SSH access goes through the bastion.

Services

Web services are exposed through Cloudflare Tunnel with automatic HTTPS. The tunnel runs cloudflared on the apps server, which creates outbound-only connections to Cloudflare's edge. This means no inbound ports need to be opened - Cloudflare routes requests through the tunnel to the local services.
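
As an illustration, a cloudflared config along these lines maps the public hostnames to Traefik inside the cluster - the tunnel ID, credentials path, and upstream service address are placeholders, not the actual tunnel configuration:

```yaml
# cloudflared config (sketch) - tunnel ID, credentials path, and upstream address are placeholders
tunnel: <tunnel-uuid>
credentials-file: /etc/cloudflared/<tunnel-uuid>.json
ingress:
  - hostname: grafana.minnova.io
    service: http://traefik.kube-system.svc.cluster.local:80
  - hostname: auth.minnova.io
    service: http://traefik.kube-system.svc.cluster.local:80
  # one rule per public hostname
  - service: http_status:404   # catch-all for anything not listed
```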

| Service | URL | Purpose |
|---|---|---|
| Authentik | auth.minnova.io | Single Sign-On provider for internal services |
| Headscale | headscale.minnova.io | VPN coordinator - allows Tailscale clients to join |
| Grafana | grafana.minnova.io | Dashboards for metrics and logs |
| Prometheus | prometheus.minnova.io | Metrics collection (5-day retention, 15GB limit) |
| Loki | (internal) | Log aggregation (7-day retention) |
| ArgoCD | argocd.minnova.io | GitOps controller for cluster apps |
| Portainer | portainer.minnova.io | K3s management UI |
| Forgejo | forgejo.minnova.io | Self-hosted Git with container registry |
| Gatus | status.minnova.io | Status page / health checks |
| Traefik | traefik.minnova.io | Ingress controller dashboard |
| Nextcloud | nextcloud.minnova.io | File sharing and collaboration |
| Homepage | homepage.minnova.io | Internal landing / service directory |
| Glance | glance.minnova.io | Dashboard with feeds and widgets |
| Umami | analytics.minnova.io | Privacy-focused web analytics |
| Zulip | zulip.minnova.io | Team chat and communication |
| Kimai | kimai.minnova.io | Time tracking |
| Hoop | hoop.minnova.io | Secure database/server access gateway |

The Oracle knowledge base runs on Cloudflare Pages (with Access), not on the apps cluster.

Authentik is the identity provider for all services. When you log into Headscale or Grafana, you're actually authenticating against Authentik via OIDC. This centralizes user management - add someone to Authentik and they can access all integrated services.
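
For example, Headscale's OIDC section points at Authentik as the issuer. A minimal sketch - the Authentik application slug, client ID, and secret path are assumptions:

```yaml
# headscale config.yaml (excerpt, sketch) - issuer slug, client ID, and secret path are assumptions
oidc:
  issuer: https://auth.minnova.io/application/o/headscale/
  client_id: headscale
  client_secret_path: /etc/headscale/oidc_client_secret
```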

Headscale is the exception to the Cloudflare Tunnel pattern. It's the coordination server that Tailscale clients connect to when joining the VPN, and those clients hold a long-lived upgraded HTTP connection that doesn't pass reliably through Cloudflare's proxy, so Headscale needs a real public IP.

Grafana provides dashboards for monitoring. It pulls metrics from Prometheus and logs from Loki, giving visibility into server health and application behavior.
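
Both backends are wired in as Grafana datasources, which can be expressed with provisioning. A minimal sketch - the in-cluster service URLs are assumptions that depend on how the charts are deployed:

```yaml
# Grafana datasource provisioning (sketch) - service URLs are assumptions
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```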

Architecture Diagram

```mermaid
flowchart TB
    subgraph Internet
        Users([Users])
        Admin([Admin])
    end

    subgraph Cloudflare
        Tunnel[Tunnel]
    end

    subgraph Hetzner["Hetzner Cloud"]
        HS[Headscale<br/>159.69.247.237]

        subgraph Private["Private Network 10.0.0.0/16"]
            B[bastion<br/>10.0.1.1]
            Apps[apps<br/>10.0.2.1]
        end
    end

    Users --> Tunnel --> Apps
    Admin --> HS --> B -->|SSH| Apps
```

Traffic Paths

There are two distinct paths for accessing infrastructure:

| Path | Use Case | Flow |
|---|---|---|
| Public | Web services (Authentik, Grafana, etc.) | Internet → Cloudflare Tunnel → cloudflared on apps → Traefik → K3s service |
| Admin | SSH access for maintenance | Tailscale client → Headscale (OIDC auth) → Bastion (100.64.0.6) → SSH to 10.0.x.x |

The admin path requires Authentik authentication via OIDC before Headscale grants VPN access. This ensures only authorized team members can SSH into servers.

Stack

| Category | Tools |
|---|---|
| IaC | OpenTofu, Ansible, SOPS |
| Orchestration | K3s (apps), Podman + systemd (headscale) |
| Identity | Authentik (OIDC) |
| VPN | Headscale + Tailscale |
| Ingress | Cloudflare Tunnel → Traefik |
| Observability | Grafana, Prometheus, Loki |
| Analytics | Umami (self-hosted) |
| Security | CrowdSec (IDS/IPS) |

Infrastructure is defined in two layers. OpenTofu (an open-source Terraform fork) manages cloud resources - servers, networks, firewalls, DNS records, GitHub repositories. Ansible configures the servers themselves - installing packages, deploying containers, managing configuration files.
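
To illustrate the Ansible layer, a play for the headscale host might look like the following - the host group, package list, and file paths are hypothetical:

```yaml
# Ansible play (sketch) - host group, packages, and paths are hypothetical
- name: Configure headscale host
  hosts: headscale
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.apt:
        name:
          - podman
          - curl
        state: present
        update_cache: true

    - name: Deploy Headscale configuration
      ansible.builtin.template:
        src: templates/headscale-config.yaml.j2
        dest: /etc/headscale/config.yaml
        mode: "0640"
```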

Secrets are handled by SOPS with Age encryption. Secret files live in Git but are encrypted, so they can be version controlled without exposing credentials. For Kubernetes workloads, the SOPS Secrets Operator decrypts secrets in-cluster at deploy time.
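
The encryption rules live in a .sops.yaml at the repository root. A minimal sketch - the path pattern and Age recipient are placeholders:

```yaml
# .sops.yaml (sketch) - path pattern and Age recipient are placeholders
creation_rules:
  - path_regex: .*\.secret\.ya?ml$
    age: age1<recipient-public-key>
    encrypted_regex: ^(data|stringData)$   # only encrypt Secret payload fields
```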

Standalone containers (e.g., Traefik on headscale) run under Podman instead of Docker. The main advantage is rootless operation - containers run as a regular user rather than root, reducing the blast radius if a container is compromised. K3s workloads on apps are managed via Kubernetes manifests and Helm.

The observability stack (Grafana, Prometheus, Loki) runs on the apps server. Prometheus scrapes metrics from Node Exporter on each server, while Loki aggregates logs shipped by Alloy. Grafana provides a unified interface to query both. Retention settings (see the sketch after the list):

  • Prometheus: 5-day retention with 15GB size limit (auto-compacts when limit reached)
  • Loki: 7-day retention with 10GB limit
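
These limits correspond roughly to the following Helm values, assuming the kube-prometheus-stack and Loki charts - key paths vary between chart versions:

```yaml
# Retention settings as Helm values (sketch) - key paths vary by chart version
# values for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 5d          # time-based retention
    retentionSize: 15GB    # size-based cap; oldest blocks are dropped first
---
# values for loki (the 10GB cap is assumed to be enforced via the PVC size)
loki:
  limits_config:
    retention_period: 168h   # 7 days
  compactor:
    retention_enabled: true
```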

Alerting is handled by Alertmanager with Zulip integration for critical alerts (node health, pod crashes, disk space, PostgreSQL issues).
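
A minimal sketch of the corresponding Alertmanager route, assuming Zulip's Alertmanager incoming-webhook integration - the bot API key, stream, and severity label are placeholders:

```yaml
# alertmanager.yml (excerpt, sketch) - API key, stream, and severity label are placeholders
route:
  receiver: zulip
  routes:
    - matchers:
        - severity="critical"
      receiver: zulip
receivers:
  - name: zulip
    webhook_configs:
      - url: "https://zulip.minnova.io/api/v1/external/alertmanager?api_key=<bot-api-key>&stream=alerts"
        send_resolved: true
```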