# Infrastructure Architecture
This document describes the Hetzner infrastructure where Minnova runs its internal tooling. The architecture prioritizes security (no inbound public access on application servers) and cost efficiency.
## Servers
All servers run Debian. The setup uses three servers with distinct roles:
| Server | Specs | IPs / Access | Services |
|---|---|---|---|
| headscale | CX22 (2 vCPU, 4GB RAM) | Public IPv4 159.69.247.237 | Headscale VPN coordinator, Traefik reverse proxy (Podman) |
| bastion | CX22 (2 vCPU, 4GB RAM) | Private 10.0.1.1 + VPN 100.64.0.6 (public IPv4 outbound only) | Tailscale client, SSH jump host |
| apps | CX53 (16 vCPU, 32GB RAM) | Private 10.0.2.1 (public IPv4 outbound only) | K3s cluster (cloudflared tunnel, Traefik, Authentik, ArgoCD, Grafana/Prometheus/Loki, Portainer, Homepage, Gatus/status, Forgejo, Nextcloud, Umami, Zulip, Kimai, Hoop, Glance) |
The headscale server is the only one intended to accept inbound traffic from the public internet (ports 80/443). It runs Traefik as a reverse proxy to handle TLS termination for the Headscale service.
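To make the TLS termination concrete, here is a minimal sketch of a Traefik dynamic configuration (file provider) that could route headscale.minnova.io to the local Headscale container. The entrypoint name, certificate resolver, and Headscale's local port are assumptions, not taken from this document.

```yaml
# Hypothetical Traefik dynamic config on the headscale host (file provider).
# Entrypoint name, cert resolver, and the backend port 8080 are assumptions.
http:
  routers:
    headscale:
      rule: "Host(`headscale.minnova.io`)"
      entryPoints:
        - websecure
      service: headscale
      tls:
        certResolver: letsencrypt
  services:
    headscale:
      loadBalancer:
        servers:
          - url: "http://127.0.0.1:8080"
```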
The bastion server acts as an SSH jump host. It sits on the private network but is also connected to the Tailscale mesh network (100.64.0.6). Admins SSH to the bastion first, then hop to other servers. This keeps SSH access off the public internet entirely.
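As an illustration of the jump-host pattern, the hypothetical Ansible task below writes a ProxyJump entry into an admin's SSH client config. The bastion VPN address (100.64.0.6) and the apps private IP (10.0.2.1) come from the table above; the host aliases, usernames, and config path are assumptions.

```yaml
# Hypothetical Ansible task: add a ProxyJump entry to an admin workstation.
- name: Configure SSH access via the bastion jump host
  ansible.builtin.blockinfile:
    path: ~/.ssh/config
    create: true
    marker: "# {mark} minnova bastion access"
    block: |
      Host bastion
          HostName 100.64.0.6
          User admin
      Host apps
          HostName 10.0.2.1
          User admin
          ProxyJump bastion
```

With an entry like this, `ssh apps` transparently hops through the bastion over the VPN.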
The apps server hosts internal services on a single-node K3s cluster. Public IPv4 is enabled for outbound-only traffic (updates, tunnel egress), but inbound access is blocked by firewall rules. Web traffic reaches apps through Cloudflare Tunnel → Traefik, and SSH access goes through the bastion.
## Services
Web services are exposed through Cloudflare Tunnel with automatic HTTPS. The tunnel runs cloudflared on the apps server, which creates outbound-only connections to Cloudflare's edge. This means no inbound ports need to be opened - Cloudflare routes requests through the tunnel to the local services.
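A cloudflared configuration for this pattern might look roughly like the sketch below; the tunnel name, credentials path, and the in-cluster Traefik address are assumptions, while the public hostnames correspond to the services table that follows.

```yaml
# Hypothetical cloudflared config for the tunnel on the apps cluster.
# Tunnel name, credentials path, and the Traefik service address are assumptions.
tunnel: minnova-apps
credentials-file: /etc/cloudflared/credentials.json
ingress:
  # Hand every public hostname to the in-cluster Traefik, which does
  # the per-service routing.
  - hostname: "*.minnova.io"
    service: https://traefik.kube-system.svc.cluster.local:443
    originRequest:
      noTLSVerify: true
  # Catch-all: anything that doesn't match gets a 404.
  - service: http_status:404
```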
| Service | URL | Purpose |
|---|---|---|
| Authentik | auth.minnova.io | Single Sign-On provider for internal services |
| Headscale | headscale.minnova.io | VPN coordinator - allows Tailscale clients to join |
| Grafana | grafana.minnova.io | Dashboards for metrics and logs |
| Prometheus | prometheus.minnova.io | Metrics collection (5-day retention, 15GB limit) |
| Loki | (internal) | Log aggregation (7-day retention) |
| ArgoCD | argocd.minnova.io | GitOps controller for cluster apps |
| Portainer | portainer.minnova.io | K3s management UI |
| Forgejo | forgejo.minnova.io | Self-hosted Git with container registry |
| Gatus | status.minnova.io | Status page / health checks |
| Traefik | traefik.minnova.io | Ingress controller dashboard |
| Nextcloud | nextcloud.minnova.io | File sharing and collaboration |
| Homepage | homepage.minnova.io | Internal landing / service directory |
| Glance | glance.minnova.io | Dashboard with feeds and widgets |
| Umami | analytics.minnova.io | Privacy-focused web analytics |
| Zulip | zulip.minnova.io | Team chat and communication |
| Kimai | kimai.minnova.io | Time tracking |
| Hoop | hoop.minnova.io | Secure database/server access gateway |
The Oracle knowledge base runs on Cloudflare Pages (with Access), not on the apps cluster.
Authentik is the identity provider for all services. When you log into Headscale or Grafana, you're actually authenticating against Authentik via OIDC. This centralizes user management - add someone to Authentik and they can access all integrated services.
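As an example of such an integration, the hedged Helm-values sketch below wires Grafana to Authentik through generic OAuth. It assumes the Grafana chart's `grafana.ini` passthrough and an Authentik application slug of `grafana`; the endpoint paths follow Authentik's usual OAuth2 provider layout, and the client credentials are placeholders.

```yaml
# Hypothetical Grafana Helm values for OIDC login via Authentik.
grafana.ini:
  auth.generic_oauth:
    enabled: true
    name: Authentik
    client_id: grafana
    client_secret: REPLACE_ME   # in practice injected from a SOPS-managed secret
    scopes: openid profile email
    auth_url: https://auth.minnova.io/application/o/authorize/
    token_url: https://auth.minnova.io/application/o/token/
    api_url: https://auth.minnova.io/application/o/userinfo/
```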
Headscale is the exception to the Cloudflare Tunnel pattern. The Tailscale control protocol upgrades a plain HTTP POST to a long-lived WebSocket-style connection, which the tunnel doesn't support, so Headscale needs a real public IP. It is the coordination server that Tailscale clients connect to when joining the VPN.
Grafana provides dashboards for monitoring. It pulls metrics from Prometheus and logs from Loki, giving visibility into server health and application behavior.
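The wiring between Grafana and its backends is plain datasource provisioning; a minimal sketch, assuming in-cluster service names that are not documented here:

```yaml
# Hypothetical Grafana datasource provisioning file.
# The in-cluster service URLs are assumptions.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```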
## Architecture Diagram
```mermaid
flowchart TB
    subgraph Internet
        Users([Users])
        Admin([Admin])
    end
    subgraph Cloudflare
        Tunnel[Tunnel]
    end
    subgraph Hetzner["Hetzner Cloud"]
        HS[Headscale<br/>159.69.247.237]
        subgraph Private["Private Network 10.0.0.0/16"]
            B[bastion<br/>10.0.1.1]
            Apps[apps<br/>10.0.2.1]
        end
    end
    Users --> Tunnel --> Apps
    Admin --> HS --> B -->|SSH| Apps
```
## Traffic Paths
There are two distinct paths for accessing infrastructure:
| Path | Use Case | Flow |
|---|---|---|
| Public | Web services (Authentik, Grafana, etc.) | Internet → Cloudflare Tunnel → cloudflared on apps → Traefik → K3s service |
| Admin | SSH access for maintenance | Tailscale client → Headscale (OIDC auth) → Bastion (100.64.0.6) → SSH to 10.0.x.x |
The admin path requires Authentik authentication via OIDC before Headscale grants VPN access. This ensures only authorized team members can SSH into servers.
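On the Headscale side, that requirement lives in the `oidc` section of its config.yaml. A hedged sketch, assuming an Authentik application slug of `headscale` and a group-based restriction; key names follow recent Headscale releases, and the group name is an assumption.

```yaml
# Hypothetical oidc section of Headscale's config.yaml.
oidc:
  only_start_if_oidc_is_available: true
  issuer: https://auth.minnova.io/application/o/headscale/
  client_id: headscale
  client_secret_path: /etc/headscale/oidc_client_secret
  scope: ["openid", "profile", "email"]
  allowed_groups:
    - infra-admins   # hypothetical Authentik group
```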
## Stack
| Category | Tools |
|---|---|
| IaC | OpenTofu, Ansible, SOPS |
| Orchestration | K3s (apps), Podman + systemd (headscale) |
| Identity | Authentik (OIDC) |
| VPN | Headscale + Tailscale |
| Ingress | Cloudflare Tunnel → Traefik |
| Observability | Grafana, Prometheus, Loki |
| Analytics | Umami (self-hosted) |
| Security | CrowdSec (IDS/IPS) |
Infrastructure is defined in two layers. OpenTofu (an open-source Terraform fork) manages cloud resources - servers, networks, firewalls, DNS records, GitHub repositories. Ansible configures the servers themselves - installing packages, deploying containers, managing configuration files.
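The OpenTofu layer is written in HCL and not shown here; for the Ansible layer, a hypothetical playbook excerpt for the headscale host could look like this, with the role names and their split of responsibilities as assumptions:

```yaml
# Hypothetical Ansible playbook excerpt for the headscale host.
- hosts: headscale
  become: true
  roles:
    - base       # packages, users, SSH hardening
    - podman     # rootless Podman + systemd units
    - headscale  # Headscale config and container
    - traefik    # reverse proxy with TLS termination
```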
Secrets are handled by SOPS with Age encryption. Secret files live in Git but are encrypted, so they can be version controlled without exposing credentials. For Kubernetes workloads, the SOPS Secrets Operator decrypts secrets in-cluster at deploy time.
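A minimal sketch of what the repository's SOPS rules could look like; the path pattern and the Age public key are placeholders.

```yaml
# Hypothetical .sops.yaml at the repository root.
creation_rules:
  - path_regex: .*/secrets/.*\.ya?ml$
    encrypted_regex: ^(data|stringData)$   # only encrypt Kubernetes secret payloads
    age: age1examplepublickeyxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```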
Standalone containers (e.g., Traefik on headscale) run under Podman instead of Docker. The main advantage is rootless operation - containers run as a regular user rather than root, reducing the blast radius if a container is compromised. K3s workloads on apps are managed via Kubernetes manifests and Helm.
The observability stack (Grafana, Prometheus, Loki) runs on the apps server. Prometheus scrapes metrics from Node Exporter on each server, while Loki aggregates logs shipped by Alloy. Grafana provides a unified interface to query both. Retention settings (see the sketch after this list):
- Prometheus: 5-day retention with 15GB size limit (auto-compacts when limit reached)
- Loki: 7-day retention with 10GB limit
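Expressed as Helm values, those settings might look like the sketch below; the structure assumes kube-prometheus-stack and the Loki chart, and key names can differ between chart versions.

```yaml
# Hypothetical Helm values capturing the retention settings above.
prometheus:
  prometheusSpec:
    retention: 5d
    retentionSize: 15GB
loki:
  limits_config:
    retention_period: 168h   # 7 days
  compactor:
    retention_enabled: true
```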
Alerting is handled by Alertmanager with Zulip integration for critical alerts (node health, pod crashes, disk space, PostgreSQL issues).
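A hedged sketch of that routing, assuming Zulip's Alertmanager webhook integration; the bot API key, stream name, and exact URL parameters are assumptions.

```yaml
# Hypothetical Alertmanager route and receiver forwarding critical alerts to Zulip.
route:
  receiver: zulip-critical
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: zulip-critical
receivers:
  - name: zulip-critical
    webhook_configs:
      - url: "https://zulip.minnova.io/api/v1/external/alertmanager?api_key=BOT_API_KEY&stream=alerts"
        send_resolved: true
```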