Kubernetes Cluster Architecture

Source: Marc Mercer (SRE Lead) — sre-iac repository, Rev 1.0, 2026-02-24

Overview

Anshin Health operates three separate Kubernetes clusters (dev, prod, and infra) rather than a single shared cluster.

Why multi-cluster?

Blast radius containment — misconfiguration in dev cannot affect prod or infra
Independent lifecycle management — each cluster upgrades on its own schedule
Resource isolation via Keystone — each cluster lives in its own OpenStack project with hard quotas
Security boundary alignment — production workloads are separated from development experimentation

Why EKS Anywhere?

EKS Anywhere runs the same EKS Distro that powers AWS EKS, not a fork or reimplementation, so kubectl commands, RBAC, and API server behavior are identical. The AWS Load Balancer Controller, cluster autoscaler, external-dns, and other EKS-compatible tooling are used the same way, with OpenStack services (Octavia, Nova, Designate) standing in for their AWS backends. If Anshin Health ever needs to move workloads to AWS EKS, manifests and Helm charts should transfer with changes limited to provider-specific settings such as storage classes and load balancer or DNS backends.

Cluster Definitions

Cluster	Purpose	Keystone Project	Workload Profile
dev	Development workloads, experimentation, CI builds	dev	Ephemeral, bursty, tolerant of disruption
prod	Production services: ERPNext, customer-facing applications	prod	Stable, monitored, change-controlled
infra	Shared infrastructure: GitLab, Infisical, CI runners, monitoring	infra	Always-on, elevated privileges, platform-critical

The nodes of all three clusters run as Nova virtual machines on the OpenStack platform.

Isolation Model

Isolation between clusters operates at multiple layers:

OpenStack project isolation — independent resource quotas (vCPU, RAM, storage, floating IPs) per cluster
Network isolation — each cluster's nodes are on separate Neutron networks; inter-cluster communication goes through load balancer endpoints
Storage isolation — each cluster's PVCs are provisioned in its own Cinder volume set, backed by Ceph
Identity isolation — cluster admins have roles in their specific Keystone project only

Needs Input

Node sizing per cluster — specific vCPU, RAM, and disk allocations per node type (control plane vs. worker) for dev, prod, and infra clusters are pending.

EKS Anywhere Platform

API and Tooling Compatibility

Component	AWS EKS	EKS Anywhere	Compatible?
kubectl	Yes	Yes	Identical
Helm	Yes	Yes	Identical
Kubernetes API	v1.28+	v1.28+	Same versions
AWS Load Balancer Controller	ALB/NLB	Octavia	Yes, with Octavia backend
Cluster Autoscaler	EC2 ASG	Nova	Yes, with OpenStack provider
External DNS	Route53	Designate	Yes, with Designate provider
CSI Drivers	EBS CSI	Cinder CSI	Yes, with Cinder backend

Needs Input

Cluster networking model (Cilium, Calico, or EKS default) is a pending decision.

Service Placement Matrix

Service	Cluster	Rationale
ERPNext	prod	Customer-facing business application
Customer-facing applications	prod	Production traffic, SLA-bound
GitLab	infra	Shared CI/CD, needs stability independent of prod deploys
Infisical	infra	Secrets management, must be available for all clusters
CI Runners	infra	Build workloads should not compete with production resources
Monitoring (kube-prometheus-stack)	infra	Observability must survive production incidents
Docker Hub Cache	infra	Shared image caching for all clusters
Authentik SSO	infra	Identity provider, shared service
external-dns	Per-cluster	Each cluster manages its own DNS records
cert-manager	Per-cluster	Standard per-cluster pattern
MetalLB	N/A	Replaced by Octavia via AWS Load Balancer Controller

Storage Integration

Storage Classes (Cinder CSI)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-ssd
provisioner: cinder.csi.openstack.org
parameters:
  type: ssd    # Maps to Cinder volume type backed by Ceph SSD pool
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-hdd
provisioner: cinder.csi.openstack.org
parameters:
  type: hdd    # Maps to Cinder volume type backed by Ceph HDD pool
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
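
To illustrate how a workload might consume one of these classes, the following is a minimal sketch of a PersistentVolumeClaim that requests a volume from cinder-ssd. The claim name, namespace, and size are hypothetical and not taken from the sre-iac repository. Because the class uses WaitForFirstConsumer, the claim stays Pending and no Cinder volume is created until a pod that references it is scheduled.

```yaml
# Hypothetical claim: name, namespace, and size are illustrative only.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: erpnext-db-data
  namespace: erpnext
spec:
  accessModes:
    - ReadWriteOnce              # single-node read/write access
  storageClassName: cinder-ssd   # storage class defined above
  resources:
    requests:
      storage: 20Gi
```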

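The Cinder storage classes above are typically used for single-writer (ReadWriteOnce) volumes. For ReadWriteMany (RWX) access, CephFS CSI provides shared filesystem volumes that multiple pods can mount at the same time.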
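
A shared volume request might look like the sketch below. The storage class name cephfs is an assumption, since this document does not define a CephFS storage class; the claim name, namespace, and size are also hypothetical.

```yaml
# Hypothetical RWX claim: "cephfs" is an assumed storage class name,
# not one defined in this document.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
  namespace: demo
spec:
  accessModes:
    - ReadWriteMany              # mountable read/write by pods on several nodes
  storageClassName: cephfs
  resources:
    requests:
      storage: 50Gi
```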

Needs Input

Specific storage class definitions are pending, including Cinder volume type names, the default storage class per cluster, reclaim policies per environment, and volume snapshot class configuration.

Networking

Load Balancing via Octavia

When a Kubernetes Service of type LoadBalancer is created, the AWS Load Balancer Controller:

Creates an Octavia load balancer in the cluster's Keystone project
Configures listeners and pools matching the Service spec
Registers worker node IPs as pool members
Assigns a floating IP to the load balancer VIP
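
As a sketch of what triggers the steps above, the manifest below declares a plain Service of type LoadBalancer. The service name, namespace, labels, and ports are hypothetical. Any controller-specific annotations are omitted because this document does not specify them.

```yaml
# Hypothetical Service: names, labels, and ports are illustrative only.
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
  namespace: demo
spec:
  type: LoadBalancer       # requests an external load balancer for this Service
  selector:
    app: web-frontend      # pods that receive the traffic
  ports:
    - name: https
      protocol: TCP
      port: 443            # port exposed on the load balancer
      targetPort: 8443     # container port on the selected pods
```

Under the flow described above, the port entry would correspond to an Octavia listener and pool, and the address reported in the Service status would be the floating IP assigned to the VIP.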

DNS via Designate + external-dns

Each cluster runs external-dns with the Designate provider. When Ingress resources or annotated Services are created, external-dns automatically creates DNS records in Designate, which is backed by FreeIPA DNS on the anshinhealth.net domain.
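
For an annotated Service, the record is usually requested with the external-dns hostname annotation. The sketch below assumes a hypothetical hostname under anshinhealth.net and a hypothetical Service; only the annotation key is standard external-dns behavior.

```yaml
# Hypothetical Service: the hostname and names are illustrative only.
apiVersion: v1
kind: Service
metadata:
  name: demo-app
  namespace: demo
  annotations:
    # Asks external-dns to publish this name for the Service's external address.
    external-dns.alpha.kubernetes.io/hostname: demo.anshinhealth.net
spec:
  type: LoadBalancer
  selector:
    app: demo-app
  ports:
    - port: 80
      targetPort: 8080
```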

Migration from K3s

Current State (K3s on Proxmox LXC)

Parameter	Current (K3s)
Distribution	K3s
Nodes	6 LXC containers on bare-metal Proxmox hosts
Load balancing	MetalLB (L2 mode) — VIP: 10.10.98.40
DNS	external-dns
Monitoring	kube-prometheus-stack
SSO	Authentik

Services to Migrate

Service	Current State	Target Cluster	Priority
ERPNext	Running	prod	High — business-critical
Authentik SSO	Running	infra	High — dependency for other services
external-dns	Running	Per-cluster	High — needed for DNS automation
Monitoring (kube-prometheus-stack)	Running	infra	High — observability
Infisical	Running	infra	High — secrets dependency
Docker Hub Cache	Running	infra	Medium — operational convenience
MetalLB	Running	N/A	N/A — architecture change (Octavia replaces)

Target State

Parameter	Target (EKS Anywhere)
Distribution	EKS Anywhere (EKS Distro)
Clusters	3 (dev, prod, infra)
Nodes	Nova VMs on OpenStack
Load balancing	Octavia via AWS Load Balancer Controller
DNS	Designate + external-dns
Monitoring	kube-prometheus-stack (infra cluster)
SSO	Authentik (infra cluster)

Recommended Migration Order

Needs Input

Final migration order has not been confirmed. Recommended sequence:

Infra cluster first — GitLab, Infisical, monitoring, Authentik (foundational services)
Prod cluster — ERPNext and customer-facing applications (requires stable infra)
Dev cluster last — most tolerant of disruption, can remain on K3s longest
K3s decommission — once all services validated on EKS Anywhere

Migration Considerations

PVC data: Must migrate from current storage backend to Cinder/Ceph-backed PVCs (Velero or export/import)
DNS cutover: Service DNS records update to new Octavia load balancer IPs via external-dns
Secret migration: Secrets in the current Infisical deployment must be migrated to the new deployment and verified there
ERPNext downtime: Requires maintenance window for database migration and DNS propagation
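
If Velero is chosen for the PVC data move, a backup taken on the existing K3s cluster might be declared as in the sketch below. The backup name and application namespace are hypothetical, the velero namespace is assumed to be where Velero is installed, and the defaultVolumesToFsBackup field assumes a Velero release that supports it (1.10 or later).

```yaml
# Hypothetical Velero Backup for one namespace on the existing K3s cluster.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: erpnext-pre-migration   # hypothetical name
  namespace: velero             # assumed Velero install namespace
spec:
  includedNamespaces:
    - erpnext                   # hypothetical application namespace
  # Copy volume contents at the file level instead of relying on
  # storage-provider snapshots, which are tied to the source backend.
  defaultVolumesToFsBackup: true
  ttl: 720h0m0s                 # keep the backup for 30 days
```

On restore, the claims would need to reference one of the Cinder-backed storage classes on the target cluster. How that mapping is handled is not specified in this document.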

Document Control

Rev	Date	Author	Description
1.0	2026-02-24	Marc Mercer	Initial release
