Kubernetes Cluster Architecture

Source: Marc Mercer (SRE Lead) — sre-iac repository, Rev 1.0, 2026-02-24

Overview

Anshin Health operates three separate Kubernetes clusters rather than a single shared cluster.

Why multi-cluster?

  • Blast radius containment — misconfiguration in dev cannot affect prod or infra
  • Independent lifecycle management — each cluster upgrades on its own schedule
  • Resource isolation via Keystone — each cluster lives in its own OpenStack project with hard quotas
  • Security boundary alignment — production workloads are separated from development experimentation

Why EKS Anywhere?

EKS Anywhere runs the same EKS Distro that powers AWS EKS; it is not a fork or reimplementation. kubectl commands, RBAC, and API server behavior are identical. The AWS Load Balancer Controller, cluster autoscaler, external-dns, and other EKS-compatible tooling work the same way from the cluster's point of view, with OpenStack backends as listed under API and Tooling Compatibility. If Anshin Health ever needs to move workloads to AWS EKS, manifests and Helm charts transfer directly.

Cluster Definitions

| Cluster | Purpose | Keystone Project | Workload Profile |
|---|---|---|---|
| dev | Development workloads, experimentation, CI builds | dev | Ephemeral, bursty, tolerant of disruption |
| prod | Production services: ERPNext, customer-facing applications | prod | Stable, monitored, change-controlled |
| infra | Shared infrastructure: GitLab, Infisical, CI runners, monitoring | infra | Always-on, elevated privileges, platform-critical |

All three clusters run as Nova virtual machines on the OpenStack platform.

Isolation Model

Isolation between clusters operates at multiple layers:

  1. OpenStack project isolation — independent resource quotas (vCPU, RAM, storage, floating IPs) per cluster
  2. Network isolation — each cluster's nodes are on separate Neutron networks; inter-cluster communication goes through load balancer endpoints
  3. Storage isolation — each cluster's PVCs are provisioned in its own Cinder volume set, backed by Ceph
  4. Identity isolation — cluster admins have roles in their specific Keystone project only

Needs Input

Node sizing per cluster — specific vCPU, RAM, and disk allocations per node type (control plane vs. worker) for dev, prod, and infra clusters are pending.

EKS Anywhere Platform

API and Tooling Compatibility

| Component | AWS EKS | EKS Anywhere | Compatible? |
|---|---|---|---|
| kubectl | Yes | Yes | Identical |
| Helm | Yes | Yes | Identical |
| Kubernetes API | v1.28+ | v1.28+ | Same versions |
| AWS Load Balancer Controller | ALB/NLB | Octavia | Yes, with Octavia backend |
| Cluster Autoscaler | EC2 ASG | Nova | Yes, with OpenStack provider |
| External DNS | Route53 | Designate | Yes, with Designate provider |
| CSI Drivers | EBS CSI | Cinder CSI | Yes, with Cinder backend |

Needs Input

Cluster networking model (Cilium, Calico, or EKS default) is a pending decision.

Service Placement Matrix

| Service | Cluster | Rationale |
|---|---|---|
| ERPNext | prod | Customer-facing business application |
| Customer-facing applications | prod | Production traffic, SLA-bound |
| GitLab | infra | Shared CI/CD, needs stability independent of prod deploys |
| Infisical | infra | Secrets management, must be available for all clusters |
| CI Runners | infra | Build workloads should not compete with production resources |
| Monitoring (kube-prometheus-stack) | infra | Observability must survive production incidents |
| Docker Hub Cache | infra | Shared image caching for all clusters |
| Authentik SSO | infra | Identity provider, shared service |
| external-dns | Per-cluster | Each cluster manages its own DNS records |
| cert-manager | Per-cluster | Standard per-cluster pattern |
| MetalLB | N/A | Replaced by Octavia via AWS Load Balancer Controller |

Storage Integration

Storage Classes (Cinder CSI)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-ssd
provisioner: cinder.csi.openstack.org
parameters:
  type: ssd  # Maps to Cinder volume type backed by Ceph SSD pool
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cinder-hdd
provisioner: cinder.csi.openstack.org
parameters:
  type: hdd  # Maps to Cinder volume type backed by Ceph HDD pool
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

For ReadWriteMany (RWX) access, the CephFS CSI driver provides shared filesystem volumes that multiple pods can mount at the same time.
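
To illustrate how a workload would consume one of the Cinder storage classes above, here is a minimal sketch of a PersistentVolumeClaim that requests a volume from cinder-ssd. The claim name, namespace, and size are hypothetical placeholders, not values taken from the sre-iac repository.

```yaml
# Hypothetical example: claim name, namespace, and size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: erpnext-db-data          # hypothetical claim name
  namespace: erpnext             # hypothetical namespace
spec:
  accessModes:
    - ReadWriteOnce              # single-node read-write access
  storageClassName: cinder-ssd   # storage class defined in this section
  resources:
    requests:
      storage: 20Gi              # placeholder size
```

Because both classes set volumeBindingMode: WaitForFirstConsumer, the volume is not provisioned until a pod that uses the claim has been scheduled.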

Needs Input

Specific storage class definitions are pending, including Cinder volume type names, the default storage class per cluster, reclaim policies per environment, and volume snapshot class configuration.

Networking

Load Balancing via Octavia

When a Kubernetes Service of type LoadBalancer is created, the AWS Load Balancer Controller:

  1. Creates an Octavia load balancer in the cluster's Keystone project
  2. Configures listeners and pools matching the Service spec
  3. Registers worker node IPs as pool members
  4. Assigns a floating IP to the load balancer VIP
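
The flow above starts from an ordinary Service manifest. The following is a minimal sketch of such a Service; the name, namespace, selector label, and ports are hypothetical, and no controller-specific annotations are shown because the source does not list any.

```yaml
# Hypothetical example: name, namespace, labels, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: erpnext-web              # hypothetical service name
  namespace: erpnext             # hypothetical namespace
spec:
  type: LoadBalancer             # requests an external load balancer for this Service
  selector:
    app: erpnext                 # hypothetical pod label
  ports:
    - name: http
      protocol: TCP
      port: 80                   # port exposed on the load balancer listener
      targetPort: 8000           # placeholder container port
```

Once provisioning completes, the assigned address is reported under status.loadBalancer.ingress and shown in the EXTERNAL-IP column of kubectl get service.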

DNS via Designate + external-dns

Each cluster runs external-dns with the Designate provider. When Ingress resources or annotated Services are created, external-dns automatically creates DNS records in Designate, which is backed by FreeIPA DNS on the anshinhealth.net domain.
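
As a sketch of the "annotated Service" case, the manifest below uses the standard external-dns hostname annotation. The hostname under anshinhealth.net, the Service name, and the ports are hypothetical, and the example assumes external-dns in the cluster is configured to watch Service resources.

```yaml
# Hypothetical example: hostname, names, and ports are placeholders.
# Assumes external-dns is deployed with Services enabled as a source.
apiVersion: v1
kind: Service
metadata:
  name: erpnext-web              # hypothetical service name
  namespace: erpnext             # hypothetical namespace
  annotations:
    # external-dns reads this annotation and creates a matching DNS record
    external-dns.alpha.kubernetes.io/hostname: erp.anshinhealth.net
spec:
  type: LoadBalancer
  selector:
    app: erpnext                 # hypothetical pod label
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000           # placeholder container port
```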

Migration from K3s

Current State (K3s on Proxmox LXC)

| Parameter | Current (K3s) |
|---|---|
| Distribution | K3s |
| Nodes | 6 bare-metal LXC containers on Proxmox |
| Load balancing | MetalLB (L2 mode), VIP: 10.10.98.40 |
| DNS | external-dns |
| Monitoring | kube-prometheus-stack |
| SSO | Authentik |

Services to Migrate

| Service | Current State | Target Cluster | Priority |
|---|---|---|---|
| ERPNext | Running | prod | High (business-critical) |
| Authentik SSO | Running | infra | High (dependency for other services) |
| external-dns | Running | Per-cluster | High (needed for DNS automation) |
| Monitoring (kube-prometheus-stack) | Running | infra | High (observability) |
| Infisical | Running | infra | High (secrets dependency) |
| Docker Hub Cache | Running | infra | Medium (operational convenience) |
| MetalLB | Running | N/A | N/A (architecture change; Octavia replaces MetalLB) |

Target State

| Parameter | Target (EKS Anywhere) |
|---|---|
| Distribution | EKS Anywhere (EKS Distro) |
| Clusters | 3 (dev, prod, infra) |
| Nodes | Nova VMs on OpenStack |
| Load balancing | Octavia via AWS Load Balancer Controller |
| DNS | Designate + external-dns |
| Monitoring | kube-prometheus-stack (infra cluster) |
| SSO | Authentik (infra cluster) |

Needs Input

Final migration order has not been confirmed. Recommended sequence:

  1. Infra cluster first — GitLab, Infisical, monitoring, Authentik (foundational services)
  2. Prod cluster — ERPNext and customer-facing applications (requires stable infra)
  3. Dev cluster last — most tolerant of disruption, can remain on K3s longest
  4. K3s decommission — once all services validated on EKS Anywhere

Migration Considerations

  • PVC data: Must migrate from current storage backend to Cinder/Ceph-backed PVCs (Velero or export/import)
  • DNS cutover: Service DNS records update to new Octavia load balancer IPs via external-dns
  • Secret migration: Secrets in current Infisical deployment must be verified in new deployment
  • ERPNext downtime: Requires maintenance window for database migration and DNS propagation
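
If Velero is chosen for the PVC data migration, a backup taken on the K3s side could look like the sketch below. The backup name and namespace list are hypothetical. It assumes a Velero release that supports defaultVolumesToFsBackup, installed in the velero namespace on both the source and target clusters, with both pointing at the same backup storage location. File-system backup is used here because volume snapshots from the current storage backend would not be restorable onto Cinder.

```yaml
# Hypothetical example: backup name and namespace list are placeholders.
# Assumes Velero is installed in the "velero" namespace with file-system backup enabled.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: erpnext-pre-migration    # hypothetical backup name
  namespace: velero
spec:
  includedNamespaces:
    - erpnext                    # hypothetical namespace holding the workload and its PVCs
  defaultVolumesToFsBackup: true # copy volume contents at file level instead of taking snapshots
  ttl: 720h0m0s                  # keep the backup for 30 days
```

Restoring this backup on the target cluster recreates the PVCs there; mapping the old storage class name to cinder-ssd or cinder-hdd would need to be handled as part of the restore and is not shown.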

Document Control

| Rev | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-02-24 | Marc Mercer | Initial release |