
Architecture Overview

Source: Marc Mercer (SRE Lead) — sre-iac repository + Bryan Lee (Engineering Director), Rev 2.0, 2026-03-22

Current vs Future State

This document covers three distinct states of the Anshin Health infrastructure. Sections are clearly labeled [CURRENT STATE], [FUTURE STATE], or [PLANNED] so it is always clear what is running today versus what is being built toward.


Vision

Anshin Health operates a private cloud infrastructure built on OpenStack with AWS-parity service abstractions. The platform provides the same compute, storage, networking, and container orchestration primitives found in AWS — same Terraform patterns, same EKS distribution, same subnet architecture — so that workloads developed here could move to AWS if the business requires it. OpenStack is the production target platform. AWS parity is a design goal that preserves future optionality, not a migration plan.

This approach delivers full infrastructure ownership, zero licensing costs, and the confidence that architectural decisions made today will not become technical debt if cloud requirements change.

Compliance Posture

Anshin Health operates in the healthcare domain. While the staging environment does not process real Protected Health Information (PHI) — all patient data is synthetic, generated via Synthea — the infrastructure is designed to align with SOC 2 Type II and HIPAA Security Rule controls. This ensures that architectural patterns, access controls, and operational procedures established in staging can be carried forward to production environments that do handle PHI.


Compute Platform — Three Tiers

The Anshin compute platform consists of three distinct tiers, each with a different operational role and deployment timeline:

| Tier | Hardware | Status | Hypervisor | Purpose |
| --- | --- | --- | --- | --- |
| Proxmox Dev Platform | HP #1 (ML350 Gen9) | ✅ CURRENT STATE | Proxmox VE | All active K8s, services, development |
| OpenStack Cluster | HP #2, #3 (+ HP #1 eventually, + HP #5) | 🔵 FUTURE STATE | Kolla-Ansible (Docker containers) | Production private cloud |
| Open Air GPU Farm | HP #4 (open-air ML350 configuration) | 🟡 COMING SOON | Proxmox VE | AI inference, LLM serving, fine-tuning |

[CURRENT STATE] Proxmox Development Platform

HP #1 (pmx-01, 10.10.96.5) is the sole active compute node. All Anshin services, Kubernetes workloads, and development infrastructure run here today.

What Is Running on Proxmox Now

pmx-01 (HP #1 — HP ProLiant ML350 Gen9)

├── K8s cluster (K3s v1.32.3 — 6-node on AlmaLinux 9.5 VMs)
│ ├── control-01 (10.10.97.1)
│ ├── control-02 (10.10.98.1)
│ ├── worker-01 (10.10.97.11)
│ ├── worker-02 (10.10.97.12)
│ ├── worker-03 (10.10.98.11)
│ └── worker-04 (10.10.98.12)

├── Application VMs
│ ├── rp-01 (10.10.96.22) — Caddy reverse proxy (14 service upstreams)
│ ├── gitlab-01 (10.10.96.41) — Self-hosted GitLab
│ ├── nb-01 (10.10.96.21) — NetBird VPN gateway
│ ├── db-01 (10.10.98.51) — Standalone PostgreSQL
│ ├── dc-01 (10.10.96.11) — FreeIPA primary domain controller
│ ├── dc-02 (10.10.96.12) — FreeIPA replica domain controller
│ └── deepseek-01 (10.10.96.80) — DeepSeek AI inference (SSH only)

└── Storage: QNAP NAS (qnap-01, 10.10.96.31) — NFS for all K8s PVs

Current network: Flat /20 (10.10.96.0/20), Cisco router as gateway, 3× Juniper EX2200 operating as unmanaged L2 switches. No VLANs yet. SRX320 pending deployment.

MetalLB VIP (ingress-nginx): 10.10.98.40 — all Kubernetes service ingress routes here.
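
Because every host currently shares one flat /20, a quick sanity check on the address plan is cheap. The sketch below uses Python's standard `ipaddress` module and the addresses listed in this section; the dictionary layout and function names are illustrative assumptions, not part of the sre-iac repository.

```python
import ipaddress

# Flat network and addresses copied from this section.
FLAT_NETWORK = ipaddress.ip_network("10.10.96.0/20")

HOSTS = {
    "pmx-01": "10.10.96.5",
    "control-01": "10.10.97.1",
    "control-02": "10.10.98.1",
    "worker-01": "10.10.97.11",
    "worker-02": "10.10.97.12",
    "worker-03": "10.10.98.11",
    "worker-04": "10.10.98.12",
    "rp-01": "10.10.96.22",
    "gitlab-01": "10.10.96.41",
    "nb-01": "10.10.96.21",
    "db-01": "10.10.98.51",
    "dc-01": "10.10.96.11",
    "dc-02": "10.10.96.12",
    "deepseek-01": "10.10.96.80",
    "qnap-01": "10.10.96.31",
    "metallb-vip": "10.10.98.40",
}


def outside_network(hosts, network):
    """Return the names of hosts whose address is not inside `network`."""
    return sorted(
        name for name, addr in hosts.items()
        if ipaddress.ip_address(addr) not in network
    )


def duplicate_addresses(hosts):
    """Return addresses that are assigned to more than one name."""
    seen, dupes = set(), set()
    for addr in hosts.values():
        if addr in seen:
            dupes.add(addr)
        seen.add(addr)
    return sorted(dupes)


if __name__ == "__main__":
    print("first/last address:", FLAT_NETWORK[0], "-", FLAT_NETWORK[-1])
    print("outside the /20:", outside_network(HOSTS, FLAT_NETWORK))
    print("duplicates:", duplicate_addresses(HOSTS))
```

The same check could be rerun against per-VLAN subnets once segmentation is in place, by swapping `FLAT_NETWORK` for the relevant subnet.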


[FUTURE STATE] OpenStack Private Cloud Cluster

HP #2 and HP #3 (currently powered off, iLO accessible) will form the OpenStack cluster. HP #1 will eventually migrate off Proxmox into OpenStack as well, once HP #2 and #3 are stable. A fifth ML350 Gen9 (not yet purchased) will be added as an OpenStack backup/spare node.

OpenStack Design Principles

| Principle | Implementation |
| --- | --- |
| Infrastructure as Code everywhere | Terraform for provisioning, Ansible for configuration, Kolla-Ansible for OpenStack |
| Clean network segmentation | Six functional VLANs across three physical switch planes |
| Compliance-aware by default | SOC 2 / HIPAA access controls and audit logging from day one |
| AWS-parity abstractions | OpenStack services map 1:1 to AWS equivalents |
| Converged infrastructure | Compute and Ceph storage co-located on the same nodes |

OpenStack Services (Target)

| OpenStack Service | AWS Equivalent | Purpose |
| --- | --- | --- |
| Nova | EC2 | Virtual machine provisioning |
| Neutron | VPC / Subnets | Software-defined networking |
| Cinder + Ceph RBD | EBS | Block storage (SSD and HDD tiers) |
| Glance | AMI | VM image management |
| Keystone | IAM | Identity and access control |
| Heat | CloudFormation | Infrastructure orchestration |
| Trove | RDS | Database as a service |
| Octavia | ALB / NLB | Load balancing |
| Designate | Route 53 | DNS management |
| Ceph RadosGW | S3 | Object storage (S3 API) |
| Manila + CephFS | EFS | Shared filesystems |

Ceph Storage (Target)

A converged Ceph cluster will run on HP #2 and HP #3 (and HP #1 when migrated), co-located with OpenStack compute. Two storage tiers:

  • SSD pool: 8× 1TB enterprise SSDs per node. Serves Cinder block volumes for databases, boot disks, high-IOPS workloads.
  • HDD pool: 16× 1TB enterprise HDDs per node. Serves RadosGW object storage, Manila shared filesystems, and bulk capacity.
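
To put rough numbers on the two tiers, the sketch below computes raw and usable capacity per pool. Drive counts and sizes come from the bullets above; the replica count of 3 is an assumption (it is Ceph's default size for replicated pools), since this document does not state the replication or erasure-coding scheme. Filesystem and Ceph overhead are ignored.

```python
# Drive counts and sizes are taken from the text above.
DRIVE_TB = 1
SSD_PER_NODE = 8
HDD_PER_NODE = 16

# ASSUMPTION: 3-way replication (Ceph's default replicated pool size).
# The actual pool layout is not specified in this document.
REPLICAS = 3


def pool_capacity_tb(drives_per_node, nodes, replicas=REPLICAS, drive_tb=DRIVE_TB):
    """Return (raw_tb, usable_tb) for a replicated pool, ignoring overhead."""
    raw = drives_per_node * drive_tb * nodes
    return raw, raw / replicas


if __name__ == "__main__":
    # 2 nodes = HP #2 and HP #3; 3 nodes = after HP #1 migrates.
    for nodes in (2, 3):
        ssd_raw, ssd_usable = pool_capacity_tb(SSD_PER_NODE, nodes)
        hdd_raw, hdd_usable = pool_capacity_tb(HDD_PER_NODE, nodes)
        print(f"{nodes} nodes: SSD {ssd_raw} TB raw / {ssd_usable:.1f} TB usable, "
              f"HDD {hdd_raw} TB raw / {hdd_usable:.1f} TB usable")
```

With only two nodes, a 3-replica pool with a per-host failure domain cannot place all replicas on distinct hosts, so the two-node figures are best read as an upper bound under this assumption rather than a deployable layout.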

Kubernetes on OpenStack (Target)

EKS Anywhere on Nova VMs — multi-cluster topology:

| Cluster | Purpose |
| --- | --- |
| dev | Application development and CI |
| prod | Production-grade application hosting |
| infra | Platform services (Infisical, GitLab, monitoring) |

HP #5 — OpenStack Backup Node [PLANNED]

A fifth HP ProLiant ML350 Gen9 with identical specifications to HP #1–#3 (2× E5-2697 v4, 512GB RAM, full storage complement) will be purchased once the OpenStack cluster is fully operational. This node provides spare capacity, resiliency, and a maintenance window target so rolling updates can be performed without reducing cluster capacity.


[FUTURE STATE] Open Air GPU Farm

HP #4 is an HP ProLiant ML350 Gen9 that has been stripped of its outer chassis and mounted in an open-air rack frame with 12 active cooling fans. The ML350 motherboard, CPUs, RAM, and PSUs are all in use — only the enclosure has been removed. This configuration allows mounting NVIDIA consumer/prosumer GPUs externally via PCIe risers, which would not fit inside the standard ML350 tower form factor.

This node runs Proxmox VE independently and is not part of the OpenStack cluster. No GPUs will reside in OpenStack nodes.

GPU Configuration

| Slot | GPU | VRAM | Architecture | Initial Role |
| --- | --- | --- | --- | --- |
| 1 | NVIDIA RTX 8000 | 48GB GDDR6 | Turing (TU102) | Large model inference (70B+ params) |
| 2 | NVIDIA RTX 3090 | 24GB GDDR6X | Ampere (GA102) | Mid-size models, embeddings, ASR/TTS |
| 3 | NVIDIA RTX 3090 | 24GB GDDR6X | Ampere (GA102) | Fine-tuning, batch inference, RAG |
| 4 | RESERVED | TBD | TBD | To be determined post go-live |

Total VRAM (initial): 96GB (48 + 24 + 24). The 4th GPU will be selected based on which models and workloads are in highest demand once the farm is operational.
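
As a rough guide to what fits on each card, the sketch below estimates the size of model weights alone at a given precision. The quantization levels are assumptions chosen for illustration; the estimate deliberately ignores KV cache, activations, and runtime overhead, so real headroom will be smaller than these numbers suggest.

```python
# VRAM per card, from the GPU Configuration table.
GPU_VRAM_GB = {"RTX 8000": 48, "RTX 3090 #1": 24, "RTX 3090 #2": 24}


def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def weights_fit(params_billion, bits_per_weight, vram_gb):
    """True if the weights alone are smaller than the card's VRAM.

    This is a necessary condition only: KV cache and runtime overhead
    are not counted.
    """
    return weight_footprint_gb(params_billion, bits_per_weight) < vram_gb


if __name__ == "__main__":
    print("total VRAM:", sum(GPU_VRAM_GB.values()), "GB")
    # ASSUMPTION: 16-bit and 4-bit are illustrative precisions only.
    for params, bits in ((70, 16), (70, 4), (8, 16)):
        size = weight_footprint_gb(params, bits)
        print(f"{params}B @ {bits}-bit: {size:.0f} GB of weights")
```

Under these assumptions a 70B-parameter model needs quantization to run on the single 48GB card, while an 8B model at 16-bit leaves some room on a 24GB card.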

OS storage: 1TB USB SSD (Proxmox VE OS and VM disks) + 64GB USB thumbdrive (recovery ISO / backup).

Each GPU gets its own dedicated Proxmox VM with PCIe passthrough (VT-d), providing near-native GPU performance and complete workload isolation.

HP #4 — Open Air GPU Farm (Proxmox VE)

├── gpu-vm-01 (RTX 8000 — 48GB)
│ ├── vCPUs: 16, RAM: 48GB
│ ├── Services: Ollama (large models), vLLM
│ ├── Models: Llama 3 70B, Qwen 72B, Mistral Large, DeepSeek 67B
│ └── Role: Primary LLM inference — largest context windows

├── gpu-vm-02 (RTX 3090 #1 — 24GB)
│ ├── vCPUs: 8, RAM: 40GB
│ ├── Services: Ollama (mid-size models), Faster-Whisper (ASR), Coqui/Kokoro TTS
│ ├── Models: Llama 3 8B, Phi-3, Mistral 7B, embedding models
│ └── Role: Real-time speech services + mid-size inference

├── gpu-vm-03 (RTX 3090 #2 — 24GB)
│ ├── vCPUs: 8, RAM: 40GB
│ ├── Services: Axolotl / Unsloth (fine-tuning), batch inference
│ ├── Models: Fine-tuning base: Llama 3 8B, Phi-3
│ └── Role: Fine-tuning pipelines, RAG, batch workloads

└── gpu-vm-04 (RESERVED, 4th GPU slot)
  └── Configuration TBD after go-live based on observed demand

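RAM allocation rationale: 128GB total − OS overhead (~4GB) = 124GB usable. Planned VM allocation: 48 + 40 + 40 = 128GB, about 4GB more than is usable. The plan assumes the Proxmox balloon driver will reclaim RAM that the VMs are not actively using. VMs with PCIe passthrough generally have their guest memory pinned, which can limit what ballooning reclaims, so this assumption should be verified. Adjust allocations based on observation after go-live.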
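
The RAM budget can be written out as a small check. The figures are the ones given in this section; the function and key names are illustrative.

```python
# Figures from this section.
HOST_RAM_GB = 128
OS_OVERHEAD_GB = 4  # approximate, per the rationale in this section
VM_RAM_GB = {"gpu-vm-01": 48, "gpu-vm-02": 40, "gpu-vm-03": 40}


def ram_budget(host_gb, overhead_gb, vms):
    """Summarize usable vs planned RAM and the resulting overcommit."""
    usable = host_gb - overhead_gb
    planned = sum(vms.values())
    return {
        "usable_gb": usable,
        "planned_gb": planned,
        "overcommit_gb": planned - usable,
    }


if __name__ == "__main__":
    budget = ram_budget(HOST_RAM_GB, OS_OVERHEAD_GB, VM_RAM_GB)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

A positive `overcommit_gb` means the plan depends on memory reclamation. Trimming any one VM by the overcommit amount would bring the plan within the usable figure, and the reserved gpu-vm-04 will need an allocation of its own.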

Network: 2× 10G SFP+ (10Gtek Intel 82599ES NIC):

  • Port 1 → SX3008-01 (network plane VLAN, inter-service communication)
  • Port 2 → SX3008-02 (storage plane VLAN, NFS access to QNAP / future Ceph)

AI Services Stack (per VM):

| Service | VM | Purpose |
| --- | --- | --- |
| Ollama | gpu-vm-01, gpu-vm-02 | LLM serving with REST API |
| vLLM | gpu-vm-01 (optional) | High-throughput inference with batching |
| Faster-Whisper | gpu-vm-02 | ASR (speech-to-text) |
| Coqui TTS / Kokoro | gpu-vm-02 | Text-to-speech synthesis |
| Axolotl / Unsloth | gpu-vm-03 | Fine-tuning framework |
| OpenWebUI | K8s (anshin-dev-svc) | Web UI for all Ollama endpoints |
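
For clients that call Ollama directly rather than through OpenWebUI, a request might look like the sketch below. It uses only the Python standard library and Ollama's `/api/generate` endpoint on its default port 11434. The hostname and the model tag are hypothetical placeholders, since this document does not yet assign addresses to the GPU VMs.

```python
import json
import urllib.request

# HYPOTHETICAL endpoint: gpu-vm-01 has no address assigned in this document.
# 11434 is Ollama's default listening port.
OLLAMA_URL = "http://gpu-vm-01.example.internal:11434"

# ASSUMPTION: illustrative model tag; use whatever tag is pulled on the VM.
DEFAULT_MODEL = "llama3:70b"


def build_generate_request(base_url, model, prompt):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, timeout=120):
    """Send the request and return the generated text from the reply."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate(OLLAMA_URL, DEFAULT_MODEL, "Summarize this note.")` would return the completion as a string once the service is reachable. Building the request separately from sending it keeps the payload easy to inspect without a live endpoint.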

No GPU in OpenStack

The three HP ML350 servers that will form the OpenStack cluster (HP #2, #3, and eventually HP #1) contain no GPUs. All GPU workloads are exclusively on HP #4 (Proxmox). This is deliberate — Proxmox provides superior PCIe passthrough and GPU isolation compared to OpenStack Nova with libvirt passthrough.


Identity Infrastructure — FreeIPA [CURRENT STATE]

FreeIPA provides centralized identity across all environments:

  • Domain: anshinhealth.net
  • Realm: ANSHINHEALTH.NET
  • DCs: dc-01 (10.10.96.11, primary), dc-02 (10.10.96.12, replica)
  • Services: DNS (internal split-brain), LDAP, Kerberos, Certificate Authority, NTP
  • Integration: Automated host enrollment, external-dns RFC2136 updates, LDAP auth for Grafana and other services
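
Clients typically locate Kerberos and LDAP services through DNS SRV records under the domain. The sketch below lists the conventional discovery names for the domain above. It only builds the names and performs no lookups; whether each record is actually published in this deployment is an assumption to confirm against the FreeIPA DNS zone.

```python
DOMAIN = "anshinhealth.net"
REALM = "ANSHINHEALTH.NET"

# Conventional service-discovery labels for Kerberos and LDAP.
# ASSUMPTION: this deployment publishes the usual set of records.
SRV_LABELS = [
    ("_ldap", "_tcp"),
    ("_kerberos", "_tcp"),
    ("_kerberos", "_udp"),
    ("_kerberos-master", "_tcp"),
    ("_kerberos-master", "_udp"),
    ("_kpasswd", "_tcp"),
    ("_kpasswd", "_udp"),
]


def srv_record_names(domain):
    """Return fully qualified SRV record names for the given domain."""
    return [f"{service}.{proto}.{domain}." for service, proto in SRV_LABELS]


def realm_txt_record(domain, realm):
    """Return (name, value) for the TXT record that maps domain to realm."""
    return f"_kerberos.{domain}.", realm


if __name__ == "__main__":
    for name in srv_record_names(DOMAIN):
        print(name)
    print(*realm_txt_record(DOMAIN, REALM))
```

These names could be fed to a resolver of choice during host enrollment troubleshooting to confirm that dc-01 and dc-02 are both advertised.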

Network — Summary

| Phase | Status | Edge | Network Plane | Storage Plane | OOB Plane |
| --- | --- | --- | --- | --- | --- |
| Current Reality | ✅ NOW | Cisco router (flat /20, no VLANs) | 3× EX2200 (unmanaged) | Same flat network | Same flat network |
| Interim | 🔵 PENDING | Juniper SRX320 | Juniper EX2200-24T (VLANs configured) | Juniper EX2200-24T (isolated) | Juniper EX2200-24P |
| Target | 🔵 FUTURE | TP-Link ER7412-M2 | TL-SX3008F (10G) + TL-SG3428 (1G) | TL-SX3008F (10G, isolated) + TL-SG3428 (1G) | Juniper EX2200-24P (repurposed) |

See Network Architecture and VLAN & IP Allocation for full detail.


Phase-Based Implementation

Phase 1: Foundation [IN PROGRESS]

  • ✅ Proxmox on HP #1 — all current services running
  • ✅ FreeIPA domain controllers on dc-01/dc-02
  • ✅ K3s cluster (6 nodes) with MetalLB, cert-manager, ingress-nginx
  • ✅ Caddy reverse proxy, GitLab, NetBird VPN
  • 🔵 SRX320 deployment — replace Cisco bridge, enable VLAN segmentation
  • 🔵 EX2200 VLAN configuration (configs ready in sre-iac repo)

Phase 2: GPU Farm [COMING SOON]

  • 🔵 Power on HP #4 open-air GPU farm
  • 🔵 Install Proxmox VE (1TB USB SSD)
  • 🔵 Configure PCIe passthrough (VT-d) for 3 GPUs
  • 🔵 Deploy gpu-vm-01, gpu-vm-02, gpu-vm-03
  • 🔵 Connect 10G NICs to TP-Link switch fabric
  • 🔵 Deploy Ollama, Faster-Whisper, TTS services
  • 🔵 Deploy TP-Link ER7412-M2 as edge router
  • 🔵 Deploy 2× SX3008F (network + storage 10G planes)
  • 🔵 Deploy 2× SG3428 (1G access switches)
  • 🔵 Install 10Gtek dual SFP+ NICs in all ML350 servers
  • 🔵 Connect 1.5m DAC cables (8 total: 4 servers × 2 planes)
  • 🔵 Configure Omada SDN — VLAN enforcement, policies

Phase 4: OpenStack Core [FUTURE]

  • 🔵 Install Ubuntu 24.04 LTS on HP #2 and HP #3
  • 🔵 LACP bonding and VLAN config on new switch fabric
  • 🔵 Kolla-Ansible OpenStack deployment
  • 🔵 Ceph cluster initialization (SSD + HDD pools)
  • 🔵 Neutron networking with VLAN provider networks
  • 🔵 Migrate HP #1 from Proxmox to OpenStack
  • 🔵 Purchase and integrate HP #5 (backup/spare node)

Phase 5: Kubernetes on OpenStack [FUTURE]

  • 🔵 EKS Anywhere cluster deployment on Nova VMs
  • 🔵 Multi-cluster topology (dev, prod, infra)
  • 🔵 GitOps via ArgoCD
  • 🔵 Advanced OpenStack services (Trove, Octavia, Designate, Manila, RadosGW)

Operational Model

| Layer | Tooling | Purpose |
| --- | --- | --- |
| Infrastructure provisioning | Terraform + Atmos | VPC, networks, instances, volumes, object storage |
| Configuration management | Ansible | Host bootstrap, certificate deployment, domain enrollment |
| OpenStack deployment | Kolla-Ansible | Containerized OpenStack control plane lifecycle |
| Cluster management | EKS Anywhere + ArgoCD | Kubernetes cluster lifecycle and GitOps |
| Secrets management | Infisical | Centralized secrets with Kubernetes operator integration |
| Observability | kube-prometheus-stack | Prometheus, Grafana, Alertmanager, Karma, Blackbox Exporter |
| Certificate management | acme.sh + ZeroSSL | Wildcard ECDSA certificates via DNS challenge |

Document Control

Classification: Internal, Infrastructure Documentation
Compliance Scope: SOC 2 Type II, HIPAA Security Rule (design target)
Data Classification: No PHI; synthetic data only (Synthea) in staging environment
Review Cycle: Quarterly or upon significant infrastructure change

| Rev | Date | Author | Description |
| --- | --- | --- | --- |
| 1.0 | 2026-02-24 | Marc Mercer | Initial release |
| 2.0 | 2026-03-22 | Anshin Engineering | Restructured into Current/Future-OpenStack/Future-GPU sections; added HP#4 open-air GPU farm details, HP#5 planned node, Proxmox VM architecture recommendation, phase implementation status |