
OpenStack Platform Design

Source: Marc Mercer (SRE Lead) — sre-iac repository, Rev 1.0, 2026-02-24

Overview

OpenStack serves as the private cloud platform for Anshin Health's staging and development environment. It delivers self-service infrastructure provisioning — compute, networking, storage, load balancing, DNS, and database services — through standard APIs that align architecturally with public cloud patterns.

Why Kolla-Ansible: Every OpenStack service runs in its own Docker container, providing clean isolation, consistent versioning, and straightforward upgrades. The entire platform configuration lives in version-controlled Ansible playbooks and a single globals.yml file. Kolla-Ansible supports rolling upgrades between OpenStack releases.

Deployment Architecture

Three-Node Converged Design

Node | Hostname | Role
Server 1 | m35g9-stk01 | Control plane + Compute + Ceph OSD
Server 2 | m35g9-stk02 | Control plane + Compute + Ceph OSD
Server 3 | m35g9-stk03 | Control plane + Compute + Ceph OSD

Host OS: Ubuntu 24.04 LTS on all three nodes.

Infrastructure Service Clustering

Service | Cluster Type | Notes
MariaDB | Galera (3-node) | Synchronous multi-master replication
RabbitMQ | Mirrored cluster (3-node) | HA queues for OpenStack messaging
Memcached | Distributed (3-node) | Token and catalog caching for Keystone
HAProxy | Active/Passive with keepalived | VIP floats between nodes

Needs Input

HAProxy/keepalived VIP addresses for the control plane endpoint — not yet documented.

OpenStack release version (e.g., 2024.2 Dalmatian) — determines Kolla-Ansible branch.

globals.yml key settings to be documented:

  • kolla_base_distro (expected: ubuntu)
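  • kolla_install_type (expected: source; recent Kolla-Ansible releases dropped binary images and may no longer accept this option, so confirm against the chosen release)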
  • network_interface and neutron_external_interface assignments
  • enable_* flags for optional services
  • TLS configuration for internal and external endpoints
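
Until these settings are documented, a small checker can flag which of the listed keys a draft globals.yml still lacks. The following is a minimal sketch, not part of the sre-iac repository: the sample file contents and the interface name eth0 are hypothetical, the required-key list is an assumption drawn from the bullets above, and the parser only understands flat `key: value` lines (values containing `#` are not handled).

```python
# Keys taken from the "to be documented" list above (assumed to be mandatory).
REQUIRED_KEYS = ("kolla_base_distro", "network_interface", "neutron_external_interface")

# Hypothetical draft of globals.yml, for illustration only.
SAMPLE = """
kolla_base_distro: "ubuntu"
network_interface: "eth0"     # hypothetical interface name
enable_octavia: "yes"
enable_designate: "no"
"""


def parse_flat_yaml(text):
    """Parse top-level `key: value` lines only; this is not a general YAML parser."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def review(settings):
    """Return (missing required keys, enable_* flags set to yes)."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    enabled = sorted(
        key for key, value in settings.items()
        if key.startswith("enable_") and value.lower() == "yes"
    )
    return missing, enabled


if __name__ == "__main__":
    missing, enabled = review(parse_flat_yaml(SAMPLE))
    print("missing:", missing)
    print("enabled:", enabled)
```

For the sample draft, the checker reports neutron_external_interface as missing and enable_octavia as the only enabled optional service.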

Compute — Nova

Total Compute Capacity

Resource | Per Node | Total (3 Nodes)
CPU Cores | 36 physical (72 threads) | 108 physical (216 threads)
RAM | 512 GB DDR4 | 1,536 GB (1.5 TB)
CPU Model | Intel Xeon E5-2697v4 @ 2.3 GHz | Broadwell-EP, 45 MB L3 cache

At a typical 4:1 vCPU overcommit ratio applied to the 216 hardware threads, this yields ~864 schedulable vCPUs. After reserving ~20% for Ceph and OpenStack control plane overhead, approximately 690 vCPUs and ~1.2 TB RAM remain for tenant workloads.
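
The arithmetic behind these figures can be reproduced with a short calculation. This is a sketch under the assumptions stated in the text: overcommit is applied to hardware threads rather than physical cores, and the same ~20% reservation is applied to both vCPUs and RAM. The function and parameter names are illustrative.

```python
NODES = 3
THREADS_PER_NODE = 72      # 36 physical cores, 2 threads each
RAM_GB_PER_NODE = 512
CPU_OVERCOMMIT = 4         # the 4:1 vCPU ratio quoted in the text
RESERVED_PERCENT = 20      # Ceph + control plane reservation (approximate)


def tenant_capacity(nodes=NODES, threads=THREADS_PER_NODE, ram_gb=RAM_GB_PER_NODE,
                    overcommit=CPU_OVERCOMMIT, reserved_percent=RESERVED_PERCENT):
    """Return schedulable and tenant-available capacity using integer math."""
    schedulable_vcpus = nodes * threads * overcommit
    total_ram_gb = nodes * ram_gb
    keep = 100 - reserved_percent
    return {
        "schedulable_vcpus": schedulable_vcpus,
        "tenant_vcpus": schedulable_vcpus * keep // 100,
        "tenant_ram_gb": total_ram_gb * keep // 100,
    }


if __name__ == "__main__":
    print(tenant_capacity())
```

With the defaults, the result is 864 schedulable vCPUs, 691 tenant vCPUs, and 1228 GB of tenant RAM, which matches the rounded ~690 vCPU and ~1.2 TB figures.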

Nova Flavor Sizing Reference

# AWS Instance Types → OpenStack Flavors (ram in MiB, disk in GiB, per Nova convention)
t3.micro: { vcpus: 1, ram: 1024, disk: 8 }
t3.small: { vcpus: 1, ram: 2048, disk: 20 }
t3.medium: { vcpus: 2, ram: 4096, disk: 30 }
t3.large: { vcpus: 2, ram: 8192, disk: 40 }
m5.large: { vcpus: 2, ram: 8192, disk: 40 }
m5.xlarge: { vcpus: 4, ram: 16384, disk: 80 }
c5.xlarge: { vcpus: 4, ram: 8192, disk: 40 }
r5.xlarge: { vcpus: 4, ram: 32768, disk: 40 }
p3.xlarge: { vcpus: 4, ram: 16384, disk: 40, gpu: 1 } # RTX 3090

Needs Input

Flavor names and sizing are pending a decision, including whether to use AWS naming or independent naming.
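
Once the naming decision is made, the sizing reference could be turned into `openstack flavor create` commands mechanically. The sketch below assumes the AWS-style names are kept and shows only two entries from the reference; the helper name is illustrative. The GPU flavor is left out because the passthrough properties it would need are not documented here.

```python
import shlex

# Two entries copied from the sizing reference (ram in MiB, disk in GiB).
FLAVORS = {
    "t3.micro": {"vcpus": 1, "ram": 1024, "disk": 8},
    "m5.xlarge": {"vcpus": 4, "ram": 16384, "disk": 80},
}


def flavor_create_command(name, spec):
    """Build the shell command that would create one flavor."""
    args = [
        "openstack", "flavor", "create",
        "--vcpus", str(spec["vcpus"]),
        "--ram", str(spec["ram"]),
        "--disk", str(spec["disk"]),
        name,
    ]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell


if __name__ == "__main__":
    for name, spec in FLAVORS.items():
        print(flavor_create_command(name, spec))
```

Generating the commands as text, rather than calling the API directly, keeps the output reviewable before anything is applied to the cloud.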

Networking — Neutron/OVN

Neutron uses OVN (Open Virtual Network) as its ML2 mechanism driver, providing distributed virtual routing and switching without separate network nodes.

VLAN | Subnet | Purpose | Routed?
VLAN 200 | 10.20.0.0/24 | OpenStack control plane (API, messaging, DB) | No (management only)
VLAN 300 | 10.30.0.0/24 | Ceph public network (client-to-OSD) | No (storage only)
VLAN 400 | 10.40.0.0/24 | Ceph cluster network (OSD-to-OSD replication) | No (storage only)

Tenant VLAN range: 1000-1099 on physical network physnet1 (100 tenant VLANs available).
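
The tenant range can be sanity-checked and rendered in the `<physnet>:<min>:<max>` form that the ML2 `network_vlan_ranges` option expects. This is a small sketch; where and how Kolla-Ansible injects the value is not covered in this document, and the helper names are illustrative.

```python
PHYSNET = "physnet1"
VLAN_MIN, VLAN_MAX = 1000, 1099
INFRA_VLANS = {200, 300, 400}  # control plane, Ceph public, Ceph cluster


def tenant_vlan_count(vlan_min=VLAN_MIN, vlan_max=VLAN_MAX):
    """Number of VLAN IDs in the inclusive range."""
    return vlan_max - vlan_min + 1


def ml2_vlan_range(physnet=PHYSNET, vlan_min=VLAN_MIN, vlan_max=VLAN_MAX):
    """Render the range as <physical_network>:<vlan_min>:<vlan_max>."""
    return f"{physnet}:{vlan_min}:{vlan_max}"


def overlaps_infrastructure(vlan_min=VLAN_MIN, vlan_max=VLAN_MAX, infra=INFRA_VLANS):
    """True if any infrastructure VLAN falls inside the tenant range."""
    return any(vlan_min <= vlan <= vlan_max for vlan in infra)


if __name__ == "__main__":
    print(tenant_vlan_count())
    print(ml2_vlan_range())
    print(overlaps_infrastructure())
```

The range is inclusive at both ends, which is why 1000-1099 gives 100 VLANs rather than 99, and none of the infrastructure VLANs fall inside it.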

Needs Input
  • Provider network configuration (external/floating IP network)
  • Default security group rules
  • OVN northbound/southbound DB placement and chassis configuration
  • DHCP and metadata agent configuration
  • DNS integration with Designate

Storage — Ceph

SSD Pool (High-Performance)

Parameter | Value
Disks per node | 8 × 1 TB SSD
Total nodes | 3
Raw capacity | 24 TB
Replication factor | RF=2
Usable capacity | ~12 TB
Use case | Cinder block volumes, VM root disks, database volumes

HDD Pool (Capacity)

Parameter | Value
Disks per node | 16 × 1 TB HDD
Total nodes | 3
Raw capacity | 48 TB
Replication factor | RF=2
Usable capacity | ~24 TB
Use case | Bulk storage, RadosGW object storage, backups, cold data

Why RF=2 (not RF=3): Three nodes means RF=3 provides no additional failure domain protection — losing one node already risks availability regardless of replica count. RF=2 maximizes usable capacity for a staging environment with synthetic data.
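
The capacity trade-off between the two replication factors can be made concrete with a quick calculation. This sketch divides raw capacity by the replica count and ignores filesystem overhead and the free-space headroom Ceph needs in practice, so real usable figures will be somewhat lower. The function name is illustrative.

```python
def pool_capacity_tb(disks_per_node, disk_tb=1, nodes=3, replicas=2):
    """Return (raw TB, usable TB) for a replicated pool."""
    raw = disks_per_node * disk_tb * nodes
    return raw, raw / replicas


if __name__ == "__main__":
    for label, disks in (("SSD", 8), ("HDD", 16)):
        for rf in (2, 3):
            raw, usable = pool_capacity_tb(disks, replicas=rf)
            print(f"{label} RF={rf}: raw {raw} TB, usable {usable:.0f} TB")
```

Under these assumptions, moving from RF=2 to RF=3 would reduce usable capacity from 12 TB to 8 TB on the SSD pool and from 24 TB to 16 TB on the HDD pool.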

Ceph Services

Service | Backend | Purpose
RBD | Cinder | Block storage for VM disks, allows SSD vs HDD volume types
RadosGW | S3 API | S3-compatible object storage; the same boto3/AWS CLI code works
CephFS | Manila | POSIX-compatible shared filesystem (NFS-compatible shares)

Network | VLAN | Subnet | Purpose
Public | VLAN 300 | 10.30.0.0/24 | Client-to-OSD traffic
Cluster | VLAN 400 | 10.40.0.0/24 | OSD-to-OSD replication, recovery

Needs Input
  • Pool names and assignment (e.g., ssd-volumes, hdd-volumes, rgw-data)
  • CRUSH rules mapping pools to device classes (SSD vs. HDD)
  • RadosGW endpoint URL and realm/zone configuration
  • MON and MGR placement
  • Ceph Dashboard enablement

Identity — Keystone

Keystone is federated with FreeIPA for centralized identity management. User authentication is handled by FreeIPA (LDAP-backed) — not Keystone's local database.

Server | Hostname | Role
dc-01 | dc-01.anshinhealth.net | Primary IPA server
dc-02 | dc-02.anshinhealth.net | IPA replica

Keystone Projects

Project | Purpose
admin | Platform administration
dev | Development cluster resources (EKS Anywhere dev)
prod | Production cluster resources (EKS Anywhere prod)
infra | Infrastructure services cluster resources

Advanced Services

Service | AWS Equivalent | Use Cases
Octavia | ALB / NLB | EKS Anywhere load balancers, L4/L7 balancing
Designate | Route 53 | Automatic DNS for floating IPs, external-dns integration
Trove | RDS | Managed PostgreSQL, MySQL, Redis instances
Heat | CloudFormation | Stack templates, auto-scaling groups
Manila | EFS | NFS-compatible shared filesystems via CephFS

GPU Node Integration

Server 4 (m35g9-pmx01) is kept outside of OpenStack management because:

  • Direct GPU access for ML/AI workloads is simpler outside Nova's virtualization layer
  • GPU workloads benefit from dedicated, non-shared hardware access
  • Separate lifecycle from the OpenStack platform

The GPU node connects to the Ceph public network (VLAN 300), so it can mount RBD volumes and CephFS and use RadosGW (S3). It does NOT connect to the Ceph cluster network (VLAN 400) because it hosts no OSDs.


Document Control

Rev | Date | Author | Description
1.0 | 2026-02-24 | Marc Mercer | Initial release