Monitoring Stack

Source: Marc Mercer (SRE Lead), sre-iac repository, Rev 1.0, 2026-01-24

This document covers the Anshin Platform monitoring stack deployed in the monitoring namespace of the K3s cluster. All monitoring services are accessible only via VPN.

Architecture

┌────────────────────────────────────────────────────────────────┐
│                      monitoring namespace                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │  Prometheus  │───▶│ AlertManager │───▶│  Cliq Trans  │──────┼──▶ Zoho Cliq
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│         │                   │                                  │
│         │                   ▼                                  │
│         │            ┌──────────────┐                          │
│         │            │    Karma     │                          │
│         │            └──────────────┘                          │
│         ▼                                                      │
│  ┌──────────────┐    ┌──────────────┐                          │
│  │   Grafana    │    │   Blackbox   │                          │
│  └──────────────┘    │   Exporter   │                          │
│                      └──────────────┘                          │
└────────────────────────────────────────────────────────────────┘

Access URLs

All URLs are VPN-only and are not exposed to the public internet.

| Service | URL | Purpose |
|---|---|---|
| Grafana | https://grafana.mon.anshinhealth.net | Dashboards & visualization |
| Prometheus | https://prometheus.mon.anshinhealth.net | Metrics storage & queries |
| AlertManager | https://alertmanager.mon.anshinhealth.net | Alert routing & silencing |
| Karma | https://karma.mon.anshinhealth.net | Alert dashboard (read-only view) |
| K8s Dashboard | https://dashboard.mon.anshinhealth.net | Kubernetes cluster overview |
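
A quick way to confirm VPN connectivity is to request each URL from the table and print the HTTP status. The following is a minimal sketch, assuming curl is installed; the function names are illustrative and not part of the sre-iac repository. A status of 000 means curl received no response, which usually points at VPN or DNS rather than the service itself.

```shell
# Print the five monitoring URLs from the Access URLs table, one per line.
monitoring_urls() {
  for svc in grafana prometheus alertmanager karma dashboard; do
    printf 'https://%s.mon.anshinhealth.net\n' "$svc"
  done
}

# Request each URL and print "<status> <url>".
check_monitoring_urls() {
  monitoring_urls | while IFS= read -r url; do
    # -o /dev/null discards the body; %{http_code} prints only the status code.
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")
    printf '%s %s\n' "$code" "$url"
  done
}

# Usage (over VPN):
#   check_monitoring_urls
```

Redirects to a login page (3xx) still indicate that the endpoint is reachable.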

Components

kube-prometheus-stack

Helm chart (prometheus-community/kube-prometheus-stack) deploying the full Prometheus Operator ecosystem:

| Component | Purpose |
|---|---|
| Prometheus Operator | Manages Prometheus and AlertManager lifecycle via CRDs |
| Prometheus | Metrics collection and storage |
| AlertManager | Alert deduplication, routing, and notification |
| Grafana | Dashboards and visualization |
| node-exporter | Host-level metrics (DaemonSet on all nodes) |
| kube-state-metrics | Kubernetes object state metrics (cluster-wide) |

Values file: kubernetes/monitoring/helm-values/kube-prometheus-stack-values.yaml

Key configuration settings:

| Setting | Value |
|---|---|
| prometheus.prometheusSpec.retention | 30d |
| prometheus.prometheusSpec.retentionSize | 50GB |
| alertmanager.config.receivers | Zoho Cliq webhook (via Cliq Translator) |

Blackbox Exporter

HTTP/HTTPS/TCP endpoint monitoring. Enabled within the kube-prometheus-stack deployment.

  • Alert rules: rules/blackbox-alerts.yaml
  • Dashboard: dashboards/blackbox-dashboard-configmap.yaml
  • Scrape config: prometheus.prometheusSpec.additionalScrapeConfigs → blackbox-internal-https

Monitored endpoints:

| Zone | Endpoints Checked |
|---|---|
| mon.anshinhealth.net | Monitoring services |
| dev.anshinhealth.net | Dev environment services |
| svcs.anshinhealth.net | Speech and shared services |

Karma

AlertManager dashboard providing a read-only view of all active alerts.

  • Values file: karma/helm-values/karma-values.yaml
  • Connects to: http://alertmanager-operated.monitoring.svc.cluster.local:9093

Cliq Translator

Flask application that translates AlertManager webhook payloads to Zoho Cliq format.

  • Deployment: cliq-translator/deployment.yaml

  • Build:

    cd cliq-translator
    docker build -t registry.anshinhealth.net/sre/iac/cliq-translator:v1 .
    docker push registry.anshinhealth.net/sre/iac/cliq-translator:v1
Registry Path Fix Needed

Cliq Translator is currently paused pending a registry image path fix. See CURRENT-SERVICES.md for current status.
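
The translation step can be pictured as a mapping from the AlertManager webhook JSON to a single chat message. The sketch below expresses that mapping with jq purely for illustration: the real translator is a Flask application whose output format is not documented here, the function name is hypothetical, and the `text` field is an assumption about what the Zoho Cliq incoming webhook expects.

```shell
# Illustrative only: collapse an AlertManager webhook payload (read from stdin)
# into one message object. The {"text": ...} shape is an assumption, not the
# translator's documented output.
to_cliq_message() {
  jq -c '(.status | ascii_upcase) as $s
    | {text: ([.alerts[] | "[\($s)] \(.labels.alertname): \(.annotations.summary)"] | join("\n"))}'
}

# Usage:
#   echo '{"status":"firing","alerts":[{"labels":{"alertname":"Test"},"annotations":{"summary":"Test alert"}}]}' | to_cliq_message
#   prints {"text":"[FIRING] Test: Test alert"}
```

Each alert in the payload becomes one line of the message, prefixed with the overall status.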


What's Monitored

Kubernetes Cluster Metrics

  • Pod health and restart counts
  • CPU and memory utilization (node and pod level)
  • Persistent volume usage
  • API server health
  • Control plane component availability

Anshin Application Services

Scraped from the anshin-dev-svc and anshin-prod-svc namespaces:

anna, aqb, auth, dmdc, etl, facts, sdr, vrbu

HTTP Endpoint Probes

Blackbox Exporter probes all service endpoints for HTTP 2xx responses, TLS certificate validity, and response time.
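
The three checks can be inspected directly through the Prometheus HTTP API. The sketch below is a hedged example: it assumes the probes are recorded under the blackbox-internal-https job named in the scrape config, and that the standard Blackbox Exporter metrics (probe_success, probe_ssl_earliest_cert_expiry, probe_duration_seconds) are available. The function names are illustrative.

```shell
PROM_URL="https://prometheus.mon.anshinhealth.net"

# One PromQL expression per check: failed probes, days until certificate
# expiry, and probe duration in seconds.
probe_queries() {
  cat <<'EOF'
probe_success{job="blackbox-internal-https"} == 0
(probe_ssl_earliest_cert_expiry{job="blackbox-internal-https"} - time()) / 86400
probe_duration_seconds{job="blackbox-internal-https"}
EOF
}

# Run one PromQL expression as an instant query and print the raw JSON response.
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage (over VPN): run all three queries in turn.
#   probe_queries | while IFS= read -r q; do prom_query "$q"; echo; done
```

An empty result for the first query means no probe in that job is currently failing.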


Alerting

Alert Flow

Prometheus → AlertManager → Cliq Translator → Zoho Cliq
                  |
                  ▼
    Karma (real-time alert view)

Alert Severities

| Severity | Description | Expected Response |
|---|---|---|
| Critical | Service down, data loss risk | Immediate action required |
| Warning | Degraded performance, approaching limits | Investigate within 1 hour |
| Info | Notable events, non-urgent | Review during business hours |

Alert Destinations

  • Zoho Cliq #alerts channel: all Critical and Warning alerts
  • Karma Dashboard: real-time view of all active alerts

Silencing During Maintenance

  1. Go to the AlertManager UI (https://alertmanager.mon.anshinhealth.net)
  2. Click Silence → New Silence
  3. Add matchers (e.g., namespace=maintenance)
  4. Set duration and comment
  5. Click Create
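
The same silence can also be created without the UI through the Alertmanager v2 HTTP API. This is a minimal sketch, assuming the API is reachable at the AlertManager URL above without additional authentication; the function names and the example values are hypothetical.

```shell
# Build the JSON body for POST /api/v2/silences.
# Arguments: label name, label value, start (RFC 3339), end (RFC 3339), author, comment.
# Values are inserted verbatim, so they must not contain double quotes or backslashes.
silence_payload() {
  printf '{"matchers":[{"name":"%s","value":"%s","isRegex":false}],"startsAt":"%s","endsAt":"%s","createdBy":"%s","comment":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Send the payload to AlertManager; takes the same six arguments.
create_silence() {
  silence_payload "$@" | curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data @- \
    https://alertmanager.mon.anshinhealth.net/api/v2/silences
}

# Usage (over VPN), silencing namespace=maintenance for a two-hour window:
#   create_silence namespace maintenance 2026-02-01T22:00:00Z 2026-02-02T00:00:00Z sre "Planned maintenance"
```

Passing the start and end times explicitly keeps the window visible in shell history, which helps when reviewing who silenced what.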

Grafana Dashboards

Available Dashboards

| Folder | Dashboard | Content |
|---|---|---|
| Anshin Platform | Service Health Overview | Health status across all services |
| Anshin Platform | API Latency | Request latency by service and endpoint |
| Anshin Platform | Error Rates | Error rates by service |
| Uptime Monitoring | Blackbox Exporter - Internal Endpoints | HTTP probe status |
| Uptime Monitoring | SSL Certificate Expiry | Days until cert expiry per domain |
| Uptime Monitoring | Response Time Trends | Historical response time trends |
| Kubernetes (built-in) | Cluster Resource Utilization | Node/pod CPU and memory |
| Kubernetes (built-in) | Node Health | Node status and conditions |
| Kubernetes (built-in) | Pod Status | Pod lifecycle and restart tracking |

Creating Custom Dashboards

  1. Log in to Grafana
  2. Click + → New Dashboard
  3. Add panels using Prometheus as the data source
  4. Save to the appropriate folder

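Backing Up Grafana Dashboards

Provisioned dashboards live in ConfigMaps. Custom dashboards exist only in Grafana's database, so export them before upgrading:

# Assumes GRAFANA_TOKEN holds a Grafana API token with read access (not defined in this document).
# type=dash-db limits the search to dashboards, so folders are not exported by mistake.
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" "https://grafana.mon.anshinhealth.net/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" "https://grafana.mon.anshinhealth.net/api/dashboards/uid/$uid" > "dashboard-$uid.json"
done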

Secrets Management

All secrets are managed via Infisical (see Infisical Secrets Management).

| Secret Key | Description | Used By |
|---|---|---|
| grafana-admin-user | Grafana admin username | Grafana |
| grafana-admin-password | Grafana admin password | Grafana |
| zoho-cliq-webhook-url | Zoho Cliq incoming webhook URL | Cliq Translator |

Infisical project: anshin-infrastructure-o-p-if | Environment: prod | Path: /monitoring
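
To confirm that all three keys were synced into the cluster, the key names in the Kubernetes secret can be compared against the list above. The sketch below is a hedged example: it takes the secret name monitoring-infisical-secrets from the Troubleshooting section and assumes the synced secret uses the same key names as Infisical. The function name is illustrative.

```shell
expected_keys="grafana-admin-user grafana-admin-password zoho-cliq-webhook-url"

# Reads the actual key names on stdin (one per line) and prints every
# expected key that is absent. Prints nothing when all keys are present.
missing_keys() {
  actual=$(cat)
  for key in $expected_keys; do
    printf '%s\n' "$actual" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}

# Usage:
#   kubectl get secret monitoring-infisical-secrets -n monitoring -o json \
#     | jq -r '.data | keys[]' | missing_keys
```

Listing only the key names avoids printing the secret values to the terminal.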


TLS Certificates

Wildcard cert *.mon.anshinhealth.net deployed to the monitoring namespace as:


Deployment / Initial Setup

Prerequisites

  1. Infisical Operator deployed (see Infisical Secrets Management)
  2. TLS certificate for *.mon.anshinhealth.net issued and deployed
  3. DNS zone mon.anshinhealth.net configured in FreeIPA
  4. VPN access to cluster
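
Beyond these prerequisites, the commands in this document rely on a few local tools (kubectl, helm, ansible-playbook). A small preflight check can catch a missing tool before the deployment starts. This is a sketch only; the function name is hypothetical and the tool list is inferred from the commands shown in this document.

```shell
# Print every named command that is not found on PATH.
missing_tools() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Usage: prints nothing when everything is installed.
#   missing_tools kubectl helm ansible-playbook
#
# VPN access can then be confirmed with a read-only call such as:
#   kubectl get ns monitoring
```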

Deployment Order

# 1. Deploy Infisical Operator (if not already done)
cd ../infisical
./deploy.sh

# 2. Create Infisical secrets in console:
# - grafana-admin-user
# - grafana-admin-password
# - zoho-cliq-webhook-url

# 3. Apply InfisicalSecret CR
kubectl apply -f ../infisical/infisical-secret-monitoring.yaml

# 4. Deploy certificates via Ansible
cd /path/to/ansible
ansible-playbook playbooks/certificates.yml -l k8s --tags monitoring

# 5. Deploy full monitoring stack
cd ../kubernetes/monitoring
./deploy.sh all

Adding New Monitored Endpoints

Edit helm-values/kube-prometheus-stack-values.yaml:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'blackbox-internal-https'
        static_configs:
          - targets:
              # Add new endpoint here
              - https://new-service.mon.anshinhealth.net

Then upgrade:

./deploy.sh prometheus

Adding New Alert Rules

  1. Create a PrometheusRule manifest in the rules/ directory
  2. Apply it:

    kubectl apply -f rules/new-rule.yaml

Example rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: my-service
      rules:
        - alert: MyServiceDown
          expr: up{job="my-service"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.job }} is down"

Upgrading kube-prometheus-stack

helm repo update
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values helm-values/kube-prometheus-stack-values.yaml \
  --wait

Troubleshooting

Alerts not reaching Zoho Cliq

  1. Check Cliq Translator logs:

    kubectl logs -n monitoring -l app=alertmanager-cliq-translator
  2. Verify the Infisical secret exists:

    kubectl get secret monitoring-infisical-secrets -n monitoring -o yaml
  3. Test the webhook manually:

    kubectl port-forward -n monitoring svc/alertmanager-cliq-translator 5000:5000
    curl -X POST http://localhost:5000/webhook \
      -H "Content-Type: application/json" \
      -d '{"status":"firing","alerts":[{"labels":{"alertname":"Test"},"annotations":{"summary":"Test alert"}}]}'

Prometheus not scraping targets

  1. Check that a ServiceMonitor or PodMonitor resource exists for the target
  2. Verify the selector labels match the service
  3. Check the Prometheus targets page: https://prometheus.mon.anshinhealth.net/targets
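
The information on the targets page is also available from the Prometheus HTTP API, which is convenient when many targets are listed. The sketch below filters the API response down to unhealthy targets; it assumes jq is installed, and the function name is illustrative.

```shell
# Reads the JSON from /api/v1/targets on stdin and prints one tab-separated
# line per unhealthy target: job, scrape URL, last error.
down_targets() {
  jq -r '.data.activeTargets[]
    | select(.health != "up")
    | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
}

# Usage (over VPN):
#   curl -s https://prometheus.mon.anshinhealth.net/api/v1/targets | down_targets
```

A target that does not appear in the API response at all usually means no ServiceMonitor, PodMonitor, or scrape config selects it, which points back to steps 1 and 2.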

Grafana cannot connect to Prometheus

  1. Check the Prometheus datasource configuration in Grafana (Settings → Data Sources)
  2. Verify Prometheus service is running:
    kubectl get svc -n monitoring | grep prometheus

Check if a service is healthy

  1. Go to Karma → look for active alerts for the service
  2. Or: Grafana → Blackbox dashboard → check endpoint probe status

Document Control

| Rev | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-01-24 | Marc Mercer | Initial release |