Monitoring Stack
Source: Marc Mercer (SRE Lead), sre-iac repository, Rev 1.0, 2026-01-24
This document covers the Anshin Platform monitoring stack deployed in the monitoring namespace of the K3s cluster. All monitoring services are accessible only via VPN.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                      monitoring namespace                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Prometheus  │───▶│ AlertManager │───▶│  Cliq Trans  │───────┼──▶ Zoho Cliq
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         │                   │                                   │
│         │                   ▼                                   │
│         │            ┌──────────────┐                           │
│         │            │    Karma     │                           │
│         │            └──────────────┘                           │
│         ▼                                                       │
│  ┌──────────────┐    ┌──────────────┐                           │
│  │   Grafana    │    │   Blackbox   │                           │
│  └──────────────┘    │   Exporter   │                           │
│                      └──────────────┘                           │
└─────────────────────────────────────────────────────────────────┘
Access URLs
All URLs are VPN-only and are not exposed to the public internet.
| Service | URL | Purpose |
|---|---|---|
| Grafana | https://grafana.mon.anshinhealth.net | Dashboards & visualization |
| Prometheus | https://prometheus.mon.anshinhealth.net | Metrics storage & queries |
| AlertManager | https://alertmanager.mon.anshinhealth.net | Alert routing & silencing |
| Karma | https://karma.mon.anshinhealth.net | Alert dashboard (read-only view) |
| K8s Dashboard | https://dashboard.mon.anshinhealth.net | Kubernetes cluster overview |
Components
kube-prometheus-stack
Helm chart (prometheus-community/kube-prometheus-stack) deploying the full Prometheus Operator ecosystem:
| Component | Purpose |
|---|---|
| Prometheus Operator | Manages Prometheus and AlertManager lifecycle via CRDs |
| Prometheus | Metrics collection and storage |
| AlertManager | Alert deduplication, routing, and notification |
| Grafana | Dashboards and visualization |
| node-exporter | Host-level metrics (DaemonSet on all nodes) |
| kube-state-metrics | Kubernetes object state metrics (cluster-wide) |
Values file: kubernetes/monitoring/helm-values/kube-prometheus-stack-values.yaml
Key configuration settings:
| Setting | Value |
|---|---|
| prometheus.prometheusSpec.retention | 30d |
| prometheus.prometheusSpec.retentionSize | 50GB |
| alertmanager.config.receivers | Zoho Cliq webhook (via Cliq Translator) |
Blackbox Exporter
HTTP/HTTPS/TCP endpoint monitoring. Enabled within the kube-prometheus-stack deployment.
- Alert rules: rules/blackbox-alerts.yaml
- Dashboard: dashboards/blackbox-dashboard-configmap.yaml
- Scrape config: prometheus.prometheusSpec.additionalScrapeConfigs → blackbox-internal-https
Monitored endpoints:
| Zone | Endpoints Checked |
|---|---|
| mon.anshinhealth.net | Monitoring services |
| dev.anshinhealth.net | Dev environment services |
| svcs.anshinhealth.net | Speech and shared services |
Karma
AlertManager dashboard providing a read-only view of all active alerts.
- Values file: karma/helm-values/karma-values.yaml
- Connects to: http://alertmanager-operated.monitoring.svc.cluster.local:9093
Cliq Translator
Flask application that translates AlertManager webhook payloads to Zoho Cliq format.
- Deployment: cliq-translator/deployment.yaml
- Build:
  cd cliq-translator
  docker build -t registry.anshinhealth.net/sre/iac/cliq-translator:v1 .
  docker push registry.anshinhealth.net/sre/iac/cliq-translator:v1
Cliq Translator is currently paused pending a registry image path fix. See CURRENT-SERVICES.md for current status.
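As a rough illustration of the translation step, the sketch below flattens an AlertManager webhook payload into a single text message. The function name to_cliq_message and the {"text": ...} output shape are assumptions made for this example; the actual Cliq Translator source and the exact payload Zoho Cliq expects are not shown in this document.

```python
def to_cliq_message(payload: dict) -> dict:
    """Flatten an AlertManager webhook payload into one text message.

    Hypothetical helper: the {"text": ...} shape is an assumed target
    format, not taken from the Cliq Translator source.
    """
    status = payload.get("status", "unknown").upper()
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unnamed")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        # One line per alert: status, alert name, severity, summary
        lines.append(f"[{status}] {name} ({severity}): {summary}")
    return {"text": "\n".join(lines)}
```

Keeping the translation in a pure function like this makes it easy to unit test without a running Flask app or a real webhook URL.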
What's Monitored
Kubernetes Cluster Metrics
- Pod health and restart counts
- CPU and memory utilization (node and pod level)
- Persistent volume usage
- API server health
- Control plane component availability
Anshin Application Services
Scraped from anshin-dev-svc and anshin-prod-svc namespaces:
anna, aqb, auth, dmdc, etl, facts, sdr, vrbu
HTTP Endpoint Probes
Blackbox Exporter probes all service endpoints for HTTP 2xx responses, TLS certificate validity, and response time.
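Probe results can also be read outside Grafana through the Prometheus HTTP API. The sketch below is a minimal example under a few assumptions: the helper names are hypothetical, the base URL is the one from the Access URLs table, the caller is on the VPN, and the endpoint needs no extra authentication. probe_success is the standard Blackbox Exporter metric (1 for a passing probe, 0 for a failing one).

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the Access URLs table; reachable only via VPN.
PROMETHEUS_URL = "https://prometheus.mon.anshinhealth.net"


def instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{query}", timeout=10) as resp:
        return json.load(resp)


def failing_targets(response: dict) -> list:
    """Return the sorted `instance` labels of all series whose value is 0."""
    failing = []
    for series in response.get("data", {}).get("result", []):
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) == 0.0:
            failing.append(series["metric"].get("instance", "<unknown>"))
    return sorted(failing)
```

A typical call would be failing_targets(instant_query("probe_success")), which lists every endpoint whose most recent probe failed.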
Alerting
Alert Flow
Prometheus → AlertManager → Cliq Translator → Zoho Cliq
                   |
                   ▼
        Karma (real-time alert view)
Alert Severities
| Severity | Description | Expected Response |
|---|---|---|
| Critical | Service down, data loss risk | Immediate action required |
| Warning | Degraded performance, approaching limits | Investigate within 1 hour |
| Info | Notable events, non-urgent | Review during business hours |
Alert Destinations
- Zoho Cliq #alerts channel: all Critical and Warning alerts
- Karma Dashboard: real-time view of all active alerts
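The routing described above can be summarized in a few lines. This is only a sketch of the stated policy, not the AlertManager route configuration itself; the function name and the destination strings are hypothetical.

```python
def destinations(severity: str) -> list:
    """Return where an alert of the given severity is expected to appear."""
    level = severity.strip().lower()
    targets = ["karma"]  # Karma shows every active alert
    if level in {"critical", "warning"}:
        targets.append("zoho-cliq")  # #alerts channel
    return targets
```

Under this policy an Info alert is visible in Karma only, which matches its "review during business hours" expectation.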
Silencing During Maintenance
- Go to AlertManager UI (https://alertmanager.mon.anshinhealth.net)
- Click Silence → New Silence
- Add matchers (e.g., namespace=maintenance)
- Set duration and comment
- Click Create
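The same silence can be created programmatically through the AlertManager v2 API (POST /api/v2/silences), which is convenient for scripted maintenance windows. The sketch below assumes VPN access and no additional authentication in front of AlertManager; the helper names build_silence and create_silence are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Base URL taken from the Access URLs table; reachable only via VPN.
ALERTMANAGER_URL = "https://alertmanager.mon.anshinhealth.net"


def build_silence(matchers: dict, hours: float, author: str, comment: str, now=None) -> dict:
    """Build a request body for POST /api/v2/silences with exact-match matchers."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(matchers.items())
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def create_silence(body: dict, base_url: str = ALERTMANAGER_URL) -> str:
    """POST the silence and return the silence ID reported by AlertManager."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)["silenceID"]
```

For example, create_silence(build_silence({"namespace": "maintenance"}, 2, "sre", "planned upgrade")) would silence matching alerts for two hours.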
Grafana Dashboards
Available Dashboards
| Folder | Dashboard | Content |
|---|---|---|
| Anshin Platform | Service Health Overview | Health status across all services |
| Anshin Platform | API Latency | Request latency by service and endpoint |
| Anshin Platform | Error Rates | Error rates by service |
| Uptime Monitoring | Blackbox Exporter - Internal Endpoints | HTTP probe status |
| Uptime Monitoring | SSL Certificate Expiry | Days until cert expiry per domain |
| Uptime Monitoring | Response Time Trends | Historical response time trends |
| Kubernetes (built-in) | Cluster Resource Utilization | Node/pod CPU and memory |
| Kubernetes (built-in) | Node Health | Node status and conditions |
| Kubernetes (built-in) | Pod Status | Pod lifecycle and restart tracking |
Creating Custom Dashboards
- Log in to Grafana
- Click + → New Dashboard
- Add panels using Prometheus as the data source
- Save to the appropriate folder
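Backing Up Grafana Dashboards
Provisioned dashboards live in ConfigMaps. For custom dashboards, export before upgrading:
# GRAFANA_TOKEN is an assumed variable holding a Grafana API token
# type=dash-db limits the search to dashboards (folders are skipped)
for uid in $(curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" "https://grafana.mon.anshinhealth.net/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" "https://grafana.mon.anshinhealth.net/api/dashboards/uid/$uid" > "dashboard-$uid.json"
done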
Secrets Management
All secrets are managed via Infisical (see Infisical Secrets Management).
| Secret Key | Description | Used By |
|---|---|---|
| grafana-admin-user | Grafana admin username | Grafana |
| grafana-admin-password | Grafana admin password | Grafana |
| zoho-cliq-webhook-url | Zoho Cliq incoming webhook URL | Cliq Translator |
Infisical project: anshin-infrastructure-o-p-if | Environment: prod | Path: /monitoring
TLS Certificates
Wildcard cert *.mon.anshinhealth.net deployed to the monitoring namespace as:
- Secret: wildcard-mon-cert
- Issued by: ZeroSSL via acme.sh + Ansible (see Certificate Management)
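To spot-check the remaining lifetime of the deployed certificate from a workstation, a small standard-library script is enough. This is a sketch under assumptions: the helper names are hypothetical, the host must be reachable over the VPN, and the issuing CA must be trusted by the local system store.

```python
import socket
import ssl
import time


def days_until_expiry(not_after: str, now=None) -> float:
    """Convert a certificate notAfter string into days remaining."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, days_until_expiry(fetch_not_after("grafana.mon.anshinhealth.net")) reports the days left on the certificate served by Grafana.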
Deployment / Initial Setup
Prerequisites
- Infisical Operator deployed (see Infisical Secrets Management)
- TLS certificate for *.mon.anshinhealth.net issued and deployed
- DNS zone mon.anshinhealth.net configured in FreeIPA
- VPN access to cluster
Deployment Order
# 1. Deploy Infisical Operator (if not already done)
cd ../infisical
./deploy.sh
# 2. Create Infisical secrets in console:
# - grafana-admin-user
# - grafana-admin-password
# - zoho-cliq-webhook-url
# 3. Apply InfisicalSecret CR
kubectl apply -f ../infisical/infisical-secret-monitoring.yaml
# 4. Deploy certificates via Ansible
cd /path/to/ansible
ansible-playbook playbooks/certificates.yml -l k8s --tags monitoring
# 5. Deploy full monitoring stack
cd ../kubernetes/monitoring
./deploy.sh all
Adding New Monitored Endpoints
Edit helm-values/kube-prometheus-stack-values.yaml:
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'blackbox-internal-https'
        static_configs:
          - targets:
              # Add new endpoint here
              - https://new-service.mon.anshinhealth.net
Then upgrade:
./deploy.sh prometheus
Adding New Alert Rules
- Create a PrometheusRule manifest in the rules/ directory
- Apply it:
  kubectl apply -f rules/new-rule.yaml
Example rule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: my-service
      rules:
        - alert: MyServiceDown
          expr: up{job="my-service"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.job }} is down"
Upgrading kube-prometheus-stack
helm repo update
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values helm-values/kube-prometheus-stack-values.yaml \
  --wait
Troubleshooting
Alerts not reaching Zoho Cliq
- Check Cliq Translator logs:
  kubectl logs -n monitoring -l app=alertmanager-cliq-translator
- Verify the Infisical secret exists:
  kubectl get secret monitoring-infisical-secrets -n monitoring -o yaml
- Test the webhook manually:
  # Terminal 1: forward the service port (this command keeps running)
  kubectl port-forward -n monitoring svc/alertmanager-cliq-translator 5000:5000
  # Terminal 2: send a test alert
  curl -X POST http://localhost:5000/webhook \
    -H "Content-Type: application/json" \
    -d '{"status":"firing","alerts":[{"labels":{"alertname":"Test"},"annotations":{"summary":"Test alert"}}]}'
Prometheus not scraping targets
- Check that a ServiceMonitor or PodMonitor resource exists for the target
- Verify the selector labels match the service
- Check the Prometheus targets page: https://prometheus.mon.anshinhealth.net/targets
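The label check in the second step follows a simple rule: every key/value pair under the ServiceMonitor's spec.selector.matchLabels must be present on the Service. The sketch below illustrates that rule only; it ignores matchExpressions and namespace selection, and the function name and sample labels are hypothetical.

```python
def selector_matches(match_labels: dict, service_labels: dict) -> bool:
    """True when every matchLabels pair is present on the Service's labels."""
    return all(service_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical labels for illustration
service_labels = {"app": "auth", "tier": "backend"}
print(selector_matches({"app": "auth"}, service_labels))     # selector is a subset of the labels
print(selector_matches({"app": "auth-v2"}, service_labels))  # value differs, so no match
```

Extra labels on the Service do not prevent a match; a single missing or different value does.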
Grafana cannot connect to Prometheus
- Check the Prometheus datasource configuration in Grafana (Settings → Data Sources)
- Verify Prometheus service is running:
  kubectl get svc -n monitoring | grep prometheus
Check if a service is healthy
- Go to Karma and look for active alerts for the service
- Or: Grafana → Blackbox dashboard → check endpoint probe status
Document Control
| Rev | Date | Author | Description |
|---|---|---|---|
| 1.0 | 2026-01-24 | Marc Mercer | Initial release |