Monitoring Stack

Source: Marc Mercer (SRE Lead) — sre-iac repository, Rev 1.0, 2026-01-24

This document covers the Anshin Platform monitoring stack deployed in the monitoring namespace of the K3s cluster. All monitoring services are accessible only via VPN.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    monitoring namespace                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │  Prometheus  │───▶│ AlertManager │───▶│ Cliq Trans   │──────┼──▶ Zoho Cliq
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│         │                   │                                    │
│         │                   ▼                                    │
│         │            ┌──────────────┐                           │
│         │            │    Karma     │                           │
│         │            └──────────────┘                           │
│         ▼                                                        │
│  ┌──────────────┐    ┌──────────────┐                           │
│  │   Grafana    │    │   Blackbox   │                           │
│  └──────────────┘    │   Exporter   │                           │
│                      └──────────────┘                           │
└─────────────────────────────────────────────────────────────────┘

Access URLs

All URLs are VPN-only — not exposed to the public internet.

Service	URL	Purpose
Grafana	`https://grafana.mon.anshinhealth.net`	Dashboards & visualization
Prometheus	`https://prometheus.mon.anshinhealth.net`	Metrics storage & queries
AlertManager	`https://alertmanager.mon.anshinhealth.net`	Alert routing & silencing
Karma	`https://karma.mon.anshinhealth.net`	Alert dashboard (read-only view)
K8s Dashboard	`https://dashboard.mon.anshinhealth.net`	Kubernetes cluster overview

Components

kube-prometheus-stack

Helm chart (prometheus-community/kube-prometheus-stack) deploying the full Prometheus Operator ecosystem:

Component	Purpose
Prometheus Operator	Manages Prometheus and AlertManager lifecycle via CRDs
Prometheus	Metrics collection and storage
AlertManager	Alert deduplication, routing, and notification
Grafana	Dashboards and visualization
node-exporter	Host-level metrics (DaemonSet on all nodes)
kube-state-metrics	Kubernetes object state metrics (cluster-wide)

Values file: kubernetes/monitoring/helm-values/kube-prometheus-stack-values.yaml

Key configuration settings:

Setting	Value
`prometheus.prometheusSpec.retention`	`30d`
`prometheus.prometheusSpec.retentionSize`	`50GB`
`alertmanager.config.receivers`	Zoho Cliq webhook (via Cliq Translator)

Blackbox Exporter

HTTP/HTTPS/TCP endpoint monitoring. Enabled within the kube-prometheus-stack deployment.

Alert rules: rules/blackbox-alerts.yaml
Dashboard: dashboards/blackbox-dashboard-configmap.yaml
Scrape config: prometheus.prometheusSpec.additionalScrapeConfigs → blackbox-internal-https

Monitored endpoints:

Zone	Endpoints Checked
mon.anshinhealth.net	Monitoring services
dev.anshinhealth.net	Dev environment services
svcs.anshinhealth.net	Speech and shared services

Karma

AlertManager dashboard providing a read-only view of all active alerts.

Values file: karma/helm-values/karma-values.yaml
Connects to: http://alertmanager-operated.monitoring.svc.cluster.local:9093

Cliq Translator

Flask application that translates AlertManager webhook payloads to Zoho Cliq format.
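
To make the translation step concrete, here is a minimal sketch of the kind of mapping such a service might perform. It is not the code in cliq-translator/; the function name `alertmanager_to_cliq` and the message layout are hypothetical, and it assumes the Cliq incoming webhook accepts a JSON body with a single `text` field. The input shape follows the Alertmanager webhook payload (`status`, `alerts[].labels`, `alerts[].annotations`).

```python
import json


def alertmanager_to_cliq(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Cliq-style message body."""
    status = payload.get("status", "unknown").upper()
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "UnnamedAlert")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "(no summary)")
        lines.append(f"[{status}] {name} (severity: {severity}): {summary}")
    # Assumption: Cliq expects {"text": "..."}; adjust to the real webhook contract.
    return {"text": "\n".join(lines) or f"[{status}] no alerts in payload"}


# Same payload shape as the manual webhook test in Troubleshooting.
sample = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "Test"}, "annotations": {"summary": "Test alert"}}
    ],
}
print(json.dumps(alertmanager_to_cliq(sample)))
```

In a Flask service, a function like this would typically be called from the `/webhook` route handler, with the result posted to the URL stored in `zoho-cliq-webhook-url`.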

Deployment: cliq-translator/deployment.yaml

Build:

cd cliq-translator
docker build -t registry.anshinhealth.net/sre/iac/cliq-translator:v1 .
docker push registry.anshinhealth.net/sre/iac/cliq-translator:v1

Registry Path Fix Needed

Cliq Translator is currently paused pending a registry image path fix. See CURRENT-SERVICES.md for current status.

What's Monitored

Kubernetes Cluster Metrics

Pod health and restart counts
CPU and memory utilization (node and pod level)
Persistent volume usage
API server health
Control plane component availability

Anshin Application Services

Scraped from anshin-dev-svc and anshin-prod-svc namespaces:

anna, aqb, auth, dmdc, etl, facts, sdr, vrbu

HTTP Endpoint Probes

Blackbox Exporter probes all service endpoints for HTTP 2xx responses, TLS certificate validity, and response time.

Alerting

Alert Flow

Prometheus → AlertManager → Cliq Translator → Zoho Cliq
                  |
                  ▼
               Karma (real-time alert view)

Alert Severities

Severity	Description	Expected Response
Critical	Service down, data loss risk	Immediate action required
Warning	Degraded performance, approaching limits	Investigate within 1 hour
Info	Notable events, non-urgent	Review during business hours

Alert Destinations

Zoho Cliq #alerts channel — all Critical and Warning alerts
Karma Dashboard — real-time view of all active alerts

Silencing During Maintenance

Go to AlertManager UI (https://alertmanager.mon.anshinhealth.net)
Click Silence → New Silence
Add matchers (e.g., namespace=maintenance)
Set duration and comment
Click Create
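
The same silence can also be created through the Alertmanager v2 API (`POST /api/v2/silences`) instead of the UI. The sketch below only builds the request body; the helper name `build_silence` and the example values are hypothetical, and sending the request (for example with `curl` or `urllib.request`) is left out.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, start=None):
    """Build the JSON body for POST /api/v2/silences."""
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # One exact-match (non-regex) matcher per label, e.g. namespace=maintenance.
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


# Hypothetical example: a two-hour maintenance window.
silence = build_silence(
    {"namespace": "maintenance"},
    hours=2,
    created_by="sre-oncall",
    comment="Planned maintenance",
    start=datetime(2026, 1, 24, 12, 0, tzinfo=timezone.utc),
)
```

Posting this body as JSON to `https://alertmanager.mon.anshinhealth.net/api/v2/silences` should return the ID of the new silence, assuming the API is reachable over the VPN in the same way as the UI.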

Grafana Dashboards

Available Dashboards

Folder	Dashboard	Content
Anshin Platform	Service Health Overview	Health status across all services
Anshin Platform	API Latency	Request latency by service and endpoint
Anshin Platform	Error Rates	Error rates by service
Uptime Monitoring	Blackbox Exporter — Internal Endpoints	HTTP probe status
Uptime Monitoring	SSL Certificate Expiry	Days until cert expiry per domain
Uptime Monitoring	Response Time Trends	Historical response time trends
Kubernetes (built-in)	Cluster Resource Utilization	Node/pod CPU and memory
Kubernetes (built-in)	Node Health	Node status and conditions
Kubernetes (built-in)	Pod Status	Pod lifecycle and restart tracking

Creating Custom Dashboards

Log in to Grafana
Click + → New Dashboard
Add panels using Prometheus as the data source
Save to the appropriate folder

Backing Up Grafana Dashboards

Provisioned dashboards live in ConfigMaps. For custom dashboards, export before upgrading:

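# Assumes anonymous API access is disabled and GRAFANA_TOKEN holds a token with read access.
GRAFANA_URL="https://grafana.mon.anshinhealth.net"
# type=dash-db limits the search to dashboards, so folders are skipped.
for uid in $(curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -sf -H "Authorization: Bearer $GRAFANA_TOKEN" "$GRAFANA_URL/api/dashboards/uid/$uid" > "dashboard-$uid.json"
done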
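
Each exported file wraps the dashboard in a `meta`/`dashboard` envelope, so it cannot be posted back to Grafana unchanged. The following sketch shows one way to turn an export into a body for `POST /api/dashboards/db`. The helper name `to_import_payload` is hypothetical, and the use of `meta.folderUid` assumes a Grafana version that reports and accepts folder UIDs.

```python
def to_import_payload(export: dict, overwrite: bool = True) -> dict:
    """Turn a GET /api/dashboards/uid/<uid> response into an import body."""
    dashboard = dict(export["dashboard"])
    # A null id lets Grafana match the dashboard by uid instead of a
    # numeric id that may not exist on the target instance.
    dashboard["id"] = None
    payload = {"dashboard": dashboard, "overwrite": overwrite}
    folder_uid = export.get("meta", {}).get("folderUid")
    if folder_uid:
        payload["folderUid"] = folder_uid
    return payload


# Hypothetical export, trimmed to the fields used above.
example_export = {
    "meta": {"folderUid": "anshin-platform"},
    "dashboard": {"id": 42, "uid": "abc123", "title": "API Latency"},
}
restore_body = to_import_payload(example_export)
```

Keeping the original `uid` means existing links to the dashboard continue to work after a restore.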

Secrets Management

All secrets are managed via Infisical (see Infisical Secrets Management).

Secret Key	Description	Used By
`grafana-admin-user`	Grafana admin username	Grafana
`grafana-admin-password`	Grafana admin password	Grafana
`zoho-cliq-webhook-url`	Zoho Cliq incoming webhook URL	Cliq Translator

Infisical project: anshin-infrastructure-o-p-if | Environment: prod | Path: /monitoring

TLS Certificates

Wildcard cert *.mon.anshinhealth.net deployed to the monitoring namespace as:

Secret: wildcard-mon-cert
Issued by: ZeroSSL via acme.sh + Ansible (see Certificate Management)

Deployment / Initial Setup

Prerequisites

Infisical Operator deployed (see Infisical Secrets Management)
TLS certificate for *.mon.anshinhealth.net issued and deployed
DNS zone mon.anshinhealth.net configured in FreeIPA
VPN access to cluster

Deployment Order

# 1. Deploy Infisical Operator (if not already done)
cd ../infisical
./deploy.sh

# 2. Create Infisical secrets in console:
#    - grafana-admin-user
#    - grafana-admin-password
#    - zoho-cliq-webhook-url

# 3. Apply InfisicalSecret CR
kubectl apply -f ../infisical/infisical-secret-monitoring.yaml

# 4. Deploy certificates via Ansible
cd /path/to/ansible
ansible-playbook playbooks/certificates.yml -l k8s --tags monitoring

# 5. Deploy full monitoring stack
cd ../kubernetes/monitoring
./deploy.sh all

Adding New Monitored Endpoints

Edit helm-values/kube-prometheus-stack-values.yaml:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'blackbox-internal-https'
        static_configs:
          - targets:
              # Add new endpoint here
              - https://new-service.mon.anshinhealth.net

Then upgrade:

./deploy.sh prometheus
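
Before adding a target, a quick sanity check can catch typos. The sketch below assumes that targets for the `blackbox-internal-https` job should use HTTPS and sit under one of the zones listed in the Blackbox Exporter section; that policy and the helper name `check_target` are assumptions for illustration, not rules enforced by the stack.

```python
from urllib.parse import urlparse

# Zones taken from the "Monitored endpoints" table.
MONITORED_ZONES = (
    "mon.anshinhealth.net",
    "dev.anshinhealth.net",
    "svcs.anshinhealth.net",
)


def check_target(url: str) -> list:
    """Return a list of problems found with a proposed probe target."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("scheme is not https")
    host = parsed.hostname or ""
    if not any(host.endswith("." + zone) for zone in MONITORED_ZONES):
        problems.append("host is outside the monitored zones")
    return problems


print(check_target("https://new-service.mon.anshinhealth.net"))
```

An empty list means the target passed both checks.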

Adding New Alert Rules

Create a PrometheusRule manifest in the rules/ directory
Apply it:

kubectl apply -f rules/new-rule.yaml

Example rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: monitoring
spec:
  groups:
    - name: my-service
      rules:
        - alert: MyServiceDown
          expr: up{job="my-service"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.job }} is down"
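
When several services need the same "down" alert, the manifest can be generated rather than copied by hand. This is a minimal sketch that mirrors the example rule above; the helper name `prometheus_rule` and the naming scheme for the alert are hypothetical. It emits JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def prometheus_rule(service: str, namespace: str = "monitoring", wait: str = "5m") -> dict:
    """Build a PrometheusRule manifest with a single '<Service>Down' alert."""
    # "my-service" becomes "MyServiceDown".
    alert_name = "".join(part.capitalize() for part in service.split("-")) + "Down"
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{service}-alerts", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": service,
                    "rules": [
                        {
                            "alert": alert_name,
                            "expr": f'up{{job="{service}"}} == 0',
                            "for": wait,
                            "labels": {"severity": "critical"},
                            "annotations": {"summary": "{{ $labels.job }} is down"},
                        }
                    ],
                }
            ],
        },
    }


rule = prometheus_rule("my-service")
print(json.dumps(rule, indent=2))
```

If a generated rule is not picked up, it may be worth checking whether the Prometheus `ruleSelector` in the values file requires a specific label on PrometheusRule objects; whether it does in this deployment is not stated here.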

Upgrading kube-prometheus-stack

helm repo update
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values helm-values/kube-prometheus-stack-values.yaml \
  --wait

Troubleshooting

Alerts not reaching Zoho Cliq

Check Cliq Translator logs:

kubectl logs -n monitoring -l app=alertmanager-cliq-translator

Verify the Infisical secret exists:

kubectl get secret monitoring-infisical-secrets -n monitoring -o yaml

Test the webhook manually:

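# Terminal 1: forward the translator port (this command blocks until interrupted)
kubectl port-forward -n monitoring svc/alertmanager-cliq-translator 5000:5000

# Terminal 2: send a test alert to the forwarded port
curl -X POST http://localhost:5000/webhook \
  -H "Content-Type: application/json" \
  -d '{"status":"firing","alerts":[{"labels":{"alertname":"Test"},"annotations":{"summary":"Test alert"}}]}'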

Prometheus not scraping targets

Check that a ServiceMonitor or PodMonitor resource exists for the target
Verify the selector labels match the service
Check the Prometheus targets page: https://prometheus.mon.anshinhealth.net/targets

Grafana cannot connect to Prometheus

Check the Prometheus datasource configuration in Grafana (Settings → Data Sources)

Verify Prometheus service is running:

kubectl get svc -n monitoring | grep prometheus

Check if a service is healthy

Go to Karma → look for active alerts for the service
Or: Grafana → Blackbox dashboard → check endpoint probe status

Document Control

Rev	Date	Author	Description
1.0	2026-01-24	Marc Mercer	Initial release
