Kubernetes, Monitoring, DevOps
12 min read

Complete Kubernetes Monitoring Guide for DevOps Engineers

Learn how to implement comprehensive monitoring and observability for Kubernetes clusters using Prometheus, Grafana, and modern DevOps practices

kubernetes prometheus grafana monitoring observability helm devops

Introduction

Kubernetes has become the de facto standard for container orchestration, but with great power comes great responsibility. Monitoring and observability are crucial for maintaining healthy, performant Kubernetes clusters in production environments.

In this comprehensive guide, we’ll explore the essential monitoring strategies, tools, and best practices that every DevOps engineer should implement for Kubernetes clusters.

Why Kubernetes Monitoring Matters

Before diving into implementation details, let’s understand why monitoring Kubernetes is critical:

  • Cluster Health: Monitor node status, pod health, and resource utilization
  • Application Performance: Track response times, error rates, and throughput
  • Resource Optimization: Identify underutilized or over-provisioned resources
  • Incident Response: Quick detection and resolution of issues
  • Capacity Planning: Data-driven decisions for scaling infrastructure

Monitoring Architecture Overview

A robust Kubernetes monitoring stack typically consists of:

# Example monitoring architecture
monitoring:
  data_collection:
    - node-exporter: system metrics
    - kube-state-metrics: Kubernetes objects
    - cadvisor: container metrics

  storage:
    - prometheus: time-series database
    - thanos: long-term storage

  visualization:
    - grafana: dashboards and visualization

  alerting:
    - alertmanager: alert routing and notifications

  logging:
    - fluentd: log aggregation
    - elasticsearch: log storage
    - kibana: log visualization

Installing Prometheus with Helm

Let’s start by deploying the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, Grafana, and the core exporters in a single release:

# Add the Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a namespace for monitoring
kubectl create namespace monitoring

# Install Prometheus with custom values
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml
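
Once the release is installed, it’s worth confirming the core components are running and reachable. A quick check, assuming the release name and namespace used above (exact service names vary with the release name and chart version):

# Verify the monitoring pods are up
kubectl get pods -n monitoring

# Port-forward Grafana locally (service name assumes the release is called "prometheus")
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Port-forward the Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090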

Custom Prometheus Configuration

Create a prometheus-values.yaml file with optimized settings:

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi

    additionalScrapeConfigs:
      - job_name: "kubernetes-service-endpoints"
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true

grafana:
  adminPassword: "secure-password-here" # prefer sourcing credentials from a Secret in real deployments
  persistence:
    enabled: true
    size: 10Gi

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "default"
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
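
The additionalScrapeConfigs job above keeps only targets whose Service carries the prometheus.io/scrape annotation. As a minimal sketch, a hypothetical application Service opting in to scraping might look like this:

# example-service.yaml (hypothetical Service opting in to scraping)
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true" # matched by the relabel rule above
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080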

Setting Up Node Exporter

Node Exporter collects system-level metrics (CPU, memory, disk, network) from each Kubernetes node. The kube-prometheus-stack chart already deploys it for you; the standalone DaemonSet below is useful if you manage it yourself:

# node-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.6.1
          ports:
            - containerPort: 9100
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
            - --web.listen-address=:9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
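
Apply the DaemonSet and spot-check that a node is actually exposing metrics; replace the placeholder pod name with one reported by kubectl:

# Deploy node-exporter and confirm one pod per node
kubectl apply -f node-exporter-daemonset.yaml
kubectl get daemonset node-exporter -n monitoring

# Fetch raw metrics from one pod (replace <pod-name> with a real pod)
kubectl port-forward -n monitoring <pod-name> 9100:9100
curl -s http://localhost:9100/metrics | head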

Kube-State-Metrics Configuration

Kube-state-metrics exposes metrics about Kubernetes objects (Deployments, Pods, Nodes, and so on). It is also included in kube-prometheus-stack; a standalone Deployment looks like this:

# kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1 # a single replica is enough; extra replicas export duplicate series
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics # needs RBAC to list/watch cluster objects
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.8.2
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
          resources:
            requests:
              memory: 128Mi
              cpu: 100m
            limits:
              memory: 256Mi
              cpu: 200m
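
The serviceAccountName above only works if kube-state-metrics is granted read access to the API server; without a ClusterRole binding it cannot list any objects. A trimmed-down sketch of the RBAC objects (the ClusterRole shipped with the project covers many more resources):

# kube-state-metrics-rbac.yaml (abbreviated sketch)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "namespaces", "services"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets", "replicasets"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitoring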

Essential Kubernetes Metrics

Cluster-Level Metrics

Monitor these key cluster metrics:

# Node count
count(kube_node_info)

# Pod count by namespace
count(kube_pod_info) by (namespace)

# Resource usage by node
kube_node_status_allocatable{resource="cpu"}
kube_node_status_allocatable{resource="memory"}

# Pod status distribution
sum(kube_pod_status_phase) by (phase)

Application-Level Metrics

Track application performance:

# HTTP request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Grafana Dashboard Setup

Create comprehensive dashboards for different stakeholders:

Cluster Overview Dashboard

{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "panels": [
      {
        "title": "Node Status",
        "type": "stat",
        "targets": [
          {
            "expr": "count(kube_node_info)",
            "legendFormat": "Total Nodes"
          }
        ]
      },
      {
        "title": "Pod Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "count(kube_pod_status_phase) by (phase)",
            "legendFormat": "{{phase}}"
          }
        ]
      }
    ]
  }
}

Application Performance Dashboard

{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "{{service}} - 5xx Errors"
          }
        ]
      }
    ]
  }
}
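
With kube-prometheus-stack, dashboard JSON like the above can be loaded automatically by the Grafana sidecar, which watches for ConfigMaps carrying a grafana_dashboard label. A minimal sketch, assuming the chart's default sidecar settings; paste the full dashboard JSON as the value:

# cluster-overview-dashboard.yaml (hypothetical name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # default label the Grafana dashboard sidecar watches for
data:
  cluster-overview.json: |-
    {
      "title": "Kubernetes Cluster Overview",
      "panels": []
    }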

Alerting Configuration

Set up critical alerts using Alertmanager:

# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "slack-notifications"

receivers:
  - name: "slack-notifications"
    slack_configs:
      - channel: "#kubernetes-alerts"
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "dev", "instance"]

Critical Alert Rules

# prometheus-rules.yaml
groups:
  - name: kubernetes.rules
    rules:
      - alert: HighPodRestartRate
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High pod restart rate detected"
          description: "Pod {{ $labels.pod }} is restarting frequently"

      - alert: NodeHighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on node"
          description: "Node {{ $labels.instance }} has high CPU usage"

      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} is stuck in CrashLoopBackOff"

Log Aggregation with Fluentd

Implement centralized logging:

# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: monitoring
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch-client.monitoring.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix k8s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
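
Once Fluentd is shipping logs, a quick way to confirm data is arriving is to list the indices it creates. The host below matches the Elasticsearch service referenced in the config, and the k8s- prefix comes from logstash_prefix:

# From a pod inside the cluster, check that k8s-* indices are being created
curl -s "http://elasticsearch-client.monitoring.svc.cluster.local:9200/_cat/indices/k8s-*?v"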

Performance Optimization Tips

Prometheus Optimization

# prometheus-optimization.yaml
prometheus:
  prometheusSpec:
    # Increase retention for better historical analysis
    retention: 30d

    # Optimize storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi

    # Configure scrape intervals
    scrapeInterval: 30s
    evaluationInterval: 30s

    # Resource limits
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 1000m

Grafana Optimization

# grafana-optimization.yaml
grafana:
  # Enable persistence
  persistence:
    enabled: true
    size: 20Gi

  # Core grafana.ini settings
  grafana.ini:
    server:
      root_url: https://grafana.yourdomain.com
    security:
      admin_user: admin
      admin_password: "secure-password"

    # Reduce query load from frequently refreshing dashboards
    dashboards:
      min_refresh_interval: 30s

    # Give slow data sources more time before the proxy times out
    dataproxy:
      timeout: 60

Monitoring Best Practices

1. Start Small, Scale Gradually

  • Begin with essential metrics (CPU, memory, pod status)
  • Add application-specific metrics incrementally
  • Validate each metric’s usefulness before scaling

2. Use Labels Wisely

# Good labeling strategy
metadata:
  labels:
    app: frontend
    tier: web
    environment: production
    team: platform
    version: v1.2.0

3. Implement SLOs and SLIs

Define Service Level Objectives:

# SLO definition
slo:
  name: "API Availability"
  target: 99.9%
  measurement:
    - name: "HTTP Success Rate"
      query: 'sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
      threshold: 0.999
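
To make the SLO actionable, the same ratio can be turned into a recording rule and an alert. A minimal sketch (the window and threshold are illustrative, not a full multi-window burn-rate setup):

# slo-rules.yaml (illustrative)
groups:
  - name: slo.rules
    rules:
      - record: job:http_request_success_ratio:rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: APIAvailabilityBelowSLO
        expr: job:http_request_success_ratio:rate5m < 0.999
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API availability is below the 99.9% objective"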

4. Regular Maintenance

  • Review and update alert thresholds monthly
  • Clean up unused metrics and dashboards
  • Update monitoring stack versions quarterly
  • Conduct monitoring effectiveness reviews

Troubleshooting Common Issues

Prometheus High Memory Usage

# prometheus-memory-optimization.yaml
prometheus:
  prometheusSpec:
    # Reduce retention if memory is an issue
    retention: 7d

    # Optimize storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi

    # Resource limits
    resources:
      limits:
        memory: 2Gi
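
High memory usage is usually driven by series cardinality rather than retention alone, so before cutting retention it helps to find which metrics own the most series. Two quick checks against the Prometheus expression browser and API:

# Top 10 metric names by series count (run in the Prometheus expression browser)
topk(10, count by (__name__)({__name__=~".+"}))

# TSDB statistics, including the heaviest metrics and labels
curl -s http://localhost:9090/api/v1/status/tsdb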

Grafana Performance Issues

# grafana-performance.yaml
grafana:
  # Increase resources
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 1Gi
      cpu: 500m

  # Throttle dashboard refresh to reduce query load
  grafana.ini:
    dashboards:
      min_refresh_interval: 30s

Conclusion

Implementing comprehensive Kubernetes monitoring requires careful planning and ongoing maintenance. Start with the basics and gradually build a robust monitoring ecosystem that provides visibility into your cluster’s health and performance.

Remember these key principles:

  1. Monitor what matters: Focus on business-critical metrics
  2. Automate everything: Use Infrastructure as Code for monitoring setup
  3. Document thoroughly: Maintain clear documentation for dashboards and alerts
  4. Test your monitoring: Regularly verify that alerts work and dashboards are accurate
  5. Iterate and improve: Continuously refine your monitoring strategy based on real-world usage

Next Steps

  • Set up the monitoring stack using the provided configurations
  • Create custom dashboards for your specific applications
  • Implement alerting for critical business metrics
  • Establish monitoring runbooks for your team
  • Plan regular monitoring reviews and improvements

This guide covers the fundamentals of Kubernetes monitoring. For advanced topics like distributed tracing, service mesh monitoring, or cost optimization, stay tuned for future articles.

Follow me on GitHub and LinkedIn for more DevOps insights and best practices.

Youqing Han

DevOps Engineer
