Kubernetes Best Practices for Production

Essential Kubernetes best practices for production environments, covering security, reliability, monitoring, and performance optimization

Introduction

Kubernetes has become the de facto standard for container orchestration in production environments. However, running Kubernetes in production requires careful planning, proper configuration, and adherence to best practices. This comprehensive guide covers essential practices for deploying and managing Kubernetes clusters in production environments.

Security Best Practices

1. RBAC and Access Control

Implement Role-Based Access Control (RBAC) so that every user and workload gets only the permissions it needs.

# ClusterRole for application developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: app-developer
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

---
# RoleBinding scoped to the production namespace
# (a ClusterRoleBinding would grant these permissions across every namespace)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-developer-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: app-developer
    namespace: production
roleRef:
  kind: ClusterRole
  name: app-developer
  apiGroup: rbac.authorization.k8s.io

---
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-developer
  namespace: production
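
You can spot-check the effective permissions by impersonating the service account with kubectl auth can-i, which exercises the RBAC rules without deploying anything:

# Should be allowed by the RoleBinding above
kubectl auth can-i list pods \
  --as=system:serviceaccount:production:app-developer -n production

# Should be denied: nodes are cluster-scoped and not granted above
kubectl auth can-i list nodes \
  --as=system:serviceaccount:production:app-developer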

2. Network Policies

Implement network policies to control pod-to-pod communication.

# Network policy for frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 80
        - protocol: TCP
          port: 443
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - protocol: TCP
          port: 8080
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

---
# Network policy for backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
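
Network policies are additive allow-lists: pods matched by no policy remain wide open. A useful companion is a default-deny policy for the whole namespace, which the frontend and backend policies above then selectively open up. A minimal sketch:

# Deny all ingress and egress for pods not otherwise allowed
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF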

3. Pod Security Standards

Enforce Pod Security Standards (PSS) through the built-in Pod Security Admission controller to keep privileged workloads out of sensitive namespaces.

# Pod Security Admission configuration
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Example of a restricted pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault # required by the restricted profile
      containers:
        - name: app
          image: secure-app:latest
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 1000
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: varlog
              mountPath: /var/log
      volumes:
        - name: tmp
          emptyDir: {}
        - name: varlog
          emptyDir: {}
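
Before enforcing the restricted level on an existing namespace, you can ask the API server to evaluate current workloads without changing anything, using a server-side dry run:

# Reports which existing pods would violate the restricted profile
kubectl label --dry-run=server --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted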

Resource Management

1. Resource Requests and Limits

Always define resource requests and limits for all containers.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: resource-managed-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resource-managed-app
  template:
    metadata:
      labels:
        app: resource-managed-app
    spec:
      containers:
        - name: app
          image: app:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
              ephemeral-storage: "1Gi"
            limits:
              memory: "512Mi"
              cpu: "500m"
              ephemeral-storage: "2Gi"
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /startup
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30
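
With requests in place, it is worth periodically comparing them against real consumption so they stay honest. Assuming metrics-server is installed:

# Live CPU/memory usage for the pods above, to compare against their requests
kubectl top pods -l app=resource-managed-app

# How much of a node's allocatable capacity is already reserved by requests
# (substitute an actual node name)
kubectl describe node <node-name> | grep -A 8 "Allocated resources"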

2. Horizontal Pod Autoscaling

Implement HPA for automatic scaling based on metrics.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-managed-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
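
HPA relies on the metrics API, so metrics-server (or an adapter for custom metrics) must be running. You can watch scaling decisions as load changes:

# Current vs. target utilization and replica counts, refreshed live
kubectl get hpa app-hpa --watch

# Scaling events and any errors fetching metrics
kubectl describe hpa app-hpa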

3. Vertical Pod Autoscaling

Use VPA to right-size requests automatically. Note that VPA is installed separately from core Kubernetes, that Auto mode evicts pods to apply new values, and that it should not manage the same CPU/memory metrics an HPA is already scaling on; for a deployment like the one above, either pick one autoscaler or run VPA in "Off" mode for recommendations only.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-managed-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
        controlledValues: RequestsAndLimits
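
Once the VPA recommender has observed the workload for a while, its suggested requests appear in the object's status; checking them is a good first step before trusting Auto mode:

# Shows target and bound recommendations per container
kubectl describe vpa app-vpa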

Monitoring and Observability

1. Prometheus Monitoring

Set up comprehensive monitoring with Prometheus.

# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        environment: prod

    rule_files:
      - "k8s-rules.yml"
      - "app-rules.yml"

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

---
# Prometheus deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--web.console.libraries=/etc/prometheus/console_libraries"
            - "--web.console.templates=/etc/prometheus/consoles"
            - "--storage.tsdb.retention.time=200h"
            - "--web.enable-lifecycle"
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
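
Two operational conveniences follow from this setup: promtool (shipped with Prometheus) can validate the configuration before it is loaded, and --web.enable-lifecycle allows a reload without restarting the pod:

# Validate the config locally (referenced rule files must sit next to it)
promtool check config prometheus.yml

# Trigger a live reload (e.g. after kubectl port-forward deploy/prometheus 9090:9090)
curl -X POST http://localhost:9090/-/reload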

2. Grafana Dashboards

Create comprehensive dashboards for monitoring.

# Grafana dashboard configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  kubernetes-cluster.json: |
    {
      "dashboard": {
        "title": "Kubernetes Cluster Overview",
        "panels": [
          {
            "title": "Cluster CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod)",
                "legendFormat": "{{pod}}"
              }
            ]
          },
          {
            "title": "Cluster Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(container_memory_usage_bytes{container!=\"\"}) by (pod)",
                "legendFormat": "{{pod}}"
              }
            ]
          },
          {
            "title": "Pod Status",
            "type": "stat",
            "targets": [
              {
                "expr": "count(kube_pod_status_phase) by (phase)",
                "legendFormat": "{{phase}}"
              }
            ]
          }
        ]
      }
    }
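
Grafana does not read dashboard ConfigMaps on its own; they have to land in a provisioned directory. A minimal sketch of a file provider, assuming the ConfigMap above is mounted at /var/lib/grafana/dashboards (both paths here are assumptions, not part of the manifests above):

# Provisioning file telling Grafana to scan a directory for JSON dashboards
cat <<'EOF' > /etc/grafana/provisioning/dashboards/kubernetes.yaml
apiVersion: 1
providers:
  - name: kubernetes-dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards
EOF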

3. Alerting Rules

Set up comprehensive alerting for production issues.

# Alerting rules configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  k8s-rules.yml: |
    groups:
    - name: kubernetes.rules
      rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 5 minutes"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} is restarting {{ printf \"%.2f\" $value }} times per minute"

      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes"
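
As with the scrape configuration, rules can be validated before they reach Prometheus, and individual expressions can be tested against a running server:

# Catch syntax errors in the rule file before deploying it
promtool check rules k8s-rules.yml

# Evaluate an alert expression ad hoc (e.g. via a port-forward to Prometheus)
promtool query instant http://localhost:9090 'up == 0'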

Backup and Disaster Recovery

1. Velero Backup Configuration

Implement automated backups with Velero.

# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    includedNamespaces:
      - production
      - kube-system
    includedResources:
      - deployments
      - services
      - configmaps
      - secrets
      - persistentvolumeclaims
      - persistentvolumes
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h # 30 days

---
# Velero backup location
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: k8s-backups
  config:
    region: us-west-2
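
The velero CLI drives day-to-day operations against these resources. A few representative commands, assuming the CLI is installed and pointed at this cluster:

# One-off backup of the production namespace
velero backup create production-adhoc --include-namespaces production

# List backups and inspect one in detail
velero backup get
velero backup describe production-adhoc

# Restore from a completed backup
velero restore create --from-backup production-adhoc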

2. Etcd Backup

Ensure etcd backups for cluster state recovery.

#!/bin/bash
# etcd backup script
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
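
A snapshot is only useful if it restores, so verify it and rehearse the restore path (on etcd 3.5+ the etcdutl tool is the preferred replacement for these two subcommands):

# Check that the snapshot is readable (substitute the actual file)
ETCDCTL_API=3 etcdctl snapshot status /backup/<snapshot-file>.db --write-out=table

# Restore into a fresh data directory, then point etcd at it
ETCDCTL_API=3 etcdctl snapshot restore /backup/<snapshot-file>.db \
  --data-dir=/var/lib/etcd-restore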

Performance Optimization

1. Node Affinity and Anti-affinity

Use affinity rules to optimize pod placement.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: optimized-app
  template:
    metadata:
      labels:
        app: optimized-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - optimized-app
                topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - high-performance
      containers:
        - name: app
          image: optimized-app:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
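
Because the anti-affinity rule is only preferred, it is worth confirming how the scheduler actually spread the replicas:

# The NODE column shows whether replicas landed on distinct hosts
kubectl get pods -l app=optimized-app -o wide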

2. Resource Quotas

Implement resource quotas to prevent resource exhaustion.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    requests.ephemeral-storage: 10Gi
    limits.ephemeral-storage: 20Gi
    persistentvolumeclaims: "10"
    services: "20"
    services.loadbalancers: "2"
    services.nodeports: "10"
    count/deployments.apps: "10"
    count/replicasets.apps: "20"
    count/statefulsets.apps: "5"
    count/jobs.batch: "5"
    count/cronjobs.batch: "5"

---
# LimitRange for default resource limits
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
    - default:
        memory: 512Mi
        cpu: 500m
      defaultRequest:
        memory: 256Mi
        cpu: 250m
      type: Container
    - max:
        memory: 1Gi
        cpu: "1"
      min:
        memory: 128Mi
        cpu: 100m
      type: Pod # Pod-scoped limits support min/max only; defaults apply per container
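
Quota consumption is visible at any time, which helps when a deployment is mysteriously stuck at fewer replicas than requested:

# Current usage vs. the hard limits defined above
kubectl describe resourcequota production-quota -n production

# Defaults and bounds applied to new pods in the namespace
kubectl describe limitrange production-limits -n production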

Logging and Tracing

1. Centralized Logging

Implement centralized logging with Fluentd and Elasticsearch. The configuration below parses Docker's JSON log format; nodes running containerd or CRI-O emit a different format and need Fluentd's CRI parser instead.

# Fluentd configuration for log aggregation
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
    </filter>

    <filter kubernetes.**>
      @type record_transformer
      enable_ruby true
      <record>
        log_level ${record['log'].to_s.match(/level=(\w+)/) ? $1 : 'info'}
        application ${record.dig('kubernetes', 'labels', 'app')}
        namespace ${record.dig('kubernetes', 'namespace_name')}
        pod_name ${record.dig('kubernetes', 'pod_name')}
        container_name ${record.dig('kubernetes', 'container_name')}
      </record>
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch-master
      port 9200
      logstash_format true
      logstash_prefix k8s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
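
Fluentd can check a configuration without starting the pipeline, which catches typos in plugin names and malformed blocks early (the path below is an assumption about where the ConfigMap is mounted):

# Parse and validate the config, then exit
fluentd --dry-run -c /etc/fluentd/fluent.conf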

2. Distributed Tracing

Implement distributed tracing with Jaeger.

# Jaeger configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686 # web UI
            - containerPort: 14268 # Jaeger HTTP collector
            - containerPort: 4317 # OTLP gRPC (enabled via COLLECTOR_OTLP_ENABLED below)
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: "elasticsearch"
            - name: ES_SERVER_URLS
              value: "http://elasticsearch-master:9200"
          volumeMounts:
            - name: jaeger-storage
              mountPath: /tmp
      volumes:
        - name: jaeger-storage
          persistentVolumeClaim:
            claimName: jaeger-pvc

---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  selector:
    app: jaeger
  ports:
    - name: http-query
      port: 16686
      targetPort: 16686
    - name: http-collector
      port: 14268
      targetPort: 14268
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
  type: ClusterIP
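
With OTLP enabled on the collector, applications only need to be pointed at the service. A minimal sketch using the standard OpenTelemetry environment variable, assuming Jaeger runs in the production namespace alongside the earlier secure-app deployment:

# Point an instrumented app's OTel SDK at the Jaeger OTLP gRPC port
kubectl set env deployment/secure-app -n production \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger.production.svc:4317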

Conclusion

Running Kubernetes in production requires a comprehensive approach that covers security, resource management, monitoring, backup, and performance optimization. By implementing these best practices, you can ensure your Kubernetes cluster is secure, reliable, and performant.

Remember that Kubernetes is a complex system, and production environments require ongoing maintenance, monitoring, and optimization. Regular reviews of your configuration and practices will help you maintain a robust and efficient production environment.

Key Takeaways

  • Security First: Implement RBAC, network policies, and pod security standards
  • Resource Management: Always define resource requests and limits, use HPA/VPA
  • Monitoring: Comprehensive monitoring with Prometheus, Grafana, and alerting
  • Backup: Automated backups with Velero and etcd snapshots
  • Performance: Use affinity rules, resource quotas, and optimization techniques
  • Observability: Centralized logging and distributed tracing
  • Maintenance: Regular updates, security patches, and performance tuning

Start with these practices and adapt them to your specific requirements. A well-configured production Kubernetes cluster will provide the foundation for reliable, scalable applications.

Youqing Han

DevOps Engineer
