Kubernetes Best Practices for Production
Essential Kubernetes best practices for production environments, covering security, reliability, monitoring, and performance optimization
Introduction
Kubernetes has become the de facto standard for container orchestration, but running it in production requires careful planning, proper configuration, and adherence to best practices. This guide covers the essential practices for deploying and managing production Kubernetes clusters.
Security Best Practices
1. RBAC and Access Control
Implement Role-Based Access Control (RBAC) so that users and service accounts receive only the least-privilege access they need.
# ClusterRole for application developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: app-developer
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: app-developer-binding
subjects:
- kind: ServiceAccount
  name: app-developer
  namespace: production
roleRef:
  kind: ClusterRole
  name: app-developer
  apiGroup: rbac.authorization.k8s.io
---
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-developer
  namespace: production
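With the ClusterRole, binding, and ServiceAccount in place, it is worth verifying the effective permissions before handing the account to a team. A quick sanity check with kubectl (the expected answers assume the manifests above were applied unmodified):

# Impersonate the ServiceAccount and check what it is allowed to do
kubectl auth can-i create deployments \
  --as=system:serviceaccount:production:app-developer -n production   # expect "yes"
kubectl auth can-i delete nodes \
  --as=system:serviceaccount:production:app-developer                 # expect "no"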
2. Network Policies
Implement network policies to control pod-to-pod communication.
# Network policy for frontend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    ports:
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: backend
    ports:
    - protocol: TCP
      port: 8080
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# Network policy for backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
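These allow rules work best on top of a default-deny baseline, so that anything not explicitly permitted is blocked. A common complementary policy (not part of the original pair above) denies all traffic for every pod in the namespace:

# Default-deny all ingress and egress for pods in the production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}   # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress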
3. Pod Security Standards
Implement Pod Security Standards (PSS) for enhanced security.
# Pod Security Admission configuration
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Example of a restricted pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
        seccompProfile:
          type: RuntimeDefault   # required by the "restricted" profile
      containers:
      - name: app
        image: secure-app:latest
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: tmp
        emptyDir: {}
      - name: varlog
        emptyDir: {}
Resource Management
1. Resource Requests and Limits
Always define resource requests and limits for all containers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resource-managed-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resource-managed-app
  template:
    metadata:
      labels:
        app: resource-managed-app
    spec:
      containers:
      - name: app
        image: app:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
            ephemeral-storage: "1Gi"
          limits:
            memory: "512Mi"
            cpu: "500m"
            ephemeral-storage: "2Gi"
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
2. Horizontal Pod Autoscaling
Implement HPA for automatic scaling based on metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-managed-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
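Once applied, the autoscaler's live state is easy to inspect; the TARGETS column compares current utilization against the thresholds configured above:

# Watch current metrics, targets, and replica counts
kubectl get hpa app-hpa --watch
# Show scaling events and conditions
kubectl describe hpa app-hpa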
3. Vertical Pod Autoscaling
Use the Vertical Pod Autoscaler (VPA) to right-size requests automatically. Note that VPA is installed separately from the core control plane, and in "Auto" mode it should not manage the same CPU and memory metrics that an HPA is already scaling on.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-managed-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 500Mi
      controlledValues: RequestsAndLimits
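Assuming the VPA components are installed in the cluster, the recommendations computed for the target Deployment can be inspected directly (the jsonpath below reflects the VPA CRD's status layout):

# View lower-bound, target, and upper-bound recommendations per container
kubectl describe vpa app-vpa
kubectl get vpa app-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'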
Monitoring and Observability
1. Prometheus Monitoring
Set up comprehensive monitoring with Prometheus.
# Prometheus configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        environment: prod

    rule_files:
      - "k8s-rules.yml"
      - "app-rules.yml"

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
---
# Prometheus deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: storage
          mountPath: /prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--web.console.libraries=/etc/prometheus/console_libraries"
        - "--web.console.templates=/etc/prometheus/consoles"
        - "--storage.tsdb.retention.time=200h"
        - "--web.enable-lifecycle"
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: storage
        persistentVolumeClaim:
          claimName: prometheus-pvc
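The 'kubernetes-pods' job above only scrapes pods that opt in through annotations, so application pod templates need to carry them. A minimal sketch (the /metrics path and port 8080 are illustrative values, not part of the original manifests):

# Pod template metadata for an application that opts into scraping
template:
  metadata:
    labels:
      app: resource-managed-app
    annotations:
      prometheus.io/scrape: "true"    # matched by the "keep" relabel rule
      prometheus.io/path: "/metrics"  # copied into __metrics_path__
      prometheus.io/port: "8080"      # rewritten into __address__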
2. Grafana Dashboards
Create dashboards that surface cluster-level CPU, memory, and pod status at a glance.
# Grafana dashboard configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  kubernetes-cluster.json: |
    {
      "dashboard": {
        "title": "Kubernetes Cluster Overview",
        "panels": [
          {
            "title": "Cluster CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (pod)",
                "legendFormat": "{{pod}}"
              }
            ]
          },
          {
            "title": "Cluster Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(container_memory_usage_bytes{container!=\"\"}) by (pod)",
                "legendFormat": "{{pod}}"
              }
            ]
          },
          {
            "title": "Pod Status",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(kube_pod_status_phase) by (phase)",
                "legendFormat": "{{phase}}"
              }
            ]
          }
        ]
      }
    }
3. Alerting Rules
Set up comprehensive alerting for production issues.
# Alerting rules configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
data:
  k8s-rules.yml: |
    groups:
    - name: kubernetes.rules
      rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 5 minutes"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} is restarting {{ printf \"%.2f\" $value }} times per minute"
      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes"
Backup and Disaster Recovery
1. Velero Backup Configuration
Implement automated backups with Velero.
# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    - kube-system
    includedResources:
    - deployments
    - services
    - configmaps
    - secrets
    - persistentvolumeclaims
    - persistentvolumes
    storageLocation: default
    volumeSnapshotLocations:
    - default
    ttl: 720h # 30 days
---
# Velero backup location
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: k8s-backups
  config:
    region: us-west-2
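Backups are only useful if restores from them actually work, so exercise the restore path regularly. With the Velero CLI (the backup name below is illustrative; velero backup get lists the real ones created by the schedule):

# List backups created by the schedule, then restore the production namespace
velero backup get
velero restore create --from-backup daily-backup-20250101020000 \
  --include-namespaces production
velero restore get   # watch restore progress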
2. Etcd Backup
Take regular etcd snapshots so the cluster state can be recovered after a control-plane failure.
#!/bin/bash
# etcd backup script
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
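To verify a snapshot and rehearse a restore on a control-plane node (the snapshot filename and restore directory are illustrative; on kubeadm clusters the static etcd Pod manifest must then be pointed at the restored data directory):

#!/bin/bash
# Check snapshot integrity, then restore it into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20250101-020000.db --write-out=table
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20250101-020000.db \
  --data-dir /var/lib/etcd-restore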
Performance Optimization
1. Node Affinity and Anti-affinity
Use affinity rules to optimize pod placement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: optimized-app
  template:
    metadata:
      labels:
        app: optimized-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - optimized-app
              topologyKey: kubernetes.io/hostname
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/worker
                operator: Exists
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - high-performance
      containers:
      - name: app
        image: optimized-app:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
2. Resource Quotas
Implement resource quotas to prevent resource exhaustion.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    requests.ephemeral-storage: 10Gi
    limits.ephemeral-storage: 20Gi
    persistentvolumeclaims: "10"
    services: "20"
    services.loadbalancers: "2"
    services.nodeports: "10"
    count/deployments.apps: "10"
    count/replicasets.apps: "20"
    count/statefulsets.apps: "5"
    count/jobs.batch: "5"
    count/cronjobs.batch: "5"
---
# LimitRange for default resource limits
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - default:
      memory: 512Mi
      cpu: 500m
    defaultRequest:
      memory: 256Mi
      cpu: 250m
    type: Container
  # type: Pod supports only min/max (not defaults), so the per-pod entry caps totals
  - max:
      memory: 1Gi
      cpu: 1000m
    type: Pod
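Quota consumption should be reviewed regularly so teams see remaining headroom before deployments start failing admission. Current usage against the quota and the applied defaults can be checked with:

# Show used vs. hard limits for the namespace quota and the LimitRange defaults
kubectl describe resourcequota production-quota -n production
kubectl describe limitrange production-limits -n production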
Logging and Tracing
1. Centralized Logging
Implement centralized logging with Fluentd and Elasticsearch.
# Fluentd configuration for log aggregation
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
    </filter>

    <filter kubernetes.**>
      @type record_transformer
      enable_ruby true
      <record>
        log_level ${record['log'].match(/level=(\w+)/) ? $1 : 'info'}
        application ${record['kubernetes']['labels']['app']}
        namespace ${record['kubernetes']['namespace_name']}
        pod_name ${record['kubernetes']['pod_name']}
        container_name ${record['kubernetes']['container_name']}
      </record>
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch-master
      port 9200
      logstash_format true
      logstash_prefix k8s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
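This configuration assumes Fluentd runs as a DaemonSet on every node with the host's log directory mounted in. A minimal sketch of that DaemonSet follows; the image tag and the fluentd ServiceAccount are assumptions, and the ServiceAccount needs RBAC permissions to read pod metadata for the kubernetes_metadata filter:

# Fluentd DaemonSet mounting node logs and the ConfigMap above
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        volumeMounts:
        - name: varlog
          mountPath: /var/log        # contains the container logs tailed by <source>
        - name: config
          mountPath: /fluentd/etc    # where fluent.conf is loaded from
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: config
        configMap:
          name: fluentd-config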
2. Distributed Tracing
Implement distributed tracing with Jaeger.
# Jaeger configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686
        - containerPort: 14268
        env:
        - name: COLLECTOR_OTLP_ENABLED
          value: "true"
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch-master:9200"
        volumeMounts:
        - name: jaeger-storage
          mountPath: /tmp
      volumes:
      - name: jaeger-storage
        persistentVolumeClaim:
          claimName: jaeger-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  selector:
    app: jaeger
  ports:
  - name: http-query
    port: 16686
    targetPort: 16686
  - name: http-collector
    port: 14268
    targetPort: 14268
  type: ClusterIP
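Applications still need to be told where to send spans. With the Service above, classic Jaeger clients can post to the collector on port 14268; using the OTLP support enabled on the deployment additionally requires exposing the OTLP ports (4317 gRPC / 4318 HTTP) on the Service. An illustrative env block for an instrumented application container:

# Illustrative tracing configuration for an application container
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT       # OpenTelemetry SDKs (requires 4318 on the Service)
  value: "http://jaeger:4318"
- name: JAEGER_ENDPOINT                   # classic Jaeger clients, via the port exposed above
  value: "http://jaeger:14268/api/traces"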
Conclusion
Running Kubernetes in production requires a comprehensive approach that covers security, resource management, monitoring, backup, and performance optimization. By implementing these best practices, you can ensure your Kubernetes cluster is secure, reliable, and performant.
Remember that Kubernetes is a complex system, and production environments require ongoing maintenance, monitoring, and optimization. Regular reviews of your configuration and practices will help you maintain a robust and efficient production environment.
Key Takeaways
- Security First: Implement RBAC, network policies, and pod security standards
- Resource Management: Always define resource requests and limits, use HPA/VPA
- Monitoring: Comprehensive monitoring with Prometheus, Grafana, and alerting
- Backup: Automated backups with Velero and etcd snapshots
- Performance: Use affinity rules, resource quotas, and optimization techniques
- Observability: Centralized logging and distributed tracing
- Maintenance: Regular updates, security patches, and performance tuning
Start with these practices and adapt them to your specific requirements. A well-configured production Kubernetes cluster will provide the foundation for reliable, scalable applications.