Kubernetes, Alibaba Cloud, DevOps, IaC
15 min read

Complete Kubernetes Cluster Management on Alibaba Cloud with IaC

Learn how to build production-ready Kubernetes clusters on Alibaba Cloud using Infrastructure as Code, covering VPC setup, cluster deployment, monitoring, logging, ingress, and certificate management

kubernetes alibaba-cloud terraform monitoring logging ingress cert-manager devops iac

Introduction

Alibaba Cloud Container Service for Kubernetes (ACK) provides a managed Kubernetes service that simplifies cluster management while maintaining full control over your containerized applications. In this comprehensive guide, we’ll explore how to build production-ready Kubernetes clusters on Alibaba Cloud using Infrastructure as Code (IaC) principles.

We’ll cover the complete journey from VPC creation to production deployment, including essential components like monitoring, logging, ingress controllers, and certificate management.

Why Alibaba Cloud for Kubernetes?

Alibaba Cloud offers several advantages for Kubernetes deployments:

  • Global Infrastructure: 84 availability zones across 27 regions
  • Cost Optimization: Competitive pricing with flexible billing options
  • Security: Enterprise-grade security with compliance certifications
  • Integration: Native integration with Alibaba Cloud services
  • Performance: High-performance networking and storage options

Infrastructure as Code Architecture

Our IaC approach uses Terraform to manage the complete infrastructure stack: the VPC and networking layer, the ACK cluster and its node pools, and the supporting platform components (monitoring, logging, ingress, and certificate management).


Prerequisites

Before we begin, ensure you have the following tools installed:

# Install Terraform
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install terraform

# Install Alibaba Cloud CLI
curl -o aliyun-cli-linux-latest-amd64.tgz https://aliyuncli.alicdn.com/aliyun-cli-linux-latest-amd64.tgz
tar xzvf aliyun-cli-linux-latest-amd64.tgz
sudo mv aliyun /usr/local/bin/

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Step 1: VPC and Networking Infrastructure

Let’s start by creating the foundational networking infrastructure using Terraform:

# vpc.tf
terraform {
  required_providers {
    alicloud = {
      source  = "aliyun/alicloud"
      version = "~> 1.200"
    }
  }
  required_version = ">= 1.0"
}

provider "alicloud" {
  region = var.region
}

# VPC Configuration
resource "alicloud_vpc" "main" {
  vpc_name   = "${var.project_name}-vpc"
  cidr_block = var.vpc_cidr
  tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
  }
}

# VSwitches for different zones
resource "alicloud_vswitch" "private" {
  count             = length(var.availability_zones)
  vpc_id            = alicloud_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
  vswitch_name      = "${var.project_name}-private-${count.index + 1}"

  tags = {
    Environment = var.environment
    Project     = var.project_name
    Type        = "private"
  }
}

resource "alicloud_vswitch" "public" {
  count             = length(var.availability_zones)
  vpc_id            = alicloud_vpc.main.id
  cidr_block        = var.public_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]
  vswitch_name      = "${var.project_name}-public-${count.index + 1}"

  tags = {
    Environment = var.environment
    Project     = var.project_name
    Type        = "public"
  }
}

# NAT Gateway for private subnets
resource "alicloud_nat_gateway" "main" {
  vpc_id        = alicloud_vpc.main.id
  nat_gateway_name = "${var.project_name}-nat"
  vswitch_id    = alicloud_vswitch.public[0].id

  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}

# EIP for NAT Gateway
resource "alicloud_eip" "nat" {
  address_name = "${var.project_name}-nat-eip"
  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}

resource "alicloud_eip_association" "nat" {
  allocation_id = alicloud_eip_address.nat.id
  instance_id   = alicloud_nat_gateway.main.id
}

# Route tables
resource "alicloud_route_table" "private" {
  vpc_id           = alicloud_vpc.main.id
  route_table_name = "${var.project_name}-private-rt"

  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}

resource "alicloud_route_table_entry" "private_nat" {
  route_table_id = alicloud_route_table.private.id
  destination_cidrblock = "0.0.0.0/0"
  nexthop_type          = "NatGateway"
  nexthop_id            = alicloud_nat_gateway.main.id
}
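The private route table also needs to be associated with the private vswitches; otherwise the NAT route never applies to them. A minimal sketch of that association using the alicloud_route_table_attachment resource:

# Associate private vswitches with the private route table
resource "alicloud_route_table_attachment" "private" {
  count          = length(var.availability_zones)
  route_table_id = alicloud_route_table.private.id
  vswitch_id     = alicloud_vswitch.private[count.index].id
}
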
# variables.tf
variable "region" {
  description = "Alibaba Cloud region"
  type        = string
  default     = "cn-hangzhou"
}

variable "project_name" {
  description = "Project name for resource naming"
  type        = string
  default     = "k8s-cluster"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
  default     = ["cn-hangzhou-a", "cn-hangzhou-b", "cn-hangzhou-c"]
}

variable "private_subnet_cidrs" {
  description = "Private subnet CIDR blocks"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

variable "public_subnet_cidrs" {
  description = "Public subnet CIDR blocks"
  type        = list(string)
  default     = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}

Step 2: ACK Cluster Creation

Now let’s create the ACK cluster with proper security configurations:

# cluster.tf
# ACK Cluster
resource "alicloud_cs_managed_kubernetes" "main" {
  name                  = "${var.project_name}-cluster"
  version              = var.kubernetes_version
  vpc_id               = alicloud_vpc.main.id
  vswitch_ids          = alicloud_vswitch.private[*].id
  new_nat_gateway      = false
  nat_gateway_ids      = [alicloud_nat_gateway.main.id]

  # Security configurations
  pod_cidr             = var.pod_cidr
  service_cidr         = var.service_cidr
  node_cidr_mask       = var.node_cidr_mask

  # Cluster addons
  addons {
    name = "flannel"
  }

  # Note: Step 3.3 installs ingress-nginx with Helm; skip this addon if you
  # do not want two NGINX ingress controllers running in the cluster.
  addons {
    name = "nginx-ingress-controller"
  }

  addons {
    name = "alicloud-disk-controller"
  }

  # Logging configuration
  log_config {
    type = "SLS"
    project = alicloud_log_project.main.name
  }

  tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
  }
}

# Node Pool for worker nodes
resource "alicloud_cs_kubernetes_node_pool" "worker" {
  name                 = "${var.project_name}-worker-pool"
  cluster_id           = alicloud_cs_managed_kubernetes.main.id
  vswitch_ids          = alicloud_vswitch.private[*].id
  instance_types       = var.worker_instance_types
  system_disk_category = "cloud_essd"
  system_disk_size     = 100

  # Node configuration
  node_count = var.worker_node_count

  # Scaling configuration
  scaling_config {
    min_size = var.worker_min_size
    max_size = var.worker_max_size
  }

  # Security group
  security_group_ids = [alicloud_security_group.worker.id]

  tags = {
    Environment = var.environment
    Project     = var.project_name
    Role        = "worker"
  }
}

# Security Group for worker nodes
resource "alicloud_security_group" "worker" {
  name        = "${var.project_name}-worker-sg"
  vpc_id      = alicloud_vpc.main.id
  description = "Security group for Kubernetes worker nodes"

  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}

# Security group rules
resource "alicloud_security_group_rule" "worker_https" {
  type              = "ingress"
  ip_protocol       = "tcp"
  nic_type          = "intranet"
  policy            = "accept"
  port_range        = "443/443"
  priority          = 1
  security_group_id = alicloud_security_group.worker.id
  cidr_ip           = "0.0.0.0/0"
}

resource "alicloud_security_group_rule" "worker_http" {
  type              = "ingress"
  ip_protocol       = "tcp"
  nic_type          = "intranet"
  policy            = "accept"
  port_range        = "80/80"
  priority          = 1
  security_group_id = alicloud_security_group.worker.id
  cidr_ip           = "0.0.0.0/0"
}

# Log Project for cluster logging
resource "alicloud_log_project" "main" {
  name        = "${var.project_name}-logs"
  description = "Log project for Kubernetes cluster"

  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}
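The cluster and node pool above reference several variables (and a cluster_id output used later) that are not defined in variables.tf. A sketch of the missing definitions follows; the defaults are assumptions and should be adjusted to your region and workload:

# variables.tf (additions)
variable "kubernetes_version" {
  description = "ACK Kubernetes version"
  type        = string
  default     = "1.28.3-aliyun.1" # assumed; pick a version supported in your region
}

variable "pod_cidr" {
  description = "Pod network CIDR (must not overlap the VPC CIDR)"
  type        = string
  default     = "172.16.0.0/16"
}

variable "service_cidr" {
  description = "Service network CIDR"
  type        = string
  default     = "172.20.0.0/20"
}

variable "node_cidr_mask" {
  description = "Node CIDR mask size for pod addressing"
  type        = number
  default     = 26
}

variable "worker_instance_types" {
  description = "Instance types for worker nodes"
  type        = list(string)
  default     = ["ecs.g7.xlarge"] # assumed; choose types available in your zones
}

variable "worker_node_count" {
  description = "Initial number of worker nodes"
  type        = number
  default     = 3
}

variable "worker_min_size" {
  description = "Minimum node pool size"
  type        = number
  default     = 3
}

variable "worker_max_size" {
  description = "Maximum node pool size"
  type        = number
  default     = 10
}

# outputs.tf
output "cluster_id" {
  value = alicloud_cs_managed_kubernetes.main.id
}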

Step 3: Production-Ready Components Installation

3.1 Monitoring Stack (Prometheus + Grafana)

# monitoring-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi

    additionalScrapeConfigs:
      - job_name: "kubernetes-service-endpoints"
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true

grafana:
  adminPassword: "your-secure-password"
  persistence:
    enabled: true
    storageClassName: fast-ssd
    size: 10Gi

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "default"
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
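Both the monitoring and logging values reference a fast-ssd StorageClass, which does not exist on ACK out of the box. A minimal sketch of such a class backed by ESSD cloud disks via the Alibaba Cloud disk CSI plugin (provisioner and parameter names assume the CSI driver shipped with ACK):

# storageclass-fast-ssd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
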
# Install monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values monitoring-values.yaml

3.2 Logging Stack (EFK)

# logging-values.yaml
elasticsearch:
  replicas: 3
  resources:
    requests:
      memory: "2Gi"
      cpu: "500m"
    limits:
      memory: "4Gi"
      cpu: "1000m"

  volumeClaimTemplate:
    spec:
      storageClassName: fast-ssd
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

kibana:
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"

fluentd:
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"

  config:
    containers:
      input: |
        <source>
          @type tail
          path /var/log/containers/*.log
          pos_file /var/log/fluentd-containers.log.pos
          tag kubernetes.*
          read_from_head true
          <parse>
            @type json
            time_format %Y-%m-%dT%H:%M:%S.%NZ
          </parse>
        </source>
# Install logging stack
# Note: each chart reads its values at the top level, so split the file above
# into elasticsearch-values.yaml, kibana-values.yaml and fluentd-values.yaml
# (one section per file, with its keys promoted to the top level).
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

kubectl create namespace logging
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --values elasticsearch-values.yaml

helm install kibana elastic/kibana \
  --namespace logging \
  --values kibana-values.yaml

# Fluentd is not published in the elastic repo; use the fluent charts instead
helm install fluentd fluent/fluentd \
  --namespace logging \
  --values fluentd-values.yaml

3.3 Ingress Controller (NGINX)

# ingress-values.yaml
controller:
  replicaCount: 3
  resources:
    requests:
      memory: "256Mi"
      cpu: "100m"
    limits:
      memory: "512Mi"
      cpu: "200m"

  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: "PayBySpec"
      service.beta.kubernetes.io/alibaba-cloud-loadbalancer-spec: "slb.s1.small"

  config:
    use-proxy-protocol: "true"
    proxy-real-ip-cidr: "0.0.0.0/0"
    use-forwarded-headers: "true"

  admissionWebhooks:
    enabled: true
    patch:
      enabled: true
      image:
        tag: v1.8.1

defaultBackend:
  resources:
    requests:
      memory: "64Mi"
      cpu: "50m"
    limits:
      memory: "128Mi"
      cpu: "100m"
# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

kubectl create namespace ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --values ingress-values.yaml
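Once the chart is installed, the Alibaba Cloud cloud controller provisions an SLB instance for the LoadBalancer service; its address is what your DNS records should point to (the service name below follows the release name used above):

# Fetch the SLB address assigned to the ingress controller
kubectl get service ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'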

3.4 Certificate Manager

# cert-manager-values.yaml
installCRDs: true
replicaCount: 3

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"

prometheus:
  enabled: true
  servicemonitor:
    enabled: true

# Recursive nameservers used for DNS-01 self-checks
extraArgs:
  - --dns01-recursive-nameservers=8.8.8.8:53,1.1.1.1:53
# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update

kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --values cert-manager-values.yaml

# Create ClusterIssuer for Let's Encrypt
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF
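Let's Encrypt rate-limits its production endpoint, so it can be worth validating the setup against the staging environment first; a staging issuer differs only in the server URL and key secret name:

# Optional: ClusterIssuer for the Let's Encrypt staging environment
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx
EOF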

Step 4: Application Deployment Example

Let’s deploy a sample application to demonstrate the complete setup:

# sample-app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: sample-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app-service
  namespace: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app-ingress
  namespace: sample-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
    - hosts:
        - your-domain.com
      secretName: sample-app-tls
  rules:
    - host: your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app-service
                port:
                  number: 80
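Apply the manifest and watch the rollout; the certificate should move to Ready once cert-manager completes the HTTP-01 challenge:

# Deploy and verify the sample application
kubectl apply -f sample-app.yaml
kubectl get pods -n sample-app
kubectl get certificate -n sample-app
kubectl describe ingress sample-app-ingress -n sample-app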

Step 5: Security and Compliance

5.1 Network Policies

NetworkPolicy objects are only enforced when the cluster's network plugin supports them; on ACK that means Terway, so with the Flannel addon from Step 2 the policies below will not take effect.

# network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: sample-app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-traffic
  namespace: sample-app
spec:
  podSelector:
    matchLabels:
      app: sample-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 80
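Because the default-deny policy also blocks egress, pods in the namespace lose DNS resolution unless it is explicitly allowed. A sketch of a companion policy permitting lookups against CoreDNS in kube-system (selectors assume the standard kube-dns labelling):

# allow-dns-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: sample-app
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53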

5.2 RBAC Configuration

# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints", "nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring-role
subjects:
  - kind: ServiceAccount
    name: prometheus-kube-prometheus-prometheus
    namespace: monitoring
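You can verify the binding by impersonating the Prometheus service account:

# Should print "yes" for resources covered by the ClusterRole
kubectl auth can-i list pods \
  --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus
kubectl auth can-i watch ingresses.networking.k8s.io \
  --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus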

Step 6: Backup and Disaster Recovery

6.1 Velero for Backup

# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

# Credentials file for the Alibaba Cloud object storage plugin
cat > credentials-velero <<EOF
[default]
alicloud_access_key_id=your-access-key
alicloud_secret_access_key=your-secret-key
EOF

kubectl create namespace velero
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --set configuration.provider=alibabacloud \
  --set configuration.backupStorageLocation.name=default \
  --set configuration.backupStorageLocation.bucket=your-backup-bucket \
  --set configuration.volumeSnapshotLocation.name=default \
  --set configuration.volumeSnapshotLocation.config.region=cn-hangzhou \
  --set-file credentials.secretContents.cloud=./credentials-velero

6.2 Backup Schedule

# backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - sample-app
      - monitoring
      - logging
    ttl: "720h"
    storageLocation: default
    volumeSnapshotLocations:
      - default
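After applying the schedule, it is worth triggering an ad-hoc backup from it and rehearsing a restore rather than waiting for the nightly run:

# Apply the schedule and run a one-off backup from it
kubectl apply -f backup-schedule.yaml
velero backup create test-backup --from-schedule daily-backup
velero backup describe test-backup

# Rehearse a restore from that backup
velero restore create --from-backup test-backup
velero restore get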

Step 7: Monitoring and Alerting

7.1 Custom Grafana Dashboards

{
  "dashboard": {
    "title": "Kubernetes Cluster Overview",
    "panels": [
      {
        "title": "Node CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Pod Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (pod) (container_memory_usage_bytes)",
            "legendFormat": "{{pod}}"
          }
        ]
      }
    ]
  }
}
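With kube-prometheus-stack, dashboards like the one above can be provisioned declaratively: the Grafana sidecar picks up ConfigMaps carrying the grafana_dashboard label (label name per the chart's defaults; inline the panel definitions from the JSON above, shown empty here for brevity):

# dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  cluster-overview.json: |
    { "title": "Kubernetes Cluster Overview", "panels": [] }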

7.2 Alert Rules

# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: prometheus # matched by kube-prometheus-stack's default rule selector
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: HighNodeCPU
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "CPU usage is above 80% for 5 minutes"

        - alert: HighNodeMemory
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
            description: "Memory usage is above 85% for 5 minutes"

Step 8: Cost Optimization

8.1 Node Autoscaling

On ACK, node autoscaling is driven by the scaling_config block defined on the node pool in Step 2; once auto scaling is enabled for the cluster, ACK manages the cluster-autoscaler component for you. The flags below are the usual scale-down tuning knobs:

# cluster-autoscaler tuning flags
extraArgs:
  - --scale-down-delay-after-add=10m
  - --scale-down-unneeded-time=10m
  - --max-node-provision-time=15m

8.2 Resource Quotas

# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: sample-app
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "10"

Deployment Commands

Here’s the complete deployment sequence:

# 1. Initialize Terraform
terraform init
terraform plan
terraform apply

# 2. Configure kubectl (save the returned kubeconfig to ~/.kube/config)
aliyun cs DescribeClusterUserKubeconfig --ClusterId "$(terraform output -raw cluster_id)"

# 3. Install monitoring stack
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values monitoring-values.yaml

# 4. Install logging stack (plus the kibana and fluentd releases from Step 3.2)
kubectl create namespace logging
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --values elasticsearch-values.yaml

# 5. Install ingress controller
kubectl create namespace ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --values ingress-values.yaml

# 6. Install cert-manager
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --values cert-manager-values.yaml

# 7. Deploy sample application
kubectl apply -f sample-app.yaml

# 8. Apply security policies
kubectl apply -f network-policies.yaml
kubectl apply -f rbac.yaml

# 9. Setup backup
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --values velero-values.yaml

# 10. Apply resource quotas
kubectl apply -f resource-quotas.yaml

Monitoring and Maintenance

Regular Maintenance Tasks

  1. Security Updates: Regularly update Kubernetes and component versions
  2. Backup Verification: Test backup restoration procedures monthly
  3. Performance Monitoring: Monitor cluster performance and optimize resources
  4. Log Analysis: Review logs for security incidents and performance issues
  5. Certificate Renewal: Monitor certificate expiration dates

Useful Commands

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces

# Monitor resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Check ingress status
kubectl get ingress --all-namespaces

# Monitor certificates
kubectl get certificates --all-namespaces
kubectl get certificaterequests --all-namespaces

# Check backup status
velero backup get
velero schedule get

# Monitor cluster autoscaler
kubectl logs -n kube-system deployment/cluster-autoscaler

Conclusion

This comprehensive guide demonstrates how to build a production-ready Kubernetes cluster on Alibaba Cloud using Infrastructure as Code principles. The setup includes:

  • Infrastructure as Code: Complete Terraform configuration for VPC, networking, and cluster
  • Production Components: Monitoring, logging, ingress, and certificate management
  • Security: Network policies, RBAC, and security groups
  • Backup & Recovery: Automated backup with Velero
  • Cost Optimization: Autoscaling and resource quotas
  • Monitoring: Comprehensive monitoring and alerting

The architecture is designed to be scalable, secure, and maintainable, following DevOps best practices for production Kubernetes deployments.

Next Steps

  1. Customize: Adapt the configuration for your specific requirements
  2. Security: Implement additional security measures like pod security policies
  3. CI/CD: Set up automated deployment pipelines
  4. Testing: Implement comprehensive testing strategies
  5. Documentation: Create runbooks and operational procedures

Remember to regularly review and update your infrastructure as your requirements evolve and new best practices emerge in the Kubernetes ecosystem.


Youqing Han

DevOps Engineer
