Complete Kubernetes Cluster Management on Alibaba Cloud with IaC
Learn how to build production-ready Kubernetes clusters on Alibaba Cloud using Infrastructure as Code, covering VPC setup, cluster deployment, monitoring, logging, ingress, and certificate management
Introduction
Alibaba Cloud Container Service for Kubernetes (ACK) provides a managed Kubernetes service that simplifies cluster management while maintaining full control over your containerized applications. In this comprehensive guide, we’ll explore how to build production-ready Kubernetes clusters on Alibaba Cloud using Infrastructure as Code (IaC) principles.
We’ll cover the complete journey from VPC creation to production deployment, including essential components like monitoring, logging, ingress controllers, and certificate management.
Why Alibaba Cloud for Kubernetes?
Alibaba Cloud offers several advantages for Kubernetes deployments:
- Global Infrastructure: 84 availability zones across 27 regions
- Cost Optimization: Competitive pricing with flexible billing options
- Security: Enterprise-grade security with compliance certifications
- Integration: Native integration with Alibaba Cloud services
- Performance: High-performance networking and storage options
Infrastructure as Code Architecture
Our IaC approach uses Terraform to manage the complete infrastructure stack: the VPC and networking layer, the managed ACK cluster with its node pools, and supporting resources such as the NAT gateway, security groups, and the SLS log project.
Prerequisites
Before we begin, ensure you have the following tools installed:
# Install Terraform
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs)"
sudo apt-get update && sudo apt-get install terraform
# Install Alibaba Cloud CLI
curl -o aliyun-cli-linux-latest-amd64.tgz https://aliyuncli.alicdn.com/aliyun-cli-linux-latest-amd64.tgz
tar xzvf aliyun-cli-linux-latest-amd64.tgz
sudo mv aliyun /usr/local/bin/
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Install Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
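Before moving on, it is worth confirming each tool is on the PATH and configuring the Alibaba Cloud CLI with an AccessKey that has permission to create the resources below:
# Verify the tooling and configure credentials
terraform version
aliyun version
kubectl version --client
helm version
aliyun configure   # prompts for AccessKey ID, AccessKey Secret, and default region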
Step 1: VPC and Networking Infrastructure
Let’s start by creating the foundational networking infrastructure using Terraform:
# vpc.tf
terraform {
required_providers {
alicloud = {
source = "aliyun/alicloud"
version = "~> 1.200"
}
}
required_version = ">= 1.0"
}
provider "alicloud" {
region = var.region
}
# VPC Configuration
resource "alicloud_vpc" "main" {
vpc_name = "${var.project_name}-vpc"
cidr_block = var.vpc_cidr
tags = {
Environment = var.environment
Project = var.project_name
ManagedBy = "terraform"
}
}
# VSwitches for different zones
resource "alicloud_vswitch" "private" {
count = length(var.availability_zones)
vpc_id = alicloud_vpc.main.id
cidr_block = var.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
vswitch_name = "${var.project_name}-private-${count.index + 1}"
tags = {
Environment = var.environment
Project = var.project_name
Type = "private"
}
}
resource "alicloud_vswitch" "public" {
count = length(var.availability_zones)
vpc_id = alicloud_vpc.main.id
cidr_block = var.public_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
vswitch_name = "${var.project_name}-public-${count.index + 1}"
tags = {
Environment = var.environment
Project = var.project_name
Type = "public"
}
}
# NAT Gateway for private subnets
resource "alicloud_nat_gateway" "main" {
  vpc_id           = alicloud_vpc.main.id
  nat_gateway_name = "${var.project_name}-nat"
  vswitch_id       = alicloud_vswitch.public[0].id
  nat_type         = "Enhanced" # new NAT gateways must be the enhanced type
  tags = {
    Environment = var.environment
    Project     = var.project_name
  }
}
# EIP for NAT Gateway
resource "alicloud_eip" "nat" {
address_name = "${var.project_name}-nat-eip"
tags = {
Environment = var.environment
Project = var.project_name
}
}
resource "alicloud_eip_association" "nat" {
allocation_id = alicloud_eip.nat.id
instance_id = alicloud_nat_gateway.main.id
}
# Route tables
resource "alicloud_route_table" "private" {
vpc_id = alicloud_vpc.main.id
route_table_name = "${var.project_name}-private-rt"
tags = {
Environment = var.environment
Project = var.project_name
}
}
resource "alicloud_route_table_entry" "private_nat" {
route_table_id = alicloud_route_table.private.id
destination_cidrblock = "0.0.0.0/0"
nexthop_type = "NatGateway"
nexthop_id = alicloud_nat_gateway.main.id
}
# variables.tf
variable "region" {
description = "Alibaba Cloud region"
type = string
default = "cn-hangzhou"
}
variable "project_name" {
description = "Project name for resource naming"
type = string
default = "k8s-cluster"
}
variable "environment" {
description = "Environment name"
type = string
default = "production"
}
variable "vpc_cidr" {
description = "VPC CIDR block"
type = string
default = "10.0.0.0/16"
}
variable "availability_zones" {
description = "Availability zones"
type = list(string)
default = ["cn-hangzhou-a", "cn-hangzhou-b", "cn-hangzhou-c"]
}
variable "private_subnet_cidrs" {
description = "Private subnet CIDR blocks"
type = list(string)
default = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}
variable "public_subnet_cidrs" {
description = "Public subnet CIDR blocks"
type = list(string)
default = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}
Step 2: ACK Cluster Creation
Now let’s create the ACK cluster with proper security configurations:
# cluster.tf
# ACK Cluster
resource "alicloud_cs_managed_kubernetes" "main" {
name = "${var.project_name}-cluster"
version = var.kubernetes_version
vpc_id = alicloud_vpc.main.id
vswitch_ids = alicloud_vswitch.private[*].id
new_nat_gateway = false
nat_gateway_ids = [alicloud_nat_gateway.main.id]
# Security configurations
pod_cidr = var.pod_cidr
service_cidr = var.service_cidr
node_cidr_mask = var.node_cidr_mask
# Cluster addons
addons {
name = "flannel"
}
addons {
name = "nginx-ingress-controller"
}
addons {
name = "alicloud-disk-controller"
}
# Logging configuration
log_config {
type = "SLS"
project = alicloud_log_project.main.name
}
tags = {
Environment = var.environment
Project = var.project_name
ManagedBy = "terraform"
}
}
# Node Pool for worker nodes
resource "alicloud_cs_kubernetes_node_pool" "worker" {
name = "${var.project_name}-worker-pool"
cluster_id = alicloud_cs_managed_kubernetes.main.id
vswitch_ids = alicloud_vswitch.private[*].id
instance_types = var.worker_instance_types
system_disk_category = "cloud_essd"
system_disk_size = 100
# Node configuration
node_count = var.worker_node_count
# Scaling configuration
scaling_config {
min_size = var.worker_min_size
max_size = var.worker_max_size
}
# Security group
security_group_ids = [alicloud_security_group.worker.id]
tags = {
Environment = var.environment
Project = var.project_name
Role = "worker"
}
}
# Security Group for worker nodes
resource "alicloud_security_group" "worker" {
name = "${var.project_name}-worker-sg"
vpc_id = alicloud_vpc.main.id
description = "Security group for Kubernetes worker nodes"
tags = {
Environment = var.environment
Project = var.project_name
}
}
# Security group rules
resource "alicloud_security_group_rule" "worker_https" {
type = "ingress"
ip_protocol = "tcp"
nic_type = "intranet"
policy = "accept"
port_range = "443/443"
priority = 1
security_group_id = alicloud_security_group.worker.id
cidr_ip = "0.0.0.0/0"
}
resource "alicloud_security_group_rule" "worker_http" {
type = "ingress"
ip_protocol = "tcp"
nic_type = "intranet"
policy = "accept"
port_range = "80/80"
priority = 1
security_group_id = alicloud_security_group.worker.id
cidr_ip = "0.0.0.0/0"
}
# Log Project for cluster logging
resource "alicloud_log_project" "main" {
name = "${var.project_name}-logs"
description = "Log project for Kubernetes cluster"
tags = {
Environment = var.environment
Project = var.project_name
}
}
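cluster.tf references several variables that Step 1 never declared, and the deployment commands later read a cluster_id Terraform output. A hedged sketch of both follows; the version string and instance type are placeholders, so check the ACK console for the Kubernetes versions and ECS instance types available in your region:
# variables-cluster.tf
variable "kubernetes_version" {
  description = "ACK Kubernetes version (verify availability in your region)"
  type        = string
  default     = "1.28.3-aliyun.1"
}
variable "pod_cidr" {
  type    = string
  default = "172.20.0.0/16"
}
variable "service_cidr" {
  type    = string
  default = "172.21.0.0/20"
}
variable "node_cidr_mask" {
  type    = number
  default = 26
}
variable "worker_instance_types" {
  type    = list(string)
  default = ["ecs.g6.xlarge"]
}
variable "worker_node_count" {
  type    = number
  default = 3
}
variable "worker_min_size" {
  type    = number
  default = 3
}
variable "worker_max_size" {
  type    = number
  default = 10
}
# outputs.tf
output "cluster_id" {
  description = "ID of the ACK cluster, used later when fetching the kubeconfig"
  value       = alicloud_cs_managed_kubernetes.main.id
}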
Step 3: Production-Ready Components Installation
3.1 Monitoring Stack (Prometheus + Grafana)
# monitoring-values.yaml
prometheus:
prometheusSpec:
retention: 15d
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
additionalScrapeConfigs:
- job_name: "kubernetes-service-endpoints"
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
grafana:
adminPassword: "your-secure-password"
persistence:
enabled: true
storageClassName: fast-ssd
size: 10Gi
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: "default"
orgId: 1
folder: ""
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/default
alertmanager:
alertmanagerSpec:
storage:
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
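The values above reference a fast-ssd StorageClass, which ACK does not create by default. Below is a minimal sketch backed by ESSD cloud disks; it assumes the Alibaba Cloud disk CSI driver (provisioner diskplugin.csi.alibabacloud.com) is running in the cluster. Apply it before installing the chart so the PersistentVolumeClaims can bind.
# fast-ssd-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer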
# Install monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values monitoring-values.yaml
3.2 Logging Stack (EFK)
# logging-values.yaml
elasticsearch:
replicas: 3
resources:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "1000m"
volumeClaimTemplate:
spec:
storageClassName: fast-ssd
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
kibana:
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "500m"
fluentd:
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "200m"
config:
containers:
input: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
# Install logging stack (fluentd is installed from the fluent helm repo, as the Elastic repo does not ship a fluentd chart)
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
kubectl create namespace logging
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --values logging-values.yaml
helm install kibana elastic/kibana \
  --namespace logging \
  --values logging-values.yaml
helm install fluentd fluent/fluentd \
  --namespace logging \
  --values logging-values.yaml
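A quick check that the stack came up, assuming the default service name generated by the elastic/kibana chart:
# Verify the logging stack
kubectl get pods -n logging
kubectl port-forward -n logging svc/kibana-kibana 5601:5601   # then open http://localhost:5601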
3.3 Ingress Controller (NGINX)
# ingress-values.yaml
controller:
replicaCount: 3
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "200m"
service:
type: LoadBalancer
annotations:
service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: "PayBySpec"
service.beta.kubernetes.io/alibaba-cloud-loadbalancer-spec: "slb.s1.small"
config:
use-proxy-protocol: "true"
proxy-real-ip-cidr: "0.0.0.0/0"
use-forwarded-headers: "true"
admissionWebhooks:
enabled: true
patch:
enabled: true
image:
tag: v1.8.1
defaultBackend:
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "100m"
# Install NGINX Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
kubectl create namespace ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--values ingress-values.yaml
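Once the controller is running, Alibaba Cloud provisions an SLB for its LoadBalancer service; grab its address and point your DNS records at it:
# Get the SLB address created for the ingress controller
kubectl get svc -n ingress-nginx ingress-nginx-controller
# Create DNS A records for your domains pointing at the EXTERNAL-IP shown above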
3.4 Certificate Manager
# cert-manager-values.yaml
installCRDs: true
replicaCount: 3
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
prometheus:
enabled: true
servicemonitor:
enabled: true
# ClusterIssuer for Let's Encrypt
extraArgs:
- --dns01-recursive-nameservers=8.8.8.8:53,1.1.1.1:53
# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--values cert-manager-values.yaml
# Create ClusterIssuer for Let's Encrypt
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
    email: your-email@example.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
EOF
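Let's Encrypt enforces strict rate limits on the production endpoint, so it is worth validating the whole chain against the staging endpoint first with a parallel issuer (same structure, different ACME server):
# Optional: staging ClusterIssuer for testing without hitting rate limits
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: your-email@example.com
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx
EOF
kubectl get clusterissuers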
Step 4: Application Deployment Example
Let’s deploy a sample application to demonstrate the complete setup:
# sample-app.yaml
apiVersion: v1
kind: Namespace
metadata:
name: sample-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: sample-app
namespace: sample-app
spec:
replicas: 3
selector:
matchLabels:
app: sample-app
template:
metadata:
labels:
app: sample-app
spec:
containers:
- name: app
image: nginx:alpine
ports:
- containerPort: 80
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "100m"
livenessProbe:
httpGet:
path: /
port: 80
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 80
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: sample-app-service
namespace: sample-app
spec:
selector:
app: sample-app
ports:
- port: 80
targetPort: 80
protocol: TCP
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: sample-app-ingress
namespace: sample-app
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
tls:
- hosts:
- your-domain.com
secretName: sample-app-tls
rules:
- host: your-domain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: sample-app-service
port:
number: 80
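After applying the manifest, confirm the pods are healthy and cert-manager has issued the TLS certificate:
# Verify the sample application
kubectl get pods -n sample-app
kubectl get ingress -n sample-app
kubectl get certificate -n sample-app
curl -I https://your-domain.com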
Step 5: Security and Compliance
5.1 Network Policies
# network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: sample-app
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-app-traffic
namespace: sample-app
spec:
podSelector:
matchLabels:
app: sample-app
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 80
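Two caveats about the policies above: NetworkPolicies are only enforced when the cluster CNI supports them (Terway does on ACK; plain Flannel does not), and because default-deny-all also blocks egress, pods in the namespace lose DNS resolution. A companion policy that re-allows DNS, assuming CoreDNS/kube-dns pods labelled k8s-app: kube-dns on port 53, looks like this:
# allow-dns-egress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: sample-app
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53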
5.2 RBAC Configuration
# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints", "nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: monitoring-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: monitoring-role
subjects:
- kind: ServiceAccount
name: prometheus-kube-prometheus-prometheus
namespace: monitoring
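The ServiceAccount name above matches what kube-prometheus-stack generates for a release named prometheus; you can confirm the binding works with kubectl's impersonation check:
# Verify the Prometheus service account can list pods cluster-wide
kubectl auth can-i list pods \
  --as=system:serviceaccount:monitoring:prometheus-kube-prometheus-prometheus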
Step 6: Backup and Disaster Recovery
6.1 Velero for Backup
# Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
kubectl create namespace velero
# Store the Velero credentials in a local file, e.g. ./credentials-velero:
#   [default]
#   alicloud_access_key_id=your-access-key
#   alicloud_secret_access_key=your-secret-key
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --set configuration.provider=alibabacloud \
  --set configuration.backupStorageLocation.name=default \
  --set configuration.backupStorageLocation.bucket=your-backup-bucket \
  --set configuration.volumeSnapshotLocation.name=default \
  --set configuration.volumeSnapshotLocation.config.region=cn-hangzhou \
  --set-file credentials.secretContents.cloud=./credentials-velero
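With Velero installed, run an on-demand backup and restore against a non-critical namespace to prove the setup end to end:
# Smoke-test backup and restore
velero backup create test-backup --include-namespaces sample-app
velero backup describe test-backup
velero restore create --from-backup test-backup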
6.2 Backup Schedule
# backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *"
template:
includedNamespaces:
- sample-app
- monitoring
- logging
ttl: "720h"
storageLocation: default
volumeSnapshotLocations:
- default
Step 7: Monitoring and Alerting
7.1 Custom Grafana Dashboards
{
"dashboard": {
"title": "Kubernetes Cluster Overview",
"panels": [
{
"title": "Node CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
]
},
{
"title": "Pod Memory Usage",
"type": "graph",
"targets": [
{
"expr": "sum by (pod) (container_memory_usage_bytes)",
"legendFormat": "{{pod}}"
}
]
}
]
}
}
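To provision this dashboard declaratively, the kube-prometheus-stack Grafana sidecar (enabled by default) picks up any ConfigMap in the cluster labelled grafana_dashboard; a sketch wrapping the JSON above:
# dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  cluster-overview.json: |
    { ...paste the dashboard JSON from above here... }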
7.2 Alert Rules
# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: kubernetes-alerts
namespace: monitoring
spec:
groups:
- name: kubernetes.rules
rules:
- alert: HighNodeCPU
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% for 5 minutes"
- alert: HighNodeMemory
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 85% for 5 minutes"
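Confirm the operator picked up the rule and that it shows as active in the Prometheus UI (the service name below assumes the release name prometheus used earlier):
# Verify the alert rules are loaded
kubectl get prometheusrules -n monitoring
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090
# then browse http://localhost:9090/alerts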
Step 8: Cost Optimization
8.1 Node Autoscaling
# cluster-autoscaler-values.yaml
# Note: the node pool above already enables ACK-managed autoscaling via
# scaling_config; these values are only needed if you run the upstream
# cluster-autoscaler chart yourself.
autoDiscovery:
  clusterName: your-cluster-id
rbac:
  serviceAccount:
    create: true
extraArgs:
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  max-node-provision-time: 15m
8.2 Resource Quotas
# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: sample-app
spec:
hard:
requests.cpu: "4"
requests.memory: 8Gi
limits.cpu: "8"
limits.memory: 16Gi
pods: "10"
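A ResourceQuota on requests and limits rejects any pod that omits them, so pair it with a LimitRange that injects sensible defaults (the values below are illustrative):
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: sample-app
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 50m
      memory: 64Mi
    default:
      cpu: 200m
      memory: 256Mi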
Deployment Commands
Here’s the complete deployment sequence:
# 1. Initialize Terraform
terraform init
terraform plan
terraform apply
# 2. Configure kubectl
aliyun cs DescribeClusterUserKubeconfig --ClusterId $(terraform output -raw cluster_id)
# merge the returned kubeconfig into ~/.kube/config
# 3. Install monitoring stack
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values monitoring-values.yaml
# 4. Install logging stack
kubectl create namespace logging
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--values logging-values.yaml
# 5. Install ingress controller
kubectl create namespace ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--values ingress-values.yaml
# 6. Install cert-manager
kubectl create namespace cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--values cert-manager-values.yaml
# 7. Deploy sample application
kubectl apply -f sample-app.yaml
# 8. Apply security policies
kubectl apply -f network-policies.yaml
kubectl apply -f rbac.yaml
# 9. Setup backup
helm install velero vmware-tanzu/velero \
--namespace velero \
--values velero-values.yaml
# 10. Apply resource quotas
kubectl apply -f resource-quotas.yaml
Monitoring and Maintenance
Regular Maintenance Tasks
- Security Updates: Regularly update Kubernetes and component versions
- Backup Verification: Test backup restoration procedures monthly
- Performance Monitoring: Monitor cluster performance and optimize resources
- Log Analysis: Review logs for security incidents and performance issues
- Certificate Renewal: Monitor certificate expiration dates
Useful Commands
# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces
# Monitor resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Check ingress status
kubectl get ingress --all-namespaces
# Monitor certificates
kubectl get certificates --all-namespaces
kubectl get certificaterequests --all-namespaces
# Check backup status
velero backup get
velero schedule get
# Monitor cluster autoscaler
kubectl logs -n kube-system deployment/cluster-autoscaler
Conclusion
This comprehensive guide demonstrates how to build a production-ready Kubernetes cluster on Alibaba Cloud using Infrastructure as Code principles. The setup includes:
- Infrastructure as Code: Complete Terraform configuration for VPC, networking, and cluster
- Production Components: Monitoring, logging, ingress, and certificate management
- Security: Network policies, RBAC, and security groups
- Backup & Recovery: Automated backup with Velero
- Cost Optimization: Autoscaling and resource quotas
- Monitoring: Comprehensive monitoring and alerting
The architecture is designed to be scalable, secure, and maintainable, following DevOps best practices for production Kubernetes deployments.
Next Steps
- Customize: Adapt the configuration for your specific requirements
- Security: Implement additional security measures such as Pod Security Admission (the replacement for the removed PodSecurityPolicy)
- CI/CD: Set up automated deployment pipelines
- Testing: Implement comprehensive testing strategies
- Documentation: Create runbooks and operational procedures
Remember to regularly review and update your infrastructure as your requirements evolve and new best practices emerge in the Kubernetes ecosystem.