September 15, 2025

Kubernetes Probes Explained: Liveness, Readiness, and Startup Probes

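An in-depth look at the three Kubernetes probe types (Liveness, Readiness, Startup): how they differ, when to use each, and what happens when a check succeeds or fails, with best practices and practical configuration examples.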

kubernetes devops container-orchestration production monitoring health-checks

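Kubernetes probes are the mechanism the kubelet uses to check container health. Configuring Liveness, Readiness, and Startup probes correctly is essential for stable, reliable production workloads. This article examines how the three probes differ, where each one applies, and what Kubernetes does when a check succeeds or fails.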

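Probe Overview

Kubernetes provides three types of probes for monitoring container health:

Liveness Probe: checks whether the container is still healthy or needs to be restarted
Readiness Probe: checks whether the container is ready to receive traffic
Startup Probe: checks whether the application inside the container has finished starting

The three probes are configured independently per container and can be combined to suit the application.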

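Liveness Probe in Detail

Definition and Purpose

A Liveness Probe checks whether a container is still working. If the container process is running but the application has deadlocked or stopped responding, the Liveness Probe detects it and triggers a container restart.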

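Behavior on Failure

When a Liveness Probe fails failureThreshold times in a row:

The kubelet kills the container
Whether the container is restarted depends on the Pod's restartPolicy
With restartPolicy Always or OnFailure, the container is restarted in the same Pod
With restartPolicy Never, the container is not restarted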

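When to Use It

The application can deadlock and only a restart recovers it
The application can become unresponsive because of a memory leak or a similar fault
The application must recover automatically so that it stays available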

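Configuration Example

apiVersion: v1
kind: Pod
metadata:
  name: liveness-example
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30  # start probing 30 seconds after the container starts
      periodSeconds: 10        # probe every 10 seconds
      timeoutSeconds: 5        # each probe times out after 5 seconds
      successThreshold: 1      # 1 success marks the probe as passing (must be 1 for liveness)
      failureThreshold: 3      # 3 consecutive failures mark the probe as failed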
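
The timing fields above determine how long a hung application can go unnoticed. The following Go sketch estimates that window under a simplifying assumption: probes fire on a fixed period and every failing probe runs into its full timeout. `ProbeTiming` and `WorstCaseDetectionSeconds` are names invented for this example, and the result is an approximation, not a guarantee from the kubelet.

```go
package main

import "fmt"

// ProbeTiming mirrors the timing fields of a probe, in seconds.
type ProbeTiming struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	TimeoutSeconds      int
	FailureThreshold    int
}

// WorstCaseDetectionSeconds estimates the longest time between the application
// hanging and the probe being declared failed: up to one period until the next
// probe fires, (FailureThreshold-1) further periods, and one final timeout.
func (p ProbeTiming) WorstCaseDetectionSeconds() int {
	return p.FailureThreshold*p.PeriodSeconds + p.TimeoutSeconds
}

func main() {
	liveness := ProbeTiming{
		InitialDelaySeconds: 30,
		PeriodSeconds:       10,
		TimeoutSeconds:      5,
		FailureThreshold:    3,
	}
	fmt.Printf("a hang is acted on within about %d seconds\n",
		liveness.WorstCaseDetectionSeconds())
}
```

With the values from the example (period 10, timeout 5, threshold 3) the estimate is 35 seconds, which is the number to weigh against how long the service can tolerate a stuck container.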

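Readiness Probe in Detail

Definition and Purpose

A Readiness Probe checks whether a container is ready to receive traffic. A Pod is listed as a ready endpoint of a Service, and therefore receives traffic through it, only while its Readiness Probe is succeeding.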

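Behavior on Failure

When a Readiness Probe fails:

The Pod's Ready condition becomes False
The Pod is removed from the Service's endpoints
The container is not restarted
Traffic is no longer routed to the Pod
A Deployment does not create a replacement Pod: the Pod still counts toward the replica count and stays out of rotation until the probe succeeds again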

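When to Use It

The application needs a long time to load configuration or data at startup
The application must connect to external dependencies (database, cache, and so on) before it can serve requests
The application may be temporarily unable to handle requests at runtime (for example, while running a maintenance task)
Traffic must be handled gracefully so that requests are never sent to a Pod that is not ready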
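
The maintenance case can be handled by letting the application take itself out of rotation. Below is a minimal Go sketch of such a switch; `ReadinessGate` and its methods are hypothetical names for this example, and the sketch assumes Go 1.19 or newer for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// ReadinessGate lets the application switch itself in and out of rotation.
// The zero value reports "not ready".
type ReadinessGate struct {
	ready atomic.Bool
}

// SetReady flips the state reported to the readiness probe.
func (g *ReadinessGate) SetReady(v bool) { g.ready.Store(v) }

// Status is the HTTP status code the readiness endpoint reports.
func (g *ReadinessGate) Status() int {
	if g.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP makes the gate usable directly as the /ready handler.
func (g *ReadinessGate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(g.Status())
}

func main() {
	gate := &ReadinessGate{}
	http.Handle("/ready", gate) // registered only; no server is started here

	gate.SetReady(true) // e.g. after configuration and caches are loaded
	fmt.Println(gate.Status())

	gate.SetReady(false) // e.g. just before a maintenance task
	fmt.Println(gate.Status())
}
```

Because a failing Readiness Probe never restarts the container, flipping the gate back to ready is all it takes to rejoin the Service endpoints.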

Configuration Example

apiVersion: v1
kind: Pod
metadata:
  name: readiness-example
spec:
  containers:
  - name: app
    image: myapp:latest
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 3

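Startup Probe in Detail

Definition and Purpose

A Startup Probe checks whether the application in a container has finished starting. It exists mainly for slow-starting applications: it keeps the Liveness and Readiness Probes from reporting false failures while the application is still starting up.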

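Behavior on Failure

When a Startup Probe fails:

A single failure does not restart the container; the kubelet keeps probing every periodSeconds
If the probe has not succeeded after failureThreshold consecutive failures (roughly failureThreshold * periodSeconds), the container is killed and restarted according to the Pod's restartPolicy
Liveness and Readiness Probes do not run until the Startup Probe has succeeded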

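When to Use It

The application starts slowly, and its startup time is too long or too variable to cover with a fixed initialDelaySeconds
False failures during startup must be avoided
The application runs initialization tasks at startup (loading data, opening connections, and so on)
It is combined with Liveness and Readiness Probes for more precise health checking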
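
On the application side, a startup endpoint only has to report whether one-time initialization has finished. The Go sketch below uses a one-way latch for that; `StartupLatch` and `MarkStarted` are names made up for this example, and the `/startup` path is taken from the configuration examples in this article.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// StartupLatch flips from "starting" to "started" exactly once.
type StartupLatch struct {
	once sync.Once
	done chan struct{}
}

// NewStartupLatch returns a latch in the "starting" state.
func NewStartupLatch() *StartupLatch {
	return &StartupLatch{done: make(chan struct{})}
}

// MarkStarted records that initialization finished. Extra calls are harmless.
func (s *StartupLatch) MarkStarted() {
	s.once.Do(func() { close(s.done) })
}

// Status returns 200 once MarkStarted has been called, 503 before that.
func (s *StartupLatch) Status() int {
	select {
	case <-s.done:
		return http.StatusOK
	default:
		return http.StatusServiceUnavailable
	}
}

// ServeHTTP makes the latch usable directly as the /startup handler.
func (s *StartupLatch) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(s.Status())
}

func main() {
	latch := NewStartupLatch()
	http.Handle("/startup", latch) // registered only; no server is started here

	fmt.Println(latch.Status()) // still initializing
	latch.MarkStarted()         // call this after data is loaded and connections are open
	fmt.Println(latch.Status()) // startup complete
}
```

A latch rather than a toggle is a deliberate choice here: once the Startup Probe has succeeded it is never run again for that container, so there is no reason for the endpoint to go back to "starting".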

Configuration Example

apiVersion: v1
kind: Pod
metadata:
  name: startup-example
spec:
  containers:
  - name: app
    image: myapp:latest
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 30  # wait up to 5 minutes (30 * 10 seconds)
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0  # starts as soon as the Startup Probe succeeds
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 0  # starts as soon as the Startup Probe succeeds
      periodSeconds: 5
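
The startup budget in this example is failureThreshold * periodSeconds = 30 * 10 = 300 seconds. Going the other way, the Go sketch below derives the smallest failureThreshold that covers a measured worst-case startup time. `MinFailureThreshold` is a helper invented for this example; it ignores initialDelaySeconds and adds no safety margin.

```go
package main

import "fmt"

// MinFailureThreshold returns the smallest failureThreshold for which
// failureThreshold * periodSeconds covers the slowest observed startup.
// It returns 0 when periodSeconds is not a positive number.
func MinFailureThreshold(slowestStartupSeconds, periodSeconds int) int {
	if periodSeconds <= 0 {
		return 0
	}
	// Ceiling division: round up to a whole number of probe periods.
	n := (slowestStartupSeconds + periodSeconds - 1) / periodSeconds
	if n < 1 {
		n = 1 // failureThreshold must be at least 1
	}
	return n
}

func main() {
	// An application that has been seen to take up to 240 seconds to start,
	// probed every 10 seconds.
	fmt.Println(MinFailureThreshold(240, 10))
}
```

In practice some headroom on top of the computed value is sensible, since startup time tends to grow with data volume and node load.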

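Comparison of the Three Probes

Feature Comparison

Feature	Liveness Probe	Readiness Probe	Startup Probe
Purpose	Check whether the container is alive	Check whether the container is ready	Check whether startup has finished
On failure	Container is restarted	Pod is removed from Service endpoints	Keeps retrying; container is restarted once failureThreshold is reached
Effect on traffic	None directly; the Pod is not ready while the container restarts	Traffic stops once the Pod is marked not ready	Pod receives no traffic until the Startup Probe has succeeded
When it runs	Continuously	Continuously	Only during startup
Typical use	Detecting deadlocks and hangs	Waiting on dependencies, maintenance windows	Slow-starting applications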

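Execution Order

Container starts
Startup Probe begins running (if configured)
- Success: move on to the next step
- Failure: keep probing until it succeeds or failureThreshold is reached
After the Startup Probe succeeds, the Liveness and Readiness Probes begin running
Liveness Probe keeps checking whether the container is alive
Readiness Probe keeps checking whether the container is ready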
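
The ordering above amounts to a small gate: while a Startup Probe is configured and has not yet succeeded, nothing else runs and the container is not ready. The Go sketch below models that rule only; `ContainerState` and its methods are names chosen for this example and do not correspond to kubelet code.

```go
package main

import "fmt"

// ContainerState tracks whether the startup probe has succeeded yet.
type ContainerState struct {
	HasStartupProbe bool
	Started         bool // set once the startup probe has succeeded
}

// ActiveProbes lists which probes run in the current state.
func (c ContainerState) ActiveProbes() []string {
	if c.HasStartupProbe && !c.Started {
		return []string{"startup"}
	}
	return []string{"liveness", "readiness"}
}

// AcceptsTraffic reports whether the container counts as ready, given the
// latest readiness result. A container that is still starting is never ready.
func (c ContainerState) AcceptsTraffic(readinessOK bool) bool {
	if c.HasStartupProbe && !c.Started {
		return false
	}
	return readinessOK
}

func main() {
	c := ContainerState{HasStartupProbe: true}
	fmt.Println(c.ActiveProbes(), c.AcceptsTraffic(true)) // during startup

	c.Started = true
	fmt.Println(c.ActiveProbes(), c.AcceptsTraffic(true)) // after startup succeeds
}
```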

Scenario Analysis

Scenario 1: Web Application (Standard Configuration)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: nginx:latest  # assumes the nginx config serves /healthz and /ready; the stock image returns 404 for both
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5

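Analysis:

The Liveness Probe checks that the application is still working and restarts the container on failure
The Readiness Probe ensures that only ready Pods receive traffic
This setup suits most web applications

Scenario 2: Slow-Starting Application (Using a Startup Probe)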

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-start-app
spec:
  selector:
    matchLabels:
      app: slow-start-app
  template:
    metadata:
      labels:
        app: slow-start-app
    spec:
      containers:
      - name: app
        image: slow-app:latest
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 30  # wait up to 5 minutes
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5

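Analysis:

The Startup Probe covers the long startup time
False failures during startup are avoided
Liveness and Readiness Probes begin only after the Startup Probe succeeds

Scenario 3: Database Connection Dependency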

apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-dependent-app
spec:
  selector:
    matchLabels:
      app: db-dependent-app
  template:
    metadata:
      labels:
        app: db-dependent-app
    spec:
      containers:
      - name: app
        image: app:latest
        env:
        - name: DB_HOST
          value: "postgres-service"
        readinessProbe:
          exec:
            # assumes pg_isready (PostgreSQL client tools) is installed in the application image
            command:
            - /bin/sh
            - -c
            - "pg_isready -h $DB_HOST || exit 1"
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

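Analysis:

The Readiness Probe checks the database connection
The Pod receives traffic only once the database is reachable
The Liveness Probe checks the application's own health
Trade-off: every replica depends on the same database, so a database outage marks all Pods not ready at once

Scenario 4: Batch Job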

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      containers:
      - name: processor
        image: batch-processor:latest
        # Batch jobs usually do not need a Readiness Probe
        # because they do not receive external traffic
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ps aux | grep -v grep | grep processor || exit 1"
          initialDelaySeconds: 60
          periodSeconds: 30
      restartPolicy: OnFailure

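Analysis:

Batch jobs do not need a Readiness Probe (they receive no traffic)
The Liveness Probe checks that the job's process exists; it cannot tell whether the process is making progress
This suits long-running batch jobs

Behavior After Success and Failure

Liveness Probe Behavior Flow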

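Key behaviors:

✅ Success: the container keeps running and the failure count is reset
❌ Failure: the failure count increases; the container is restarted once it reaches failureThreshold
🔄 After a restart: probing starts over from the beginning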
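
The counting rule behind these behaviors can be sketched in a few lines of Go: failures only matter when they are consecutive, and one success resets the count. `Tracker` is a simplified model written for this article, not kubelet code, and it starts in the passing state as an assumption.

```go
package main

import "fmt"

// Tracker applies failureThreshold and successThreshold to a stream of
// individual probe results.
type Tracker struct {
	FailureThreshold int
	SuccessThreshold int
	passing          bool
	successes        int // consecutive successes
	failures         int // consecutive failures
}

// NewTracker returns a tracker that starts in the passing state.
func NewTracker(failureThreshold, successThreshold int) *Tracker {
	return &Tracker{
		FailureThreshold: failureThreshold,
		SuccessThreshold: successThreshold,
		passing:          true,
	}
}

// Observe records one probe result and reports whether the probe as a whole
// is currently considered passing.
func (t *Tracker) Observe(ok bool) bool {
	if ok {
		t.failures = 0 // any success resets the failure count
		t.successes++
		if t.successes >= t.SuccessThreshold {
			t.passing = true
		}
	} else {
		t.successes = 0
		t.failures++
		if t.failures >= t.FailureThreshold {
			t.passing = false
		}
	}
	return t.passing
}

func main() {
	t := NewTracker(3, 1)
	// Two failures, a recovery, then three failures in a row.
	for _, ok := range []bool{false, false, true, false, false, false} {
		fmt.Println(ok, "->", t.Observe(ok))
	}
}
```

Only the last observation in `main` flips the state, because the earlier pair of failures was interrupted by a success. For a Liveness Probe that flip is the point at which the container gets restarted.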

Readiness Probe Behavior Flow

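Key behaviors:

✅ Success: the Pod is marked Ready, added to the Service endpoints, and starts receiving traffic
❌ Failure: the Pod is marked Not Ready, removed from the Service endpoints, and stops receiving traffic
⚠️ No restart: unlike a Liveness Probe failure, a Readiness Probe failure never restarts the container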

Startup Probe Behavior Flow

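Key behaviors:

✅ Success: startup is complete; the Liveness and Readiness Probes begin running
❌ Failure: probing continues until it succeeds or failureThreshold is reached
⏱️ Timeout: after failureThreshold consecutive failures (about failureThreshold * periodSeconds), the container is killed and restarted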

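Best Practices

1. Choosing Probe Types

All production applications: configure both Liveness and Readiness Probes
Slow-starting applications: add a Startup Probe
Batch jobs: a Liveness Probe alone is enough

2. Choosing a Check Mechanism

HTTP GET: suited to web applications and HTTP services
TCP Socket: suited to non-HTTP services
Exec: suited to checks that need custom logic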

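3. Timing Parameters

# Suggested starting values
livenessProbe:
  initialDelaySeconds: 30    # give the application enough time to start
  periodSeconds: 10          # moderate probe frequency
  timeoutSeconds: 5          # reasonable timeout
  successThreshold: 1        # must be 1 for liveness and startup probes
  failureThreshold: 3        # keeps brief glitches from causing a restart

readinessProbe:
  initialDelaySeconds: 5     # shorter than for liveness
  periodSeconds: 5           # probe more often
  timeoutSeconds: 3          # shorter timeout
  successThreshold: 1
  failureThreshold: 3

startupProbe:
  initialDelaySeconds: 0     # start immediately
  periodSeconds: 10          # probe interval
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 30       # allows a longer startup (up to 300 seconds)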

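4. Endpoint Design

A health check endpoint should:

Be lightweight and respond quickly
Avoid depending on external services (such as a database), above all for liveness
Return an unambiguous HTTP status code (the kubelet treats 200 to 399 as success)
Avoid time-consuming work in the check itself

Example endpoint implementation: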

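// Go example
package main

import "net/http"

// Dependency is a hypothetical interface; db and cache are assumed to be set at startup.
type Dependency interface{ IsConnected() bool }

var db, cache Dependency

func healthzHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func readyHandler(w http.ResponseWriter, r *http.Request) {
    // Check key dependencies (a nil dependency counts as not connected)
    if db != nil && cache != nil && db.IsConnected() && cache.IsConnected() {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Ready"))
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("Not Ready"))
    }
}

func main() {
    http.HandleFunc("/healthz", healthzHandler)
    http.HandleFunc("/ready", readyHandler)
    http.ListenAndServe(":8080", nil)
}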

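5. Avoiding Common Mistakes

❌ Mistake 1: Readiness Probe checks a shared external dependency

# Wrong
readinessProbe:
  httpGet:
    path: /ready  # if this endpoint checks the database connection,
    port: 8080    # a database outage makes every Pod unavailable at once

✅ Better:

# Right
readinessProbe:
  httpGet:
    path: /ready  # checks only whether the application itself is ready
    port: 8080
# Handle external dependency failures with monitoring and alerting, plus retries or degraded responses in the application, not with a Liveness Probe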

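❌ Mistake 2: Liveness Probe checks an external dependency

# Wrong
livenessProbe:
  exec:
    command: ["pg_isready"]  # a database outage restarts the container, which cannot fix the database

✅ Better:

# Right
livenessProbe:
  httpGet:
    path: /healthz  # checks only whether the application itself is alive
    port: 8080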

❌ Mistake 3: Probe interval too short

# Wrong
livenessProbe:
  periodSeconds: 1  # too frequent; adds load to the application and the kubelet

✅ Better:

# Right
livenessProbe:
  periodSeconds: 10  # a reasonable probe interval

Practical Configuration Examples

Example 1: Complete Production Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: production-app
    spec:
      containers:
      - name: app
        image: production-app:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: ENV
          value: "production"
        # Startup Probe: handles slow startup
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 30  # wait up to 5 minutes
        # Liveness Probe: checks that the application is alive
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0  # starts as soon as the Startup Probe succeeds
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        # Readiness Probe: checks that the application is ready
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 0  # starts as soon as the Startup Probe succeeds
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Example 2: TCP Socket Check

apiVersion: v1
kind: Pod
metadata:
  name: tcp-probe-example
spec:
  containers:
  - name: redis
    image: redis:latest
    ports:
    - containerPort: 6379
    livenessProbe:
      tcpSocket:
        port: 6379
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 6379
      initialDelaySeconds: 5
      periodSeconds: 5

Example 3: Exec Command Check

apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-example
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - "ps aux | grep -v grep | grep myapp || exit 1"
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - "test -f /tmp/ready || exit 1"
      initialDelaySeconds: 5
      periodSeconds: 5

Example 4: Spring Boot Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-boot-app
spec:
  selector:
    matchLabels:
      app: spring-boot-app
  template:
    metadata:
      labels:
        app: spring-boot-app
    spec:
      containers:
      - name: app
        image: spring-boot-app:latest
        ports:
        - containerPort: 8080
        # Spring Boot Actuator health probe endpoints (Spring Boot 2.3+)
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5

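Common Problems and Solutions

Problem 1: Container Restarts Frequently

Symptom: the container keeps restarting, and the Pod events (kubectl describe pod) show Liveness Probe failures

Possible causes:

The Liveness Probe is configured too strictly
The application takes longer than initialDelaySeconds to start
The check endpoint is slow or times out

Solution:

# Relax the probe timing
livenessProbe:
  initialDelaySeconds: 60  # raised from 30
  periodSeconds: 15        # raised from 10
  timeoutSeconds: 10       # raised from 5
  failureThreshold: 5      # raised from 3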

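Problem 2: Traffic Is Still Routed to a Pod That Is Not Ready

Symptom: the Readiness Probe fails, but traffic still reaches the Pod

Possible causes:

The Readiness Probe is misconfigured
The Service is misconfigured
The check endpoint is implemented incorrectly

Solution:

# Check Pod status
kubectl get pods -o wide

# Inspect the Readiness Probe configuration and recent probe events
kubectl describe pod <pod-name>

# Check which Pod IPs the Service currently routes to
kubectl get endpoints <service-name>

# Verify the endpoint from inside the container (requires curl in the image)
kubectl exec <pod-name> -- curl -i http://localhost:8080/ready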

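Problem 3: Liveness Probe False Failures During Startup

Symptom: the application starts slowly and the Liveness Probe fails while it is still starting

Solution: use a Startup Probe

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30  # allows a longer startup
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0  # starts as soon as the Startup Probe succeeds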

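Problem 4: Probes Hurt Application Performance

Symptom: the health check endpoint does expensive work and slows the application down

Solution:

Create a dedicated, lightweight health check endpoint
Avoid database queries and heavy computation in the check logic
Cache check results to reduce the cost of each probe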
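
The caching suggestion can be sketched as follows: run the expensive check in the background and let the probe endpoint return the last stored result. `CachedCheck` and its methods are names invented for this example, and the sketch assumes Go 1.19 or newer for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// CachedCheck stores the result of an expensive check so that probe requests
// never run the check themselves. The zero value reports "failing".
type CachedCheck struct {
	ok atomic.Bool
}

// Refresh runs check once and stores its result.
func (c *CachedCheck) Refresh(check func() bool) { c.ok.Store(check()) }

// Run refreshes immediately and then on every tick until stop is closed.
// Intended to be started in its own goroutine.
func (c *CachedCheck) Run(check func() bool, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		c.Refresh(check)
		select {
		case <-ticker.C:
		case <-stop:
			return
		}
	}
}

// Status returns the HTTP status for the cached result without re-running the check.
func (c *CachedCheck) Status() int {
	if c.ok.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP makes the cached check usable directly as a probe handler.
func (c *CachedCheck) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(c.Status())
}

func main() {
	calls := 0
	expensive := func() bool { calls++; return true } // stands in for a slow check

	c := &CachedCheck{}
	c.Refresh(expensive)
	for i := 0; i < 100; i++ {
		c.Status() // 100 probe hits, none of which runs the check
	}
	fmt.Println(c.Status(), calls)
}
```

The cost of this design is staleness: the endpoint can report an outdated result for up to one refresh interval, so the interval should be chosen with the probe's periodSeconds in mind.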

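Summary

Each of the three Kubernetes probes (Liveness, Readiness, Startup) has its own purpose and behavior:

Liveness Probe: keeps the container alive by restarting it on failure
Readiness Probe: gates traffic; on failure the Pod stops receiving traffic but is not restarted
Startup Probe: handles slow-starting applications and prevents false failures during startup

Key takeaways:

✅ Configure Liveness and Readiness Probes in production
✅ Add a Startup Probe for slow-starting applications
✅ Keep probe endpoints lightweight and fast
✅ Avoid checking external dependencies in probes
✅ Set timing parameters sensibly and avoid probing too often
✅ Choose the check mechanism (HTTP, TCP, Exec) that fits the application

Configured correctly, these probes significantly improve the reliability and availability of applications on Kubernetes: traffic reaches only healthy Pods, and failed applications recover promptly.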

Youqing Han

DevOps Engineer
