Deploying and Operating a Highly Available etcd Cluster

admin
2025-11-30

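Introduction

etcd is the "data center" of a Kubernetes cluster: it stores all cluster state and configuration. A highly available etcd cluster is the foundation of a highly available Kubernetes architecture. This article walks through deploying a production-grade three-node etcd cluster from scratch and covers the key day-to-day skills: operations, monitoring, and backup and restore.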

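1. etcd Architecture and High-Availability Principles

1.1 The Role of etcd in Kubernetes

As the distributed key-value store behind Kubernetes, etcd holds the following critical data:

  • Node information (Nodes)
  • Pod information and scheduling state
  • Service discovery and endpoints (Endpoints)
  • Configuration (ConfigMaps, Secrets)
  • The current state of every Kubernetes object

1.2 The Raft Consensus Algorithm

etcd uses the Raft algorithm for distributed consensus. The core flow: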

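graph TD
    A[Client request] --> B[Leader node]
    B --> C[Replicate log to Followers]
    C --> D[Majority of nodes acknowledge]
    D --> E[Commit to state machine]
    E --> F[Return response to client]
    G[Leader failure] --> H[Election timeout]
    H --> I[New Leader election]
    I --> J[Cluster keeps serving]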

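Election mechanism

  • Each node uses a randomized election timeout. The Raft paper suggests 150-300 ms; etcd derives it from election-timeout (default 1000 ms) and randomizes between one and two times that value
  • A Follower whose timeout expires without hearing from a Leader becomes a Candidate and starts an election
  • The Candidate that wins a majority of votes becomes the Leader
  • Each term (Term) has at most one Leader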
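1.3 Recommended Cluster Size

| Nodes | Fault tolerance | Recommended scenario |
|---|---|---|
| 1 | None | Test environments only |
| 3 | 1 node failure | Recommended for production |
| 5 | 2 node failures | Large clusters |
| 7 | 3 node failures | Very large clusters |

Golden rule: use at least a 3-node etcd cluster in production.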
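
The fault-tolerance figures in the table follow from Raft's majority rule: a cluster of N voting members needs N/2+1 of them (integer division) to make progress. The sketch below reproduces that arithmetic; the function names quorum and fault_tolerance are illustrative helpers, not part of etcd.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for a Raft cluster with N voting members.
# quorum and fault_tolerance are hypothetical helper names.

# Smallest number of members that forms a majority.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that may fail while a majority survives.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5 6 7; do
    printf '%d nodes: quorum=%d, tolerates %d failure(s)\n' \
        "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

The loop shows why odd sizes are preferred: going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failure.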

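2. Environment Preparation

2.1 Download the etcd Binary Package

# Download a specific etcd release
ETCD_VERSION="v3.5.22"
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz

# If GitHub is slow, a mirror inside China can be used instead (check that the path exists for your version)
wget https://mirrors.aliyun.com/etcd/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz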

2.2 Install the etcd Tools

# Extract only the etcd and etcdctl binaries into the system path
tar -xf etcd-${ETCD_VERSION}-linux-amd64.tar.gz -C /usr/local/bin \
    etcd-${ETCD_VERSION}-linux-amd64/etcd \
    etcd-${ETCD_VERSION}-linux-amd64/etcdctl \
    --strip-components=1

# Verify the installation
etcdctl version
# Output:
# etcdctl version: 3.5.22
# API version: 3.5

2.3 Distribute to All Nodes

# Copy the binaries from the first node to the other nodes
for node in k8s-cluster242 k8s-cluster243; do
    scp /usr/local/bin/etcd* root@${node}:/usr/local/bin/
done

3. Building the TLS Certificate Infrastructure

3.1 Install the CFSSL Tools

CFSSL is CloudFlare's open-source PKI/TLS toolkit:

# Download the CFSSL binaries
wget https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_amd64 \
     https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_amd64 \
     https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl-certinfo_1.6.5_linux_amd64

# Strip the version suffix and move the binaries into PATH
# (this is the Perl rename shipped with Debian/Ubuntu; on other distributions rename the files with mv)
rename -v "s/_1.6.5_linux_amd64//g" cfssl*
mv cfssl cfssljson cfssl-certinfo /usr/local/bin/
chmod +x /usr/local/bin/cfssl*

# Verify the installation
cfssl version
# Output: Version: 1.6.5

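3.2 Create the Certificate Directory Layout

# Create one managed location for certificates (certs) and signing inputs (pki)
mkdir -pv /xiaozhi/{certs,pki}/etcd
tree /xiaozhi/
# Output:
# /xiaozhi/
# ├── certs
# │   └── etcd
# └── pki
#     └── etcd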

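3.3 Generate the CA Root Certificate

3.3.1 Create the CA CSR File

cd /xiaozhi/pki/etcd

# The CA is valid for 876000h (about 100 years).
# JSON does not allow comments, so this note must stay outside the file.
cat > etcd-ca-csr.json <<EOF
{
  "CN": "etcd",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "Beijing",
      "L": "Beijing",
      "O": "etcd",
      "OU": "Etcd Security"
    }
  ],
  "ca": {
    "expiry": "876000h"
  }
}
EOF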

3.3.2 Generate the CA Certificate

cfssl gencert -initca etcd-ca-csr.json | \
    cfssljson -bare /xiaozhi/certs/etcd/etcd-ca

# List the generated files
ls -la /xiaozhi/certs/etcd/
# etcd-ca.csr      # certificate signing request
# etcd-ca-key.pem  # private key
# etcd-ca.pem      # CA root certificate
3.4 Configure the Signing Policy

cat > ca-config.json <<EOF
{
  "signing": {
    "default": {
      "expiry": "876000h"
    },
    "profiles": {
      "etcd": {
        "usages": [
          "signing",
          "key encipherment",
          "server auth",
          "client auth"
        ],
        "expiry": "876000h"
      }
    }
  }
}
EOF

3.5 Generate the Server Certificate

3.5.1 Create the Certificate Signing Request

cat > etcd-csr.json <<EOF
{
  "CN": "etcd",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "Beijing",
      "L": "Beijing",
      "O": "etcd",
      "OU": "Etcd Security"
    }
  ]
}
EOF

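3.5.2 Issue the Server Certificate

cfssl gencert \
  -ca=/xiaozhi/certs/etcd/etcd-ca.pem \
  -ca-key=/xiaozhi/certs/etcd/etcd-ca-key.pem \
  -config=ca-config.json \
  -hostname=127.0.0.1,localhost,k8s-cluster241,k8s-cluster242,k8s-cluster243,10.0.0.241,10.0.0.242,10.0.0.243 \
  -profile=etcd \
  etcd-csr.json | cfssljson -bare /xiaozhi/certs/etcd/etcd-server

Important notes

  • The -hostname list must contain every address that clients or peers may use to reach etcd
  • That includes localhost, the hostnames, and the IP addresses
  • TLS verification matches the address against the certificate's Subject Alternative Name (SAN) entries
  • Keep the list on a single line without spaces; a line break inside it splits the list into separate arguments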
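
Because a single stray space or line break in the SAN list silently changes what cfssl receives, it can help to assemble the list programmatically. The following is a minimal sketch; build_san_list is a hypothetical helper, and the node names and IPs are the ones used throughout this article.

```shell
#!/usr/bin/env bash
# build_san_list (hypothetical helper): join all arguments with commas,
# producing a value suitable for cfssl's -hostname flag.
build_san_list() {
    local IFS=','
    echo "$*"
}

SAN_LIST=$(build_san_list 127.0.0.1 localhost \
    k8s-cluster241 k8s-cluster242 k8s-cluster243 \
    10.0.0.241 10.0.0.242 10.0.0.243)

echo "${SAN_LIST}"
```

The result can then be passed as -hostname="${SAN_LIST}". After issuing the certificate, `openssl x509 -in /xiaozhi/certs/etcd/etcd-server.pem -noout -text | grep -A1 'Subject Alternative Name'` shows which names actually made it into the certificate.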

3.6 Distribute the Certificates to the Cluster

# Use the sync script created previously
data_rsync.sh /xiaozhi/certs/etcd/

# Verify the certificates on the other nodes
for node in k8s-cluster242 k8s-cluster243; do
    echo "=== Checking ${node} ==="
    ssh ${node} "ls -la /xiaozhi/certs/etcd/"
done

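4. etcd Cluster Configuration

4.1 Configuration File Walkthrough

etcd accepts a YAML configuration file. Below is a complete example for the first node of the three-node cluster:

# /xiaozhi/softwares/etcd/etcd.config.yml
name: 'k8s-cluster241'  # unique member name
data-dir: /var/lib/etcd  # data directory
wal-dir: /var/lib/etcd/wal  # write-ahead log directory

# Listen addresses for peer and client traffic
listen-peer-urls: 'https://10.0.0.241:2380'
# The plain-HTTP loopback listener is unauthenticated; remove it if local tools can use TLS
listen-client-urls: 'https://10.0.0.241:2379,http://127.0.0.1:2379'

# Advertised addresses
initial-advertise-peer-urls: 'https://10.0.0.241:2380'
advertise-client-urls: 'https://10.0.0.241:2379'

# Initial cluster bootstrap settings
initial-cluster: 'k8s-cluster241=https://10.0.0.241:2380,k8s-cluster242=https://10.0.0.242:2380,k8s-cluster243=https://10.0.0.243:2380'
initial-cluster-token: 'etcd-k8s-cluster'
initial-cluster-state: 'new'

# Raft timing (milliseconds)
heartbeat-interval: 100
election-timeout: 1000

# Snapshot settings
snapshot-count: 5000
max-snapshots: 3
max-wals: 5

# Backend quota in bytes (0 selects the default quota of about 2 GB, not "unlimited")
quota-backend-bytes: 0

# Security settings
client-transport-security:
  cert-file: '/xiaozhi/certs/etcd/etcd-server.pem'
  key-file: '/xiaozhi/certs/etcd/etcd-server-key.pem'
  client-cert-auth: true
  trusted-ca-file: '/xiaozhi/certs/etcd/etcd-ca.pem'
  auto-tls: false  # self-signed auto TLS is not needed when cert-file and key-file are set

peer-transport-security:
  cert-file: '/xiaozhi/certs/etcd/etcd-server.pem'
  key-file: '/xiaozhi/certs/etcd/etcd-server-key.pem'
  client-cert-auth: true  # inside this section the key is client-cert-auth, as in etcd's sample config
  trusted-ca-file: '/xiaozhi/certs/etcd/etcd-ca.pem'
  auto-tls: false

# Feature switches
enable-v2: false  # the v2 API is deprecated and Kubernetes does not use it
enable-pprof: true  # exposes /debug/pprof on the client URLs; turn off when not profiling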

4.2 Generate the Per-Node Configuration

Create the matching configuration file on each node:

# Node 1: k8s-cluster241
mkdir -pv /xiaozhi/softwares/etcd
cat > /xiaozhi/softwares/etcd/etcd.config.yml <<'EOF'
# Content as shown above; only name and the IP addresses differ between nodes
EOF

# Node 2: k8s-cluster242 (run on node 2)
cat > /xiaozhi/softwares/etcd/etcd.config.yml <<'EOF'
name: 'k8s-cluster242'
data-dir: /var/lib/etcd
wal-dir: /var/lib/etcd/wal
listen-peer-urls: 'https://10.0.0.242:2380'
listen-client-urls: 'https://10.0.0.242:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://10.0.0.242:2380'
advertise-client-urls: 'https://10.0.0.242:2379'
# ... the remaining settings are identical
EOF

# Node 3: k8s-cluster243 (run on node 3)
# Same pattern: change the name and the IP addresses
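
Since only the member name and the IP address change between nodes, the node-specific keys can be rendered from two parameters instead of being edited by hand. This is a sketch under that assumption; render_node_config is a hypothetical helper, and its output would still need the shared settings from section 4.1 appended.

```shell
#!/usr/bin/env bash
# render_node_config NAME IP (hypothetical helper):
# print the configuration keys that differ from node to node.
render_node_config() {
    local name=$1 ip=$2
    cat <<EOF
name: '${name}'
data-dir: /var/lib/etcd
wal-dir: /var/lib/etcd/wal
listen-peer-urls: 'https://${ip}:2380'
listen-client-urls: 'https://${ip}:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://${ip}:2380'
advertise-client-urls: 'https://${ip}:2379'
EOF
}

# Example: the node-specific part for the third node.
render_node_config k8s-cluster243 10.0.0.243
```

Generating all three fragments from one function keeps the files consistent, which matters because a mismatch between name and the entries in initial-cluster prevents a member from joining.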

4.3 Create the Systemd Service

# ProtectSystem=strict makes the file system read-only for the service,
# so the path listed in ReadWritePaths must exist before etcd starts
mkdir -p /var/lib/etcd && chmod 700 /var/lib/etcd

cat > /usr/lib/systemd/system/etcd.service <<'EOF'
[Unit]
Description=Jason Yin's Etcd Service
Documentation=https://etcd.io/docs/
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/xiaozhi/softwares/etcd/etcd.config.yml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536

# Security hardening
ReadWritePaths=/var/lib/etcd
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
Alias=etcd3.service
EOF

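5. Starting and Verifying the Cluster

5.1 Start All Nodes

# Run on every node; a new cluster only becomes ready once a majority of members is up
systemctl daemon-reload
systemctl enable --now etcd
systemctl status etcd

# Follow the logs
journalctl -u etcd -f --lines=50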

5.2 Verify the Cluster State

# Check the cluster state with etcdctl
etcdctl --endpoints="https://10.0.0.241:2379,https://10.0.0.242:2379,https://10.0.0.243:2379" \
  --cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
  --cert=/xiaozhi/certs/etcd/etcd-server.pem \
  --key=/xiaozhi/certs/etcd/etcd-server-key.pem \
  endpoint status --write-out=table

Expected output (exactly one member reports IS LEADER = true):

+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.241:2379 | 566d563f3c9274ed |  3.5.22 |   25 kB |      true |      false |         2 |          9 |                  9 |        |
| https://10.0.0.242:2379 | b83b69ba7d246b29 |  3.5.22 |   25 kB |     false |      false |         2 |          9 |                  9 |        |
| https://10.0.0.243:2379 |  47b70f9ecb1f200 |  3.5.22 |   20 kB |     false |      false |         2 |          9 |                  9 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

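5.3 High-Availability Tests

5.3.1 Leader Failover Test

# The bare etcdctl calls below assume the endpoints and TLS flags are already supplied (see the alias in 6.1)

# 1. Identify the current Leader (the member whose own ID equals the leader ID it reports)
LEADER_ENDPOINT=$(etcdctl endpoint status --write-out=json | \
  jq -r '.[] | select(.Status.leader == .Status.header.member_id) | .Endpoint')

echo "Current Leader: ${LEADER_ENDPOINT}"

# 2. Stop the Leader node (strip the scheme and the port to get the host)
NODE_NAME=$(echo "${LEADER_ENDPOINT}" | sed -E 's#^https?://##; s#:[0-9]+$##')
ssh root@${NODE_NAME} "systemctl stop etcd"

# 3. Wait for the election (with election-timeout 1000 ms it usually finishes within 1-2 seconds)
sleep 3

# 4. Check for the new Leader
etcdctl endpoint status --write-out=table

# 5. Bring the stopped node back
ssh root@${NODE_NAME} "systemctl start etcd"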
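
A fixed sleep either waits too long or not long enough. One alternative is to poll until a condition holds. The sketch below uses two hypothetical helpers: wait_until is a generic retry loop, and has_leader assumes that etcdctl's JSON status output is an array of objects carrying a Status.leader field and that jq 1.5 or later is installed.

```shell
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY CMD... (hypothetical helper):
# run CMD up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# has_leader (hypothetical helper): succeeds when at least one reachable
# member reports a non-zero leader ID. A missing field is treated as 0.
has_leader() {
    etcdctl endpoint status --write-out=json 2>/dev/null \
        | jq -e 'any(.[]; (.Status.leader // 0) != 0)' >/dev/null
}

# Self-contained demonstration with a command that always succeeds.
wait_until 3 0 true && echo "condition met"
```

In the failover test, step 3 could then become `wait_until 10 1 has_leader`, which returns as soon as a new Leader is visible and gives up after roughly ten seconds.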

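5.3.2 Simulating a Network Partition

# Simulate a network partition (run on node 1)
iptables -A INPUT -s 10.0.0.242 -j DROP
iptables -A INPUT -s 10.0.0.243 -j DROP

# Observe the cluster: node 1 is now an isolated minority and cannot commit writes,
# while nodes 2 and 3 still hold quorum (run this on node 2 or 3 to see their Leader)
etcdctl endpoint status --write-out=table

# Restore the network
iptables -D INPUT -s 10.0.0.242 -j DROP
iptables -D INPUT -s 10.0.0.243 -j DROP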

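6. Day-to-Day etcd Operations

6.1 Add a Command Alias

To simplify interactive use, add an alias for etcdctl:

# Append to ~/.bashrc (aliases only apply to interactive shells; scripts must pass the flags explicitly)
cat >> ~/.bashrc <<'EOF'
alias etcdctl='etcdctl \
  --endpoints="https://10.0.0.241:2379,https://10.0.0.242:2379,https://10.0.0.243:2379" \
  --cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
  --cert=/xiaozhi/certs/etcd/etcd-server.pem \
  --key=/xiaozhi/certs/etcd/etcd-server-key.pem'
EOF

source ~/.bashrc

# Verify the alias
etcdctl endpoint status --write-out=table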
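
Shell aliases are not expanded inside scripts or cron jobs. As an alternative sketch, etcdctl also reads its global flags from ETCDCTL_-prefixed environment variables. The file location suggested in the comment is a hypothetical choice. Use either the alias or the variables rather than both, since etcdctl may reject a flag that is also set through its environment variable.

```shell
#!/usr/bin/env bash
# Could be saved as, for example, /etc/profile.d/etcdctl.sh (hypothetical path)
# and sourced by interactive shells and scripts alike.
export ETCDCTL_ENDPOINTS="https://10.0.0.241:2379,https://10.0.0.242:2379,https://10.0.0.243:2379"
export ETCDCTL_CACERT="/xiaozhi/certs/etcd/etcd-ca.pem"
export ETCDCTL_CERT="/xiaozhi/certs/etcd/etcd-server.pem"
export ETCDCTL_KEY="/xiaozhi/certs/etcd/etcd-server-key.pem"

echo "etcdctl will use: ${ETCDCTL_ENDPOINTS}"
```

A single command can still be pointed at one member by overriding the variable inline, for example `ETCDCTL_ENDPOINTS=https://10.0.0.241:2379 etcdctl snapshot save /tmp/snap.db`.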

6.2 Basic Data Operations

etcd offers key-value operations similar to Redis:

# 1. Write data
etcdctl put /cluster/nodes/node1 "10.0.0.241"
etcdctl put /cluster/nodes/node2 "10.0.0.242"
etcdctl put /cluster/nodes/node3 "10.0.0.243"

# 2. Read data
etcdctl get /cluster/nodes/node1
# Output (key and value on separate lines):
# /cluster/nodes/node1
# 10.0.0.241

# 3. Prefix query
etcdctl get /cluster/nodes --prefix
# Prints every key-value pair under /cluster/nodes

# 4. Show only keys or only values
etcdctl get /cluster/nodes --prefix --keys-only
etcdctl get /cluster/nodes --prefix --print-value-only

# 5. Watch a key for changes (the watch runs in the background and prints the PUT event)
etcdctl watch /cluster/nodes/node1 &
etcdctl put /cluster/nodes/node1 "updated-10.0.0.241"

# 6. Delete
etcdctl del /cluster/nodes/node1
etcdctl del /cluster/nodes --prefix  # delete everything under the prefix

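6.3 Cluster Maintenance Commands

# List the members
etcdctl member list -w table

# Add a new member (register it first, then start the new node with initial-cluster-state: existing)
etcdctl member add node4 --peer-urls="https://10.0.0.244:2380"

# Remove a failed member
etcdctl member remove <member-id>

# Update a member's peer URL
etcdctl member update <member-id> --peer-urls="https://new-ip:2380"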

7. Data Backup and Restore

7.1 Automated Backup Strategy

7.1.1 Create the Backup Script

cat > /usr/local/sbin/etcd-backup.sh <<'EOF'
#!/bin/bash
# etcd automatic backup script
# Author: xiaozhi

BACKUP_DIR="/data/etcd-backup"
RETENTION_DAYS=7
# snapshot save must target exactly one endpoint
ETCD_ENDPOINTS="https://10.0.0.241:2379"
CACERT="/xiaozhi/certs/etcd/etcd-ca.pem"
CERT="/xiaozhi/certs/etcd/etcd-server.pem"
KEY="/xiaozhi/certs/etcd/etcd-server-key.pem"

# Create the backup directory
mkdir -p "${BACKUP_DIR}"

# Build the backup file name
BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"

# Take the snapshot
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Starting etcd backup..."
if etcdctl snapshot save "${BACKUP_FILE}" \
  --endpoints="${ETCD_ENDPOINTS}" \
  --cacert="${CACERT}" \
  --cert="${CERT}" \
  --key="${KEY}"; then
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Backup completed: ${BACKUP_FILE}"

    # Check the snapshot's integrity
    etcdctl snapshot status "${BACKUP_FILE}" -w table

    # Remove old backups (only after a successful run, so a failing job never deletes the last good copy)
    find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +"${RETENTION_DAYS}" -delete

    # Record the result
    echo "$(date '+%Y-%m-%d %H:%M:%S') Backup successful: ${BACKUP_FILE}" >> /var/log/etcd-backup.log
else
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Backup failed!"
    echo "$(date '+%Y-%m-%d %H:%M:%S') Backup failed" >> /var/log/etcd-backup.log
    exit 1
fi
EOF

chmod +x /usr/local/sbin/etcd-backup.sh

7.1.2 Schedule the Backup

# Add a cron job
cat > /etc/cron.d/etcd-backup <<'EOF'
# Run the backup every day at 02:00
0 2 * * * root /usr/local/sbin/etcd-backup.sh > /var/log/etcd-backup-cron.log 2>&1
EOF

# Test the backup script
/usr/local/sbin/etcd-backup.sh
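
The retention step of the backup script is easy to get wrong, and a mistake there deletes backups. The sketch below isolates it so it can be tried safely against a throwaway directory. prune_old_backups is a hypothetical helper, and touch -d assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# prune_old_backups DIR DAYS (hypothetical helper):
# delete etcd-snapshot-*.db files in DIR whose age exceeds DAYS,
# printing each file that is removed. Other files are left alone.
prune_old_backups() {
    local dir=$1 days=$2
    find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' \
        -mtime +"$days" -print -delete
}

# Dry run against a temporary directory with one old and one fresh file.
demo_dir=$(mktemp -d)
touch -d '10 days ago' "${demo_dir}/etcd-snapshot-old.db"
touch "${demo_dir}/etcd-snapshot-new.db"
prune_old_backups "${demo_dir}" 7
ls "${demo_dir}"
rm -rf "${demo_dir}"
```

Note that find's -mtime +7 counts whole days, so a file is removed only once it is at least eight days old.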

7.2 Manual Backup and Restore

7.2.1 Create Test Data

# Write some test keys
for i in {1..10}; do
    etcdctl put /test/key${i} "value${i}-$(date +%s)"
done

# Verify the data (should print 10)
etcdctl get /test --prefix --keys-only | grep -c '^/test/'

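7.2.2 Take the Backup

# Create a snapshot. snapshot save accepts exactly one endpoint, so bypass the
# three-endpoint alias with "command" and pass the flags explicitly
BACKUP_FILE="/tmp/etcd-snapshot-$(date +%F).db"
command etcdctl snapshot save ${BACKUP_FILE} \
  --endpoints=https://10.0.0.241:2379 \
  --cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
  --cert=/xiaozhi/certs/etcd/etcd-server.pem \
  --key=/xiaozhi/certs/etcd/etcd-server-key.pem

# Inspect the snapshot
etcdctl snapshot status ${BACKUP_FILE} -w table
# Output:
# +---------+----------+------------+------------+
# |  HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +---------+----------+------------+------------+
# | e546d7a |       11 |         20 |      20 kB |
# +---------+----------+------------+------------+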

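7.2.3 Simulate Data Loss

# Simulate a disaster by deleting ALL keys (test clusters only, never a cluster backing Kubernetes)
etcdctl del "" --prefix

# Verify that the store is empty
etcdctl get "" --prefix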

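7.2.4 Restore

# Run steps 1-3 on every node with the same snapshot file and that node's own
# --name and peer URL. Copy ${BACKUP_FILE} to the other nodes first.

# 1. Stop etcd on all nodes
systemctl stop etcd

# 2. Move the old data directory aside
mv /var/lib/etcd /var/lib/etcd.backup.$(date +%s)

# 3. Restore from the snapshot (example for k8s-cluster241).
#    Restoring straight into the configured data-dir and wal-dir means etcd.config.yml
#    and the systemd ReadWritePaths setting need no changes.
#    In etcd 3.5, "etcdutl snapshot restore" is the preferred equivalent of this command.
etcdctl snapshot restore ${BACKUP_FILE} \
  --data-dir=/var/lib/etcd \
  --wal-dir=/var/lib/etcd/wal \
  --name=k8s-cluster241 \
  --initial-cluster="k8s-cluster241=https://10.0.0.241:2380,k8s-cluster242=https://10.0.0.242:2380,k8s-cluster243=https://10.0.0.243:2380" \
  --initial-cluster-token="etcd-k8s-cluster" \
  --initial-advertise-peer-urls="https://10.0.0.241:2380"

# 4. Start the service on all nodes
systemctl start etcd

# 5. Verify that the data is back
etcdctl get /test --prefix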
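
Every member has to be restored from the same snapshot, each with its own name and peer URL, so it is easy to mistype one of the three invocations. The sketch below only prints the commands for review and executes nothing. restore_cmd and DATA_DIR are hypothetical names; set DATA_DIR to whatever directory your etcd.config.yml points at.

```shell
#!/usr/bin/env bash
# Cluster layout as used throughout this article.
INITIAL_CLUSTER="k8s-cluster241=https://10.0.0.241:2380,k8s-cluster242=https://10.0.0.242:2380,k8s-cluster243=https://10.0.0.243:2380"
DATA_DIR="/var/lib/etcd"   # assumption: must match data-dir in etcd.config.yml

# restore_cmd NAME IP SNAPSHOT (hypothetical helper):
# print the restore command for one member without running it.
restore_cmd() {
    local name=$1 ip=$2 snapshot=$3
    printf 'etcdctl snapshot restore %s' "$snapshot"
    printf ' --data-dir=%s --wal-dir=%s/wal' "$DATA_DIR" "$DATA_DIR"
    printf ' --name=%s' "$name"
    printf ' --initial-cluster=%s' "$INITIAL_CLUSTER"
    printf ' --initial-cluster-token=etcd-k8s-cluster'
    printf ' --initial-advertise-peer-urls=https://%s:2380\n' "$ip"
}

for n in 241 242 243; do
    echo "# on k8s-cluster${n}:"
    restore_cmd "k8s-cluster${n}" "10.0.0.${n}" /tmp/etcd-snapshot.db
done
```

Printing first and running second keeps a destructive step reviewable. The --wal-dir value mirrors the wal-dir setting from section 4.1 so that etcd finds the restored WAL where its configuration expects it.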

7.3 Cross-Cluster Migration

# Take a snapshot from the source cluster (a single endpoint)
etcdctl --endpoints=<source-cluster> snapshot save snapshot.db

# Restore into the target cluster (on every node)
etcdctl snapshot restore snapshot.db \
  --data-dir /var/lib/etcd-new \
  --name <member-name> \
  --initial-cluster <new-cluster-config> \
  --initial-cluster-token <new-token> \
  --initial-advertise-peer-urls <peer-url>

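8. Monitoring and Alerting

8.1 Health Check Endpoints

etcd exposes built-in health check endpoints. Because client-cert-auth is enabled, curl has to present a client certificate:

# TLS options shared by the calls below
CURL_TLS="--cacert /xiaozhi/certs/etcd/etcd-ca.pem --cert /xiaozhi/certs/etcd/etcd-server.pem --key /xiaozhi/certs/etcd/etcd-server-key.pem"

# HTTP health check
curl -s ${CURL_TLS} https://10.0.0.241:2379/health
# Output: {"health":"true","reason":""}

# Member-local check that does not require quorum (serializable read)
curl -s ${CURL_TLS} "https://10.0.0.241:2379/health?serializable=true"

# Metrics endpoint (Prometheus)
curl -s ${CURL_TLS} https://10.0.0.241:2379/metrics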
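
For scripting, the body returned by /health has to be turned into an exit status. A minimal sketch follows, assuming the response contains the literal pair "health":"true" when the member is healthy. is_healthy is a hypothetical helper that uses plain shell pattern matching so it needs no JSON tooling.

```shell
#!/usr/bin/env bash
# is_healthy BODY (hypothetical helper):
# succeed only if the /health response body reports health as "true".
is_healthy() {
    case "$1" in
        *'"health":"true"'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration with canned response bodies.
for body in '{"health":"true","reason":""}' '{"health":"false","reason":"NOSPACE"}' ''; do
    if is_healthy "$body"; then
        echo "healthy:   ${body}"
    else
        echo "unhealthy: ${body:-<empty response>}"
    fi
done
```

An empty body, which is what a failed curl call produces, is treated as unhealthy, so the helper can wrap a command substitution around the curl invocation directly.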

8.2 Key Monitoring Checks

Create a monitoring script:

cat > /usr/local/sbin/etcd-monitor.sh <<'EOF'
#!/bin/bash
# etcd cluster monitoring script

ENDPOINTS="https://10.0.0.241:2379,https://10.0.0.242:2379,https://10.0.0.243:2379"
CACERT="/xiaozhi/certs/etcd/etcd-ca.pem"
CERT="/xiaozhi/certs/etcd/etcd-server.pem"
KEY="/xiaozhi/certs/etcd/etcd-server-key.pem"

echo "=== etcd Cluster Status ==="
date
echo

# Cluster status
etcdctl endpoint status --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY} \
  --write-out=table

echo
echo "=== Cluster Health ==="
# Check each endpoint separately so one failure does not hide the others
for endpoint in $(echo ${ENDPOINTS} | tr ',' ' '); do
    health=$(etcdctl endpoint health --endpoints=${endpoint} \
      --cacert=${CACERT} --cert=${CERT} --key=${KEY} 2>/dev/null || echo "unhealthy")
    echo "${endpoint}: ${health}"
done

echo
echo "=== Alarm List ==="
etcdctl alarm list --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY}

echo
echo "=== DB Size ==="
# The JSON output is an array of {Endpoint, Status} objects; dbSize is in bytes
etcdctl endpoint status --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY} \
  --write-out=json | jq -r '.[] | "\(.Endpoint): \(.Status.dbSize)"'
EOF

chmod +x /usr/local/sbin/etcd-monitor.sh

8.3 Prometheus Scrape Configuration

# prometheus.yml example
scrape_configs:
  - job_name: 'etcd'
    scheme: https
    tls_config:
      ca_file: /xiaozhi/certs/etcd/etcd-ca.pem
      cert_file: /xiaozhi/certs/etcd/etcd-server.pem
      key_file: /xiaozhi/certs/etcd/etcd-server-key.pem
      # the server certificate lists these IPs as SANs, so verification can stay on
      insecure_skip_verify: false
    static_configs:
    - targets:
      - '10.0.0.241:2379'
      - '10.0.0.242:2379'
      - '10.0.0.243:2379'

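9. Performance Tuning

9.1 Hardware Requirements

| Metric | Minimum | Recommended | Production |
|---|---|---|---|
| CPU | 2 cores | 4 cores | 8+ cores |
| Memory | 4 GB | 8 GB | 16+ GB |
| Disk | 100 GB SSD | 200 GB NVMe | 500 GB NVMe RAID |
| IOPS | 500 | 1500 | 5000+ |
| Latency | < 10 ms | < 5 ms | < 1 ms |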

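9.2 Key Parameter Tuning

# Production tuning
# /xiaozhi/softwares/etcd/etcd.config.yml

# Performance-related parameters
snapshot-count: 100000  # committed transactions between snapshots
quota-backend-bytes: 8589934592  # 8 GB backend quota
max-request-bytes: 15728640  # 15 MB request size limit
max-txn-ops: 32768  # maximum operations per transaction

# Network tuning (election-timeout must be at least 5 times heartbeat-interval)
heartbeat-interval: 150  # heartbeat interval in ms
election-timeout: 1500  # election timeout in ms
initial-election-tick-advance: true

# Storage housekeeping
auto-compaction-mode: periodic
auto-compaction-retention: "1h"  # keep one hour of revision history
# API gateway (unrelated to storage; serves the HTTP/JSON API)
enable-grpc-gateway: true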
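
etcd refuses to start when election-timeout is less than five times heartbeat-interval or exceeds 50000 ms, so a candidate pair of values can be sanity-checked before it is rolled out. The sketch below mirrors those two rules; check_raft_timing is a hypothetical helper, not an etcd command.

```shell
#!/usr/bin/env bash
# check_raft_timing HEARTBEAT_MS ELECTION_MS (hypothetical helper):
# apply the two startup validation rules to a pair of timing values.
check_raft_timing() {
    local hb=$1 el=$2
    if [ $(( hb * 5 )) -gt "$el" ]; then
        echo "invalid: election-timeout ${el}ms is less than 5x heartbeat-interval ${hb}ms"
        return 1
    fi
    if [ "$el" -gt 50000 ]; then
        echo "invalid: election-timeout ${el}ms exceeds the 50000ms upper limit"
        return 1
    fi
    echo "ok: heartbeat-interval=${hb}ms election-timeout=${el}ms"
}

check_raft_timing 100 1000           # the defaults
check_raft_timing 150 1500           # the tuned values above
check_raft_timing 300 1000 || true   # rejected: 5 x 300 > 1000
```

Both values should be identical on every member; mixing different timings within one cluster can make elections unstable.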

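9.3 Performance Testing

# Build the benchmark tool from the etcd source tree (requires Go)
git clone --depth 1 --branch v3.5.22 https://github.com/etcd-io/etcd.git
cd etcd && go install -v ./tools/benchmark

# Run a write benchmark (the TLS flags are needed because client-cert-auth is enabled)
benchmark \
  --endpoints="https://10.0.0.241:2379" \
  --cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
  --cert=/xiaozhi/certs/etcd/etcd-server.pem \
  --key=/xiaozhi/certs/etcd/etcd-server-key.pem \
  --target-leader \
  --conns=100 \
  --clients=1000 \
  put \
  --key-size=8 \
  --sequential-keys \
  --total=100000 \
  --val-size=256

# Reading the results (rough reference values; actual numbers depend on disk and network):
# - Average latency: ideally < 10 ms
# - QPS: ideally > 10000
# - Success rate: should be 100%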

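10. Troubleshooting Guide

10.1 Common Problems and Fixes

Problem 1: etcd fails to start

# Show detailed logs
journalctl -u etcd -xe --no-pager

# Common causes:
# 1. Certificate problems (check the validity dates and the SAN entries)
openssl x509 -in /xiaozhi/certs/etcd/etcd-server.pem -text -noout

# 2. Port conflicts
ss -tlnp | grep -E '2379|2380'

# 3. Data directory permissions (etcd expects mode 700)
ls -la /var/lib/etcd/
chmod 700 /var/lib/etcd/
# Only if the unit runs etcd as a dedicated user (the unit in 4.3 runs as root):
# chown -R etcd:etcd /var/lib/etcd/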

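Problem 2: Suspected split brain (loss of quorum)

Raft's majority rule prevents two Leaders from committing writes at the same time. What usually looks like split brain is a minority partition that has no Leader.

# Check each node individually ("command" bypasses the multi-endpoint alias)
for node in 241 242 243; do
    echo "Node 10.0.0.${node}:"
    command etcdctl --endpoints=https://10.0.0.${node}:2379 \
      --cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
      --cert=/xiaozhi/certs/etcd/etcd-server.pem \
      --key=/xiaozhi/certs/etcd/etcd-server-key.pem \
      endpoint status -w table
done

# Resolution:
# 1. Stop the nodes on the minority side
# 2. On the majority side, remove the failed member
# 3. Wipe its data directory and re-add it as a new member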

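Problem 3: Running out of disk space

# Check disk usage
df -h /var/lib/etcd

# Reclaim space: compact old revisions first, then defragment each member
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "${rev}"
etcdctl defrag

# Clear the NOSPACE alarm if the quota had been exceeded
etcdctl alarm disarm

# Compare the DB size (bytes) with quota-backend-bytes
etcdctl endpoint status --write-out=json | \
    jq -r '.[] | "\(.Endpoint): \(.Status.dbSize)"'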

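10.2 Diagnostic Tools

# 1. Measure cluster performance (writes test keys, so run it during a quiet period)
etcdctl check perf

# 2. Compare KV hashes across members to detect data inconsistency
etcdctl endpoint hashkv --cluster

# 3. Collect recent logs and metrics into a diagnostic bundle
journalctl -u etcd --since "1 hour ago" --no-pager > etcd-journal.log
curl -s http://127.0.0.1:2379/metrics > etcd-metrics.txt
tar -czf etcd-report.tar.gz etcd-journal.log etcd-metrics.txt

# 4. CPU profiling (requires enable-pprof: true)
go tool pprof http://localhost:2379/debug/pprof/profile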

11. Security Hardening

11.1 Network Access Control

# Restrict access with iptables: allow the cluster subnet, drop everything else
iptables -A INPUT -p tcp --dport 2379 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2379 -j DROP
iptables -A INPUT -p tcp --dport 2380 -j DROP

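11.2 Enable File Logging with Rotation

# etcd has no dedicated audit log; the closest built-in option is file logging with rotation (etcd 3.5+).
# Add to the configuration file. /var/log/etcd must exist and be listed in ReadWritePaths in the systemd unit.
log-outputs: ['/var/log/etcd/etcd.log']
log-level: info
enable-log-rotation: true
log-rotation-config-json: '{"maxsize": 100, "maxage": 30, "maxbackups": 10, "localtime": false, "compress": false}'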

11.3 Rotate Certificates Regularly

# Example certificate rotation script
cat > /usr/local/sbin/rotate-etcd-certs.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# ca-config.json and etcd-csr.json live in the PKI directory
cd /xiaozhi/pki/etcd

# Issue a new certificate and a new private key
cfssl gencert \
  -ca=/xiaozhi/certs/etcd/etcd-ca.pem \
  -ca-key=/xiaozhi/certs/etcd/etcd-ca-key.pem \
  -config=ca-config.json \
  -hostname=127.0.0.1,localhost,k8s-cluster241,k8s-cluster242,k8s-cluster243,10.0.0.241,10.0.0.242,10.0.0.243 \
  -profile=etcd \
  etcd-csr.json | cfssljson -bare /xiaozhi/certs/etcd/etcd-server-new

# Rolling update: one node at a time, so quorum is never lost.
# The certificate and its key must be swapped together, otherwise etcd fails to start.
for node in 241 242 243; do
    scp -p /xiaozhi/certs/etcd/etcd-server-new.pem /xiaozhi/certs/etcd/etcd-server-new-key.pem \
        root@10.0.0.${node}:/tmp/
    ssh root@10.0.0.${node} "cd /xiaozhi/certs/etcd && \
        cp -p etcd-server.pem etcd-server-old.pem && \
        cp -p etcd-server-key.pem etcd-server-old-key.pem && \
        mv /tmp/etcd-server-new.pem etcd-server.pem && \
        mv /tmp/etcd-server-new-key.pem etcd-server-key.pem && \
        systemctl restart etcd"
    sleep 10
done
EOF
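
Rotation is easier to schedule when the remaining lifetime of the current certificate is known. The sketch below converts an expiry date in the format printed by openssl x509 -enddate into a number of days. cert_days_left is a hypothetical helper, and date -d assumes GNU date.

```shell
#!/usr/bin/env bash
# cert_days_left NOT_AFTER [NOW_EPOCH] (hypothetical helper):
# NOT_AFTER is an expiry date such as "Jan  2 00:00:00 2030 GMT".
# NOW_EPOCH defaults to the current time; passing it makes the result reproducible.
cert_days_left() {
    local not_after=$1 now=${2:-$(date +%s)} end
    end=$(date -u -d "$not_after" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}

# Demonstration with a fixed "now" of 2030-01-01 00:00:00 UTC.
fixed_now=$(date -u -d '2030-01-01 00:00:00' +%s)
echo "days left: $(cert_days_left 'Jan 31 00:00:00 2030 GMT' "$fixed_now")"
```

Against a real certificate the date can be taken from `openssl x509 -enddate -noout -in /xiaozhi/certs/etcd/etcd-server.pem | cut -d= -f2`, and a cron job can trigger the rotation script once the result drops below a chosen threshold.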

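Summary

Key Takeaways

  1. The certificate infrastructure comes first: correct TLS configuration is the foundation of a secure etcd cluster
  2. Configuration consistency: the cluster settings must be identical in every node's configuration file
  3. Backups are the lifeline: take them regularly and test the restore procedure
  4. Monitoring is essential: track cluster state and performance metrics continuously

Best-Practice Checklist

  • Use an odd number of nodes (3, 5, 7)
  • Configure automated backups and a retention policy
  • Enable monitoring and alerting
  • Test failover regularly
  • Keep the etcd version compatible with Kubernetes
  • Use SSD/NVMe storage
  • Separate etcd traffic from the business network
  • Run restore drills regularly

Next Steps

  1. Verify the health of the etcd cluster
  2. Configure the automated backup policy
  3. Set up monitoring and alerts
  4. Run a failover test
  5. Document the restore procedure

Field note: in production, consider running etcd on dedicated machines, separate from the control plane, so that resource contention does not affect cluster stability.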