Deploying and Operating a Highly Available etcd Cluster
Introduction
etcd is the "data hub" of a Kubernetes cluster: it stores all cluster state and configuration. A highly available etcd cluster is the cornerstone of an HA Kubernetes architecture. This article walks through deploying a production-grade three-node etcd cluster from scratch, then covers day-to-day operations, monitoring, backup, and recovery.
1. etcd Architecture and High-Availability Fundamentals
1.1 etcd's Role in Kubernetes
As Kubernetes' distributed key-value store, etcd holds the following critical data:
- Node information (Nodes)
- Pod specs and scheduling state
- Service discovery data and Endpoints
- Configuration (ConfigMaps, Secrets)
- The current state of every Kubernetes object
1.2 The Raft Consensus Algorithm
etcd achieves distributed consistency with the Raft algorithm. The write path and failover flow:
graph TD
A[Client request] --> B[Leader node]
B --> C[Replicate log to Followers]
C --> D[Majority of nodes acknowledge]
D --> E[Apply to state machine]
E --> F[Respond to client]
G[Leader failure] --> H[Election timeout]
H --> I[New Leader election]
I --> J[Cluster resumes serving]
Leader election:
- Each node uses a randomized election timeout (150-300 ms in the Raft paper)
- A Follower that times out becomes a Candidate and starts an election
- A Candidate that gathers a majority of votes becomes Leader
- Each term has at most one Leader
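The randomized timeout is the key trick that avoids endless split votes. A minimal illustrative sketch (not etcd's actual implementation):

```shell
#!/usr/bin/env bash
# Illustrative sketch only: each node draws a random election timeout from
# [150, 300) ms, so after a Leader failure one node usually times out first
# and wins the election before the others even become Candidates.
base_ms=150
spread_ms=150
timeout_ms=$(( base_ms + RANDOM % spread_ms ))
echo "this node's election timeout: ${timeout_ms}ms"
```

Because each node draws independently, ties are rare, and a tied term simply re-runs the draw.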
1.3 Cluster Sizing Recommendations
| Nodes | Fault tolerance | Suggested use |
|---|---|---|
| 1 | none | test environments only |
| 3 | 1 node failure | recommended for production |
| 5 | 2 node failures | large clusters |
| 7 | 3 node failures | very large clusters |
Golden rule: use at least a 3-node etcd cluster in production.
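The sizing table follows directly from Raft's majority requirement; a quick sketch of the arithmetic:

```shell
#!/usr/bin/env bash
# Quorum math behind the sizing table: a cluster of n members needs a
# majority (floor(n/2)+1) to commit writes, so it tolerates the loss of
# floor((n-1)/2) members. Even cluster sizes add no tolerance, which is
# why odd sizes are recommended.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerance() { echo $(( ($1 - 1) / 2 )); }
for n in 1 3 5 7; do
  echo "nodes=$n quorum=$(quorum "$n") tolerates=$(tolerance "$n")"
done
```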
2. Environment Preparation
2.1 Download the etcd Binary Release
# Download a pinned etcd version
ETCD_VERSION="v3.5.22"
wget https://github.com/etcd-io/etcd/releases/download/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz
# If the network is slow, a Chinese mirror can be used instead
wget https://mirrors.aliyun.com/etcd/${ETCD_VERSION}/etcd-${ETCD_VERSION}-linux-amd64.tar.gz
2.2 Install the etcd Tools
# Extract the binaries straight into the system path
tar -xf etcd-${ETCD_VERSION}-linux-amd64.tar.gz -C /usr/local/bin \
  etcd-${ETCD_VERSION}-linux-amd64/etcd \
  etcd-${ETCD_VERSION}-linux-amd64/etcdctl \
  --strip-components=1
# Verify the installation
etcdctl version
# Output:
# etcdctl version: 3.5.22
# API version: 3.5
2.3 Distribute to All Nodes
# Copy from the primary node to the other nodes
for node in k8s-cluster242 k8s-cluster243; do
  scp /usr/local/bin/etcd* root@${node}:/usr/local/bin/
done
3. Building the TLS Certificate Infrastructure
3.1 Install the CFSSL Toolkit
CFSSL is CloudFlare's open-source PKI/TLS toolkit:
# Download the CFSSL binaries
wget https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl_1.6.5_linux_amd64 \
  https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssljson_1.6.5_linux_amd64 \
  https://github.com/cloudflare/cfssl/releases/download/v1.6.5/cfssl-certinfo_1.6.5_linux_amd64
# Strip the version suffix and move into PATH
# (perl-style rename; a few plain mv commands work just as well)
rename -v "s/_1.6.5_linux_amd64//g" cfssl*
mv cfssl cfssljson cfssl-certinfo /usr/local/bin/
chmod +x /usr/local/bin/cfssl*
# Verify the installation
cfssl version
# Output: Version: 1.6.5
3.2 Create the Certificate Directory Layout
# One directory tree for all certificate material
mkdir -pv /xiaozhi/{certs,pki}/etcd
tree /xiaozhi/
# Output:
# /xiaozhi/
# ├── certs
# │   └── etcd
# └── pki
#     └── etcd
3.3 Generate the CA Root Certificate
3.3.1 Create the CA CSR Configuration
# Note: JSON does not allow inline comments; the 876000h expiry is ~100 years
cd /xiaozhi/pki/etcd
cat > etcd-ca-csr.json <<EOF
{
  "CN": "etcd",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "Beijing",
      "L": "Beijing",
      "O": "etcd",
      "OU": "Etcd Security"
    }
  ],
  "ca": {
    "expiry": "876000h"
  }
}
EOF
3.3.2 Generate the CA Certificate
cfssl gencert -initca etcd-ca-csr.json | \
  cfssljson -bare /xiaozhi/certs/etcd/etcd-ca
# Inspect the generated files
ls -la /xiaozhi/certs/etcd/
# etcd-ca.csr      # certificate signing request
# etcd-ca-key.pem  # CA private key
# etcd-ca.pem      # CA root certificate
3.4 配置证书签发策略
cat > ca-config.json <<EOF
{
"signing": {
"default": {
"expiry": "876000h"
},
"profiles": {
"etcd": {
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
],
"expiry": "876000h"
}
}
}
}
EOF
3.5 Generate the Server Certificate
3.5.1 Create the Certificate Request
cat > etcd-csr.json <<EOF
{
  "CN": "etcd",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "Beijing",
      "L": "Beijing",
      "O": "etcd",
      "OU": "Etcd Security"
    }
  ]
}
EOF
3.5.2 Generate the Server Certificate
# The -hostname list must be a single comma-separated argument with no
# spaces or line breaks inside it
cfssl gencert \
  -ca=/xiaozhi/certs/etcd/etcd-ca.pem \
  -ca-key=/xiaozhi/certs/etcd/etcd-ca-key.pem \
  -config=ca-config.json \
  -hostname=127.0.0.1,localhost,k8s-cluster241,k8s-cluster242,k8s-cluster243,10.0.0.241,10.0.0.242,10.0.0.243 \
  -profile=etcd \
  etcd-csr.json | cfssljson -bare /xiaozhi/certs/etcd/etcd-server
Important:
- The -hostname parameter must include every address clients and peers may use
- That means localhost, every node hostname, and every IP address
- These entries become the certificate's Subject Alternative Names (SANs), which TLS verifies against
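To see what a SAN list looks like without touching the real CA, the sketch below generates a throwaway self-signed certificate with the same kind of SAN list and prints it; the `/tmp` file names are placeholders for this demo, and `openssl req -addext` requires OpenSSL 1.1.1+:

```shell
#!/usr/bin/env bash
# Demo: create a throwaway self-signed cert with an explicit SAN list and
# print the subjectAltName extension. (OpenSSL 1.1.1+ for -addext.)
SAN="DNS:localhost,DNS:k8s-cluster241,DNS:k8s-cluster242,DNS:k8s-cluster243,IP:127.0.0.1,IP:10.0.0.241"
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/san-demo-key.pem -out /tmp/san-demo-cert.pem \
  -subj "/CN=etcd" -addext "subjectAltName=${SAN}"
# Print the SANs the certificate actually carries
openssl x509 -in /tmp/san-demo-cert.pem -noout -text | \
  grep -A1 "Subject Alternative Name"
```

The same `openssl x509 -noout -text` inspection is the quickest way to confirm that the generated etcd-server.pem really contains every hostname and IP before you distribute it.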
3.6 分发证书到集群
# 使用之前创建的同步脚本
data_rsync.sh /xiaozhi/certs/etcd/
# 验证其他节点证书
for node in k8s-cluster242 k8s-cluster243; do
echo "=== Checking ${node} ==="
ssh ${node} "ls -la /xiaozhi/certs/etcd/"
done
4. etcd Cluster Configuration
4.1 The Configuration File, Explained
etcd accepts a YAML configuration file. A complete example for the first of the three nodes:
# /xiaozhi/softwares/etcd/etcd.config.yml
name: 'k8s-cluster241'            # unique node name
data-dir: /var/lib/etcd           # data directory
wal-dir: /var/lib/etcd/wal        # write-ahead log directory
# Cluster communication
listen-peer-urls: 'https://10.0.0.241:2380'
listen-client-urls: 'https://10.0.0.241:2379,http://127.0.0.1:2379'
# Advertised addresses
initial-advertise-peer-urls: 'https://10.0.0.241:2380'
advertise-client-urls: 'https://10.0.0.241:2379'
# Initial cluster membership
initial-cluster: 'k8s-cluster241=https://10.0.0.241:2380,k8s-cluster242=https://10.0.0.242:2380,k8s-cluster243=https://10.0.0.243:2380'
initial-cluster-token: 'etcd-k8s-cluster'
initial-cluster-state: 'new'
# Election parameters (milliseconds)
heartbeat-interval: 100
election-timeout: 1000
# Snapshot settings
snapshot-count: 5000
max-snapshots: 3
max-wals: 5
# Storage quota (0 means no limit)
quota-backend-bytes: 0
# Security
client-transport-security:
  cert-file: '/xiaozhi/certs/etcd/etcd-server.pem'
  key-file: '/xiaozhi/certs/etcd/etcd-server-key.pem'
  client-cert-auth: true
  trusted-ca-file: '/xiaozhi/certs/etcd/etcd-ca.pem'
  auto-tls: true                  # ignored when cert files are set explicitly
peer-transport-security:
  cert-file: '/xiaozhi/certs/etcd/etcd-server.pem'
  key-file: '/xiaozhi/certs/etcd/etcd-server-key.pem'
  peer-client-cert-auth: true
  trusted-ca-file: '/xiaozhi/certs/etcd/etcd-ca.pem'
  auto-tls: true                  # ignored when cert files are set explicitly
# Feature switches
enable-v2: true                   # enable the v2 API (some tools still depend on it)
enable-pprof: true                # enable profiling endpoints
4.2 Generate Per-Node Configurations
Create the corresponding configuration file on each node:
# Node 1: k8s-cluster241
mkdir -pv /xiaozhi/softwares/etcd
cat > /xiaozhi/softwares/etcd/etcd.config.yml <<'EOF'
# contents as above; adjust name and IP addresses per node
EOF
# Node 2: k8s-cluster242 (run on node 2)
cat > /xiaozhi/softwares/etcd/etcd.config.yml <<'EOF'
name: 'k8s-cluster242'
data-dir: /var/lib/etcd
wal-dir: /var/lib/etcd/wal
listen-peer-urls: 'https://10.0.0.242:2380'
listen-client-urls: 'https://10.0.0.242:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://10.0.0.242:2380'
advertise-client-urls: 'https://10.0.0.242:2379'
# ... remaining settings identical to node 1
EOF
# Node 3: k8s-cluster243 (run on node 3)
# Same again, with its own name and IP
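To avoid copy-paste drift between the three files, the per-node configs can be stamped out from a single template. The sketch below uses a trimmed template with only a few keys; in practice it would contain the full config from section 4.1 with `__NAME__`/`__IP__` placeholders:

```shell
#!/usr/bin/env bash
# Render per-node etcd configs from one template (trimmed for the demo).
cat > /tmp/etcd-template.yml <<'EOF'
name: '__NAME__'
listen-peer-urls: 'https://__IP__:2380'
listen-client-urls: 'https://__IP__:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://__IP__:2380'
advertise-client-urls: 'https://__IP__:2379'
EOF

for pair in k8s-cluster241:10.0.0.241 k8s-cluster242:10.0.0.242 k8s-cluster243:10.0.0.243; do
  name=${pair%%:*}
  ip=${pair##*:}
  sed -e "s/__NAME__/${name}/g" -e "s/__IP__/${ip}/g" \
      /tmp/etcd-template.yml > "/tmp/etcd-${name}.yml"
done
grep ^name /tmp/etcd-k8s-cluster242.yml   # → name: 'k8s-cluster242'
```

Each rendered file can then be copied to its node as `/xiaozhi/softwares/etcd/etcd.config.yml`.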
4.3 Create the Systemd Service
cat > /usr/lib/systemd/system/etcd.service <<'EOF'
[Unit]
Description=Etcd Service
Documentation=https://coreos.com/etcd/docs/latest/
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/xiaozhi/softwares/etcd/etcd.config.yml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
# Security hardening
ReadWritePaths=/var/lib/etcd
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
Alias=etcd3.service
EOF
5. Starting and Verifying the Cluster
5.1 Start All Nodes
# Run on every node
systemctl daemon-reload
systemctl enable --now etcd
systemctl status etcd
# Follow the logs
journalctl -u etcd -f --lines=50
5.2 验证集群状态
# 使用 etcdctl 检查集群状态
etcdctl --endpoints="https://10.0.0.241:2379,https://10.0.0.242:2379,https://10.0.0.243:2379" \
--cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
--cert=/xiaozhi/certs/etcd/etcd-server.pem \
--key=/xiaozhi/certs/etcd/etcd-server-key.pem \
endpoint status --write-out=table
预期输出:
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.241:2379 | 566d563f3c9274ed | 3.5.21 | 25 kB | true | false | 2 | 9 | 9 | |
| https://10.0.0.242:2379 | b83b69ba7d246b29 | 3.5.21 | 25 kB | false | false | 2 | 9 | 9 | |
| https://10.0.0.243:2379 | 47b70f9ecb1f200 | 3.5.21 | 20 kB | false | false | 2 | 9 | 9 | |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
5.3 High-Availability Testing
5.3.1 Leader Failover Test
# (These commands rely on the TLS alias from section 6.1, or pass the
# --endpoints/--cacert/--cert/--key flags explicitly.)
# 1. Identify the current Leader: the member whose own ID equals the
#    leader ID it reports
LEADER_ENDPOINT=$(etcdctl endpoint status --write-out=json | \
  jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint')
echo "Current Leader: ${LEADER_ENDPOINT}"
# 2. Stop the Leader node
NODE_NAME=$(echo ${LEADER_ENDPOINT} | cut -d: -f2 | cut -d/ -f3)
ssh root@${NODE_NAME} "systemctl stop etcd"
# 3. Wait for the election (usually completes within 1-2 seconds)
sleep 3
# 4. Check the new Leader
etcdctl endpoint status --write-out=table
# 5. Bring the failed node back
ssh root@${NODE_NAME} "systemctl start etcd"
5.3.2 Simulating a Network Partition
# Simulate a partition (run on node 1)
iptables -A INPUT -s 10.0.0.242 -j DROP
iptables -A INPUT -s 10.0.0.243 -j DROP
# Observe cluster state: node 1 loses quorum and stops serving writes,
# while the 242/243 majority keeps working
etcdctl endpoint status --write-out=table
# Restore the network
iptables -D INPUT -s 10.0.0.242 -j DROP
iptables -D INPUT -s 10.0.0.243 -j DROP
6. Day-to-Day etcd Operations
6.1 Add a Command Alias
To save typing, wrap etcdctl's connection flags in an alias:
# Append to bashrc
cat >> ~/.bashrc <<'EOF'
alias etcdctl='etcdctl \
--endpoints="10.0.0.241:2379,10.0.0.242:2379,10.0.0.243:2379" \
--cacert=/xiaozhi/certs/etcd/etcd-ca.pem \
--cert=/xiaozhi/certs/etcd/etcd-server.pem \
--key=/xiaozhi/certs/etcd/etcd-server-key.pem'
EOF
source ~/.bashrc
# Verify the alias
etcdctl endpoint status --write-out=table
6.2 Basic Data Operations
etcd offers Redis-like key-value operations:
# 1. Write data
etcdctl put /cluster/nodes/node1 "10.0.0.241"
etcdctl put /cluster/nodes/node2 "10.0.0.242"
etcdctl put /cluster/nodes/node3 "10.0.0.243"
# 2. Read data
etcdctl get /cluster/nodes/node1
# Output: /cluster/nodes/node1\n10.0.0.241
# 3. Prefix query
etcdctl get /cluster/nodes --prefix
# Prints every key-value pair under /cluster/nodes/
# 4. Keys or values only
etcdctl get /cluster/nodes --prefix --keys-only
etcdctl get /cluster/nodes --prefix --print-value-only
# 5. Watch a key for changes
etcdctl watch /cluster/nodes/node1 &
etcdctl put /cluster/nodes/node1 "updated-10.0.0.241"
# 6. Delete
etcdctl del /cluster/nodes/node1
etcdctl del /cluster/nodes --prefix   # delete everything under the prefix
6.3 集群维护命令
# 查看成员列表
etcdctl member list -w table
# 添加新成员
etcdctl member add node4 --peer-urls="https://10.0.0.244:2380"
# 移除故障成员
etcdctl member remove <member-id>
# 更新成员URL
etcdctl member update <member-id> --peer-urls="https://new-ip:2380"
7. Backup and Recovery
7.1 Automated Backup Strategy
7.1.1 Create the Backup Script
cat > /usr/local/sbin/etcd-backup.sh <<'EOF'
#!/bin/bash
# etcd automated backup script
# Author: xiaozhi
BACKUP_DIR="/data/etcd-backup"
RETENTION_DAYS=7
ETCD_ENDPOINTS="https://10.0.0.241:2379"   # snapshot save takes a single endpoint
CACERT="/xiaozhi/certs/etcd/etcd-ca.pem"
CERT="/xiaozhi/certs/etcd/etcd-server.pem"
KEY="/xiaozhi/certs/etcd/etcd-server-key.pem"
# Create the backup directory
mkdir -p ${BACKUP_DIR}
# Timestamped backup file name
BACKUP_FILE="${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
# Take the snapshot
echo "[$(date '+%Y-%m-%d %H:%M:%S')] Starting etcd backup..."
etcdctl snapshot save ${BACKUP_FILE} \
  --endpoints=${ETCD_ENDPOINTS} \
  --cacert=${CACERT} \
  --cert=${CERT} \
  --key=${KEY}
if [ $? -eq 0 ]; then
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] Backup completed: ${BACKUP_FILE}"
  # Check snapshot integrity
  etcdctl snapshot status ${BACKUP_FILE} -w table
  # Prune old backups
  find ${BACKUP_DIR} -name "etcd-snapshot-*.db" -mtime +${RETENTION_DAYS} -delete
  # Log the result
  echo "$(date '+%Y-%m-%d %H:%M:%S') Backup successful: ${BACKUP_FILE}" >> /var/log/etcd-backup.log
else
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] Backup failed!"
  echo "$(date '+%Y-%m-%d %H:%M:%S') Backup failed" >> /var/log/etcd-backup.log
  exit 1
fi
EOF
chmod +x /usr/local/sbin/etcd-backup.sh
7.1.2 配置定时备份
# 添加定时任务
cat > /etc/cron.d/etcd-backup <<'EOF'
# 每天凌晨 2 点执行备份
0 2 * * * root /usr/local/sbin/etcd-backup.sh > /var/log/etcd-backup-cron.log 2>&1
EOF
# 测试备份脚本
/usr/local/sbin/etcd-backup.sh
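The `find -mtime +N` retention rule in the script is easy to misread: `+7` matches files whose age in whole days is strictly greater than 7. A self-contained dry run against throwaway files (GNU `touch`/`find` assumed; file names are demo placeholders):

```shell
#!/usr/bin/env bash
# Dry run of the retention rule from etcd-backup.sh, without an etcd cluster.
BACKUP_DIR=$(mktemp -d)
RETENTION_DAYS=7
touch -d "10 days ago" "${BACKUP_DIR}/etcd-snapshot-old.db"
touch "${BACKUP_DIR}/etcd-snapshot-new.db"
find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +${RETENTION_DAYS} -delete
ls "${BACKUP_DIR}"   # only etcd-snapshot-new.db survives
```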
7.2 Manual Backup and Restore
7.2.1 Create Test Data
# Write some test keys
for i in {1..10}; do
  etcdctl put /test/key${i} "value${i}-$(date +%s)"
done
# Verify
etcdctl get /test --prefix --keys-only | wc -l
7.2.2 Take a Backup
# Create a snapshot
BACKUP_FILE="/tmp/etcd-snapshot-$(date +%F).db"
etcdctl snapshot save ${BACKUP_FILE}
# Inspect the snapshot
etcdctl snapshot status ${BACKUP_FILE} -w table
# Output:
# +---------+----------+------------+------------+
# | HASH    | REVISION | TOTAL KEYS | TOTAL SIZE |
# +---------+----------+------------+------------+
# | e546d7a | 11       | 20         | 20 kB      |
# +---------+----------+------------+------------+
7.2.3 模拟数据丢失
# 模拟灾难场景 - 删除所有数据
etcdctl del "" --prefix
# 验证数据已清空
etcdctl get "" --prefix
7.2.4 执行恢复
# 1. 停止所有 etcd 节点
systemctl stop etcd
# 2. 备份原数据目录
mv /var/lib/etcd /var/lib/etcd.backup.$(date +%s)
# 3. 从快照恢复
etcdctl snapshot restore ${BACKUP_FILE} \
--data-dir=/var/lib/etcd-new \
--name=k8s-cluster241 \
--initial-cluster="k8s-cluster241=https://10.0.0.241:2380,k8s-cluster242=https://10.0.0.242:2380,k8s-cluster243=https://10.0.0.243:2380" \
--initial-cluster-token="etcd-k8s-cluster" \
--initial-advertise-peer-urls="https://10.0.0.241:2380"
# 4. 更新配置文件指向新数据目录
sed -i 's#/var/lib/etcd#/var/lib/etcd-new#g' /xiaozhi/softwares/etcd/etcd.config.yml
# 5. 启动服务
systemctl start etcd
# 6. 验证数据恢复
etcdctl get /test --prefix
7.3 跨集群迁移
# 从源集群备份
etcdctl --endpoints=<source-cluster> snapshot save snapshot.db
# 恢复到目标集群(每个节点)
etcdctl snapshot restore snapshot.db \
--data-dir /var/lib/etcd-new \
--name <member-name> \
--initial-cluster <new-cluster-config> \
--initial-cluster-token <new-token> \
--initial-advertise-peer-urls <peer-url>
8. Monitoring and Alerting
8.1 Health Check Endpoints
etcd exposes built-in health and metrics endpoints. Because client-cert-auth is enabled, curl must present the client certificate:
# HTTP health check
curl --cacert /xiaozhi/certs/etcd/etcd-ca.pem \
  --cert /xiaozhi/certs/etcd/etcd-server.pem \
  --key /xiaozhi/certs/etcd/etcd-server-key.pem \
  https://10.0.0.241:2379/health
# Output: {"health":"true"}
# Metrics endpoint (Prometheus format)
curl --cacert /xiaozhi/certs/etcd/etcd-ca.pem \
  --cert /xiaozhi/certs/etcd/etcd-server.pem \
  --key /xiaozhi/certs/etcd/etcd-server-key.pem \
  https://10.0.0.241:2379/metrics
8.2 Key Monitoring Metrics
A monitoring script:
cat > /usr/local/sbin/etcd-monitor.sh <<'EOF'
#!/bin/bash
# etcd cluster monitoring script
ENDPOINTS="10.0.0.241:2379,10.0.0.242:2379,10.0.0.243:2379"
CACERT="/xiaozhi/certs/etcd/etcd-ca.pem"
CERT="/xiaozhi/certs/etcd/etcd-server.pem"
KEY="/xiaozhi/certs/etcd/etcd-server-key.pem"
echo "=== etcd Cluster Status ==="
date
echo
# Cluster status
etcdctl endpoint status --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY} \
  --write-out=table
echo
echo "=== Cluster Health ==="
for endpoint in $(echo ${ENDPOINTS} | tr ',' ' '); do
  health=$(etcdctl endpoint health --endpoints=${endpoint} \
    --cacert=${CACERT} --cert=${CERT} --key=${KEY} 2>/dev/null || echo "unhealthy")
  echo "${endpoint}: ${health}"
done
echo
echo "=== Alarm List ==="
etcdctl alarm list --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY}
echo
echo "=== DB Size ==="
# endpoint status -w json returns an array of {Endpoint, Status} objects
etcdctl endpoint status --endpoints=${ENDPOINTS} \
  --cacert=${CACERT} --cert=${CERT} --key=${KEY} \
  --write-out=json | jq -r '.[] | "\(.Endpoint): \(.Status.dbSize) bytes"'
EOF
chmod +x /usr/local/sbin/etcd-monitor.sh
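The jq expressions are easy to get wrong because `etcdctl endpoint status -w json` returns an array of `{Endpoint, Status}` objects. The sketch below runs them against a canned sample (all values invented; the field names follow etcd's StatusResponse JSON encoding) so they can be tested without a cluster:

```shell
#!/usr/bin/env bash
# Canned sample shaped like `etcdctl endpoint status -w json` output.
cat > /tmp/etcd-status-sample.json <<'EOF'
[
  {"Endpoint":"https://10.0.0.241:2379",
   "Status":{"header":{"member_id":1111,"revision":9},"leader":1111,"dbSize":25000}},
  {"Endpoint":"https://10.0.0.242:2379",
   "Status":{"header":{"member_id":2222,"revision":9},"leader":1111,"dbSize":25000}}
]
EOF
# Per-endpoint DB size
jq -r '.[] | "\(.Endpoint): \(.Status.dbSize) bytes"' /tmp/etcd-status-sample.json
# The leader is the member whose own ID matches the reported leader ID
jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint' \
  /tmp/etcd-status-sample.json   # → https://10.0.0.241:2379
```

Against a live cluster, replace the canned file with the real `etcdctl ... -w json` output.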
8.3 Prometheus 监控配置
# prometheus.yml 配置示例
scrape_configs:
- job_name: 'etcd'
scheme: https
tls_config:
ca_file: /xiaozhi/certs/etcd/etcd-ca.pem
cert_file: /xiaozhi/certs/etcd/etcd-server.pem
key_file: /xiaozhi/certs/etcd/etcd-server-key.pem
insecure_skip_verify: true
static_configs:
- targets:
- '10.0.0.241:2379'
- '10.0.0.242:2379'
- '10.0.0.243:2379'
9. Performance Tuning
9.1 Hardware Requirements
| Metric | Minimum | Recommended | Production |
|---|---|---|---|
| CPU | 2 cores | 4 cores | 8+ cores |
| Memory | 4 GB | 8 GB | 16+ GB |
| Disk | 100 GB SSD | 200 GB NVMe | 500 GB NVMe RAID |
| IOPS | 500 | 1500 | 5000+ |
| Latency | < 10 ms | < 5 ms | < 1 ms |
9.2 关键参数调优
# 生产环境调优配置
# /xiaozhi/softwares/etcd/etcd.config.yml
# 性能相关参数
snapshot-count: 100000 # 提高快照阈值
quota-backend-bytes: 8589934592 # 8GB 存储限制
max-request-bytes: 15728640 # 15MB 请求限制
max-txn-ops: 32768 # 事务操作限制
# 网络调优
heartbeat-interval: 150 # 心跳间隔
election-timeout: 1500 # 选举超时
initial-election-tick-advance: true
# 磁盘优化
auto-compaction-mode: periodic
auto-compaction-retention: "1h" # 每小时自动压缩
enable-grpc-gateway: true
9.3 性能测试
# 安装基准测试工具
go install go.etcd.io/etcd/tools/benchmark@latest
# 运行基准测试
benchmark \
--endpoints="https://10.0.0.241:2379" \
--target-leader \
--conns=100 \
--clients=1000 \
put \
--key-size=8 \
--sequential-keys \
--total=100000 \
--val-size=256
# 测试结果解读:
# - 平均延迟:应 < 10ms
# - QPS:应 > 10000
# - 成功率:应 = 100%
10. Troubleshooting Guide
10.1 Common Problems
Problem 1: etcd fails to start
# Read the full logs
journalctl -u etcd -xe --no-pager
# Common causes:
# 1. Certificate problems
openssl x509 -in /xiaozhi/certs/etcd/etcd-server.pem -text -noout
# 2. Port conflicts
ss -tlnp | grep -E '2379|2380'
# 3. Data directory permissions
ls -la /var/lib/etcd/
chown -R etcd:etcd /var/lib/etcd/
Problem 2: Lost quorum / suspected split-brain
# Check each node individually
for node in 241 242 243; do
  echo "Node 10.0.0.${node}:"
  etcdctl --endpoints=10.0.0.${node}:2379 endpoint status
done
# Remediation:
# 1. Stop the minority-side nodes
# 2. Remove the failed members on the majority side
# 3. Wipe their data directories and rejoin them as new members
问题3:磁盘空间不足
# 检查磁盘使用
df -h /var/lib/etcd
# 清理旧数据
etcdctl defrag --endpoints=localhost:2379
# 设置存储配额
etcdctl endpoint status --write-out=json | \
jq -r '.Status[] | "\(.Endpoint): \(.DbSize)"'
10.2 诊断工具
# 1. 查看内部状态
etcdctl check perf
# 2. 诊断命令
etcdctl debug
# 3. 生成诊断报告
etcdctl diagnostic --output-file=etcd-report.tar.gz
# 4. 性能分析(需要启用 pprof)
go tool pprof http://localhost:2379/debug/pprof/profile
11. Security Hardening
11.1 Network Access Control
# Restrict access with iptables
iptables -A INPUT -p tcp --dport 2379 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2379 -j DROP
iptables -A INPUT -p tcp --dport 2380 -j DROP
11.2 Logging
# etcd has no built-in audit-log options; for request-level accountability,
# rely on client certificate authentication plus etcd's structured logs
log-level: 'info'
log-outputs: ['/var/log/etcd/etcd.log']
11.3 定期轮换证书
# 证书轮换脚本示例
cat > /usr/local/sbin/rotate-etcd-certs.sh <<'EOF'
#!/bin/bash
# 生成新证书
cfssl gencert \
-ca=/xiaozhi/certs/etcd/etcd-ca.pem \
-ca-key=/xiaozhi/certs/etcd/etcd-ca-key.pem \
-config=ca-config.json \
--hostname=127.0.0.1,k8s-cluster241,k8s-cluster242,k8s-cluster243,10.0.0.241,10.0.0.242,10.0.0.243 \
--profile=etcd \
etcd-csr.json | cfssljson -bare /xiaozhi/certs/etcd/etcd-server-new
# 逐步重启节点(滚动更新)
for node in 241 242 243; do
scp /xiaozhi/certs/etcd/etcd-server-new*.pem root@10.0.0.${node}:/xiaozhi/certs/etcd/
ssh root@10.0.0.${node} "mv /xiaozhi/certs/etcd/etcd-server.pem /xiaozhi/certs/etcd/etcd-server-old.pem"
ssh root@10.0.0.${node} "mv /xiaozhi/certs/etcd/etcd-server-new.pem /xiaozhi/certs/etcd/etcd-server.pem"
ssh root@10.0.0.${node} "systemctl restart etcd"
sleep 10
done
EOF
Summary
Key Takeaways
- Certificates come first: correct TLS configuration is the foundation of a secure etcd cluster
- Configuration consistency: the cluster membership settings must be identical across all nodes
- Backups are a lifeline: regular backups and rehearsed restores are essential
- Monitoring is non-negotiable: watch cluster state and performance metrics continuously
Best-Practice Checklist
- Use an odd number of nodes (3, 5, 7)
- Configure automated backups with a retention policy
- Enable monitoring and alerting
- Test failover regularly
- Keep the etcd version compatible with your Kubernetes version
- Use SSD/NVMe storage
- Separate etcd traffic from business traffic
- Run restore drills on a schedule
Next Steps
- Verify the health of your etcd cluster
- Configure the automated backup policy
- Set up monitoring and alerts
- Run a failover test
- Document the recovery procedure
Field note: in production, run etcd on dedicated machines, separated from the control plane, so that resource contention cannot destabilize the cluster.