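Fault description
My home lost power this afternoon. After the power came back, the K8S cluster did not start up again on its own. On inspection, the etcd service on the 3 etcd nodes would not start. The error log is as follows: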
Jun 24 19:45:05 etcd1 systemd[1]: Starting Etcd Server...
-- Subject: Unit etcd.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has begun starting up.
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://10.0.0.150:2379
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_CERT_FILE=/opt/kubernetes/ssl/etcd.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.0.0.150:2380
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd1=https://10.0.0.150:2380,etcd2=https://10.0.0.151:2380,etcd3=https://10.0.0.152:2380
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=bigboss1
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_KEY_FILE=/opt/kubernetes/ssl/etcd-key.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://10.0.0.150:2380
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_PEER_CERT_FILE=/opt/kubernetes/ssl/etcd.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_PEER_KEY_FILE=/opt/kubernetes/ssl/etcd-key.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/opt/kubernetes/ssl/ca.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/opt/kubernetes/ssl/ca.pem
Jun 24 19:45:05 etcd1 etcd[1453]: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
Jun 24 19:45:05 etcd1 etcd[1453]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
Jun 24 19:45:05 etcd1 etcd[1453]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
Jun 24 19:45:05 etcd1 etcd[1453]: etcd Version: 3.3.11
Jun 24 19:45:05 etcd1 etcd[1453]: Git SHA: 2cf9e51
Jun 24 19:45:05 etcd1 etcd[1453]: Go Version: go1.10.3
Jun 24 19:45:05 etcd1 etcd[1453]: Go OS/Arch: linux/amd64
Jun 24 19:45:05 etcd1 etcd[1453]: setting maximum number of CPUs to 2, total number of available CPUs is 2
Jun 24 19:45:05 etcd1 etcd[1453]: the server is already initialized as member before, starting as etcd member...
Jun 24 19:45:05 etcd1 etcd[1453]: peerTLS: cert = /opt/kubernetes/ssl/etcd.pem, key = /opt/kubernetes/ssl/etcd-key.pem, ca = , trusted-ca = /opt/kubernetes/ssl/ca.pem, client-cert-auth = false, crl-file =
Jun 24 19:45:05 etcd1 etcd[1453]: listening for peers on https://10.0.0.150:2380
Jun 24 19:45:05 etcd1 etcd[1453]: listening for client requests on 127.0.0.1:2379
Jun 24 19:45:05 etcd1 etcd[1453]: listening for client requests on 10.0.0.150:2379
Jun 24 19:45:05 etcd1 etcd[1453]: recovered store from snapshot at index 200002
Jun 24 19:45:05 etcd1 etcd[1453]: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Jun 24 19:45:05 etcd1 bash[1453]: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Jun 24 19:45:05 etcd1 bash[1453]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 19:45:05 etcd1 bash[1453]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x56172ea677c2]
Jun 24 19:45:05 etcd1 bash[1453]: goroutine 1 [running]:
Jun 24 19:45:05 etcd1 bash[1453]: panic(0x56172f115e20, 0x56172f778590)
Jun 24 19:45:05 etcd1 bash[1453]: /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:551 +0x3c5 fp=0xc42021b0f8 sp=0xc42021b058 pc=0x56172e2dfe25
Jun 24 19:45:05 etcd1 bash[1453]: runtime.panicmem()
Jun 24 19:45:05 etcd1 bash[1453]: /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:63 +0x60 fp=0xc42021b118 sp=0xc42021b0f8 pc=0x56172e2decc0
Jun 24 19:45:05 etcd1 bash[1453]: runtime.sigpanic()
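Analysis
The key line in the log is "recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist". Read literally, etcd cannot find the database snapshot file it needs during recovery; in other words, the data files on disk are damaged or incomplete. A quick Baidu search turned up a post whose approach was to delete the damaged files and let etcd generate a fresh database automatically. Surely that is a joke: would you dare do that in a production environment?
That post did, however, give the location of the snap files: /var/lib/etcd/default.etcd/member. I compared the files in that directory across the three nodes. The files on node 2 were the newest, and the "database snapshot file path error" message appeared only in the logs of nodes 1 and 3, not node 2.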
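The comparison above can be sketched as a small helper that reports how fresh a data directory is. This is only an illustration: the function name newest_mtime is made up here, and it assumes GNU find (for -printf), which most Linux distributions ship.

```shell
#!/bin/sh
# Hypothetical helper: print the newest modification time (epoch seconds)
# among all regular files under the directory given as $1.
# Assumes GNU find, whose -printf '%T@' prints seconds with a fractional part.
newest_mtime() {
    find "$1" -type f -printf '%T@\n' | sort -n | tail -n 1 | cut -d. -f1
}

# Usage on each node (path taken from the article):
#   newest_mtime /var/lib/etcd/default.etcd/member
#
# Or list the files remotely, assuming SSH access to the three node IPs:
#   for h in 10.0.0.150 10.0.0.151 10.0.0.152; do
#       ssh "$h" "ls -lR --time-style=full-iso /var/lib/etcd/default.etcd/member"
#   done
```

The node that prints the largest number holds the most recently written files, which is the criterion used above to pick node 2.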
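Fix
I stopped all 3 nodes of the etcd cluster, copied everything under /var/lib/etcd/default.etcd/member on node 2 to the other two nodes, and started the nodes again. After the restart the cluster was back to normal.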
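A minimal sketch of the copy step, with one precaution that is my own assumption and not part of the original procedure: the damaged directory is moved aside instead of being overwritten, so it can still be inspected afterwards. The function name replace_member_dir and the staging path /tmp/member.node2 are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace directory $2 with a copy of directory $1.
# The old $2 is renamed with a timestamp suffix rather than deleted.
replace_member_dir() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "source $src not found" >&2; return 1; }
    if [ -e "$dst" ]; then
        # Keep the damaged data next to the new copy for later inspection
        mv "$dst" "$dst.broken.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    # -a preserves ownership, permissions and timestamps
    cp -a "$src" "$dst"
}

# Sketch of the whole sequence (run as root; etcd.service is the unit
# name shown in the log, 10.0.0.151 is etcd2 in ETCD_INITIAL_CLUSTER):
#   systemctl stop etcd        # on all three nodes
#   scp -r 10.0.0.151:/var/lib/etcd/default.etcd/member /tmp/member.node2   # on nodes 1 and 3
#   replace_member_dir /tmp/member.node2 /var/lib/etcd/default.etcd/member  # on nodes 1 and 3
#   systemctl start etcd       # on all three nodes
```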
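Summary
Don't tear everything down and rebuild at the first sign of trouble. When practicing, act as if you were facing a production environment.
If it is a database, it must be backed up, or you will regret it when it is too late.
Above all, back up on a schedule. In the extreme case where every node is damaged at once, you are out of luck.
Scheduled etcd data backup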
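#!/bin/sh
# Archive /var/lib/etcd into a dated, gzip-compressed tarball under /home/etcd.
cd /var/lib || exit 1
name="etcd-bak$(date "+%Y%m%d")"
# -z enables gzip so the content matches the .tar.gz extension
tar -czvf "/home/etcd/${name}.tar.gz" etcd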
crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
59 23 * * * (/path/to/backup.sh)
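As a possible complement to the tar archive, etcd can also write a consistent snapshot of a running member through `etcdctl snapshot save`. The sketch below is an assumption-laden example, not what was used in this incident: the helper names are made up, the endpoint and certificate paths are the ones that appear in the startup log above, and reusing etcd.pem as the client certificate is a guess about this setup.

```shell
#!/bin/sh
# Hypothetical helper: build a dated snapshot path,
# e.g. /home/etcd/etcd-snap20200624.db
# $1 = target directory, $2 = optional YYYYMMDD date (defaults to today).
snapshot_path() {
    echo "$1/etcd-snap${2:-$(date +%Y%m%d)}.db"
}

# Hypothetical helper: take a v3 snapshot through the etcd API.
# Endpoint and certificate paths are copied from the startup log;
# adjust them if the client certificates differ on your cluster.
take_snapshot() {
    ETCDCTL_API=3 etcdctl \
        --endpoints=https://10.0.0.150:2379 \
        --cacert=/opt/kubernetes/ssl/ca.pem \
        --cert=/opt/kubernetes/ssl/etcd.pem \
        --key=/opt/kubernetes/ssl/etcd-key.pem \
        snapshot save "$(snapshot_path "$1")"
}

# Usage: take_snapshot /home/etcd
```

Unlike a tar of the live data directory, the snapshot is produced by etcd itself, so it does not depend on catching the files on disk in a consistent state.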
If you see it differently, you are welcome to say so in the comments below (no registration needed to comment).