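Rejoining a failed Kubernetes node to the cluster
# Cleaning up stale data on a Kubernetes node
# 1. Evict workloads from the node
# Cordon first so that no new pods are scheduled onto the node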
kubectl cordon <node-name>
# Evict all pods (skip DaemonSet pods, delete emptyDir data; --force also removes pods that no controller manages)
kubectl drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=30
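Before deleting the node object, it can help to confirm that the drain really emptied the node. The helper below is a minimal sketch and not part of the original procedure: the function name `remaining_pods_on_node` is made up for illustration, and it assumes `kubectl` is configured against the cluster. It lists the pods still bound to the node and filters out those owned by a DaemonSet, which `--ignore-daemonsets` leaves in place by design.

```shell
# Hypothetical helper: print namespace/name of every pod still bound to the
# given node, excluding pods whose first owner is a DaemonSet.
remaining_pods_on_node() {
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$1" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.metadata.ownerReferences[0].kind}{"\n"}{end}' |
    awk '$2 != "DaemonSet" { print $1 }'
}

# Usage (on a machine with cluster access):
#   remaining_pods_on_node <node-name>
```

An empty result suggests that only DaemonSet pods remain, which is the state the drain is meant to reach.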
# 2. Remove the node from the cluster
# Delete the node object (run on the control plane)
kubectl delete node <node-name>
# 3. Reset kubeadm on the node (run on the node itself)
kubeadm reset -f
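This cleans up:
- etcd data (only on control-plane nodes with a local etcd member; worker nodes have none)
- kubelet state under /var/lib/kubelet/
- certificates and kubeconfig files under /etc/kubernetes/
It does not remove the CNI configuration in /etc/cni/net.d/, and it does not flush iptables or IPVS rules. Those are cleaned up manually in the next step.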
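# 4. Manually clean up leftover data
# Network rules
# Note: these commands flush every rule in the filter, nat, and mangle tables, not only the rules Kubernetes created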
iptables -F && iptables -X
iptables -t nat -F && iptables -t nat -X
iptables -t mangle -F && iptables -t mangle -X
ipvsadm --clear 2>/dev/null || true
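Because the flush above is not limited to Kubernetes rules, a snapshot taken beforehand makes a rollback possible. This is a minimal sketch under assumptions: the function name `backup_iptables` and the default target directory are illustrative, and it relies on the standard `iptables-save` / `iptables-restore` tools being installed on the node.

```shell
# Hypothetical helper: dump the current ruleset to a timestamped file in the
# given directory (default /root) and print the file name. Run as root.
backup_iptables() {
  out="${1:-/root}/iptables-$(date +%Y%m%d-%H%M%S).rules"
  iptables-save > "$out" && printf '%s\n' "$out"
}

# Usage:
#   backup_iptables /root                                  # before iptables -F ...
#   iptables-restore < /root/iptables-<timestamp>.rules    # to roll back
```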
# CNI network interfaces
# List leftover virtual interfaces
ip link show
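# Delete common leftover CNI interfaces (adjust for the CNI actually in use)
ip link delete cni0 2>/dev/null || true
ip link delete flannel.1 2>/dev/null || true
# Calico workload interfaces are named cali<suffix>; ip link does not expand wildcards, so loop over them
for dev in $(ip -o link show | awk -F': ' '{print $2}' | sed 's/@.*//' | grep '^cali'); do ip link delete "$dev" 2>/dev/null || true; done
# tunl0 is owned by the ipip kernel module, so this delete may fail; unloading the module (modprobe -r ipip) removes it
ip link delete tunl0 2>/dev/null || true
ip link delete vxlan.calico 2>/dev/null || true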
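Interface names differ between CNI plugins, so it is safer to list the candidates before deleting anything. The filter below is a sketch: the function name `filter_cni_links` and the name patterns are assumptions drawn from the interfaces mentioned in this section (cni0, flannel.N, cali*, tunl0, vxlan.calico) and need adjusting for other plugins. It reads `ip -o link show` output on stdin and prints only the matching interface names.

```shell
# Hypothetical helper: read "ip -o link show" output on stdin and print the
# names of interfaces that look like CNI leftovers.
filter_cni_links() {
  awk -F': ' '{ sub(/@.*/, "", $2); print $2 }' |
    grep -E '^(cni0|flannel\.[0-9]+|cali|tunl0|vxlan\.calico)' || true
}

# Usage (as root on the node): review the list first, then delete.
#   ip -o link show | filter_cni_links
#   ip -o link show | filter_cni_links | while read -r dev; do ip link delete "$dev"; done
```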
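# Leftover directories and files
# Leftover pod mounts (important: unmount before deleting anything under /var/lib/kubelet, deepest paths first)
mount | grep '/var/lib/kubelet' | awk '{print $3}' | sort -r | xargs -r umount 2>/dev/null || true
# kubelet data
rm -rf /var/lib/kubelet/*
# CNI configuration
rm -rf /etc/cni/net.d/*
# kubeadm configuration
rm -rf /etc/kubernetes/*
# Container runtime leftovers (containerd)
crictl rm -f $(crictl ps -aq) 2>/dev/null || true
crictl rmi $(crictl images -q) 2>/dev/null || true
# Local PV data and leftover etcd data (adjust to your actual paths; /var/lib/etcd/ exists only on control-plane nodes)
rm -rf /var/lib/rancher/
rm -rf /var/lib/etcd/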
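Deleting files under /var/lib/kubelet while volumes are still mounted there can reach into the mounted data, so the unmount has to come first and nested mounts have to be released before their parents. The helper below sketches one way to get that order. The function name `kubelet_mounts` is hypothetical, the optional argument exists only so the function can be exercised against a sample file, and mount points containing spaces (escaped in /proc/mounts) are not handled.

```shell
# Hypothetical helper: print every mount point under /var/lib/kubelet,
# deepest paths first, reading /proc/mounts (or a file given as $1).
kubelet_mounts() {
  awk '$2 ~ "^/var/lib/kubelet" { print $2 }' "${1:-/proc/mounts}" |
    LC_ALL=C sort -r
}

# Usage (as root on the node):
#   kubelet_mounts | while read -r mp; do umount "$mp"; done
#   kubelet_mounts        # should print nothing before any rm -rf
```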
# containerd leftovers
systemctl stop containerd
rm -rf /var/lib/containerd/*
systemctl start containerd
# 5. Restart services and verify
systemctl daemon-reload
systemctl restart containerd
systemctl restart kubelet # kubelet fails at this point; this is expected because the node has not joined yet
# Confirm that no CNI interfaces remain
ip link show | grep -E 'cni|flannel|calico|tunl|vxlan'
# Confirm that no kubelet mounts remain
mount | grep kubelet
# 6. Rejoin the cluster
Once the cleanup is complete, run the join again:
kubeadm join 192.168.1.100:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
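The join needs the SHA-256 hash of the cluster CA public key. If the original join command is no longer at hand, the hash can be recomputed on a control-plane node. The sketch below assumes `openssl` is installed and that the CA certificate sits at kubeadm's default path /etc/kubernetes/pki/ca.crt; the function name `ca_cert_hash` is made up for illustration.

```shell
# Hypothetical helper: print the value for --discovery-token-ca-cert-hash,
# i.e. the SHA-256 digest of the CA certificate's DER-encoded public key.
ca_cert_hash() {
  ca="${1:-/etc/kubernetes/pki/ca.crt}"
  hash="$(openssl x509 -pubkey -noout -in "$ca" |
    openssl pkey -pubin -outform DER |
    openssl dgst -sha256 -hex |
    sed 's/^.* //')"
  printf 'sha256:%s\n' "$hash"
}

# Usage (on a control-plane node):
#   ca_cert_hash
```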
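# Cleanup scope at a glance
| Type | Command / path | Notes |
|---|---|---|
| iptables rules | iptables -F family | Keeps old rules from interfering with the new network |
| CNI interfaces | ip link delete | Virtual interfaces such as cni0 and flannel.1 |
| kubelet data | /var/lib/kubelet/* | Pod state and volume mount records |
| Kubernetes configuration | /etc/kubernetes/* | Certificates and kubeconfig files |
| containerd data | /var/lib/containerd/* | Images and container layer data |
| CNI configuration | /etc/cni/net.d/* | CNI plugin configuration files |
If the node hosts local PVs (local-path / hostPath), confirm before cleaning that the data is backed up or that the PVCs have been migrated, so that application data is not deleted by mistake.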
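To see which volumes fall under that warning, the PersistentVolumes backed by a node-local path can be listed from the control plane. This is a rough sketch: `list_node_local_pvs` is a hypothetical name, and the output covers the whole cluster rather than one node, so the paths still have to be matched against the node being cleaned.

```shell
# Hypothetical helper: print "<pv> <path> <namespace/claim>" (tab-separated)
# for every PersistentVolume that uses spec.local or spec.hostPath.
list_node_local_pvs() {
  kubectl get pv \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.local.path}{.spec.hostPath.path}{"\t"}{.spec.claimRef.namespace}{"/"}{.spec.claimRef.name}{"\n"}{end}' |
    awk -F '\t' '$2 != ""'
}

# Usage (with cluster access, before wiping the node):
#   list_node_local_pvs
```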
# 7. Troubleshooting errors with an AI assistant
# Error 1
[root@k8s-node03 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Sat 2026-06-06 15:52:50 CST; 5s ago
Docs: https://kubernetes.io/docs/
Process: 15253 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
Main PID: 15253 (code=exited, status=1/FAILURE)
CPU: 34ms
[root@k8s-node03 ~]# journalctl -f
Jun 06 15:52:50 k8s-node03 kubelet[15253]: E0606 15:52:50.941185 15253 run.go:72] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
Jun 06 15:52:50 k8s-node03 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 15:52:50 k8s-node03 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 06 15:53:00 k8s-node03 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 129.
Jun 06 15:53:00 k8s-node03 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jun 06 15:53:00 k8s-node03 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jun 06 15:53:01 k8s-node03 kubelet[15261]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jun 06 15:53:01 k8s-node03 kubelet[15261]: E0606 15:53:01.010479 15261 run.go:72] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
Jun 06 15:53:01 k8s-node03 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 15:53:01 k8s-node03 systemd[1]: kubelet.service: Failed with result 'exit-code'.
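The error is unambiguous: /var/lib/kubelet/config.yaml does not exist. kubeadm reset (together with the manual cleanup) removed it, the node has not rejoined yet, and kubelet keeps restarting in a loop.
This is expected. There is no need to create the file by hand; kubeadm join generates it.
Just run the join command:
# First obtain a fresh join command on the control plane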
kubeadm token create --print-join-command
Then run the printed join command on k8s-node03:
kubeadm join 192.168.x.x:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
A successful join writes /var/lib/kubelet/config.yaml, after which kubelet starts normally.
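A quick way to tell this expected crash loop apart from a real kubelet problem is to check for the files that `kubeadm join` writes. The sketch below uses a hypothetical function name, `kubelet_join_state`; the optional argument is a root prefix that exists only to make the function easy to try outside a real node.

```shell
# Hypothetical helper: report whether the files written by "kubeadm join"
# (kubelet config and kubelet kubeconfig) are present.
kubelet_join_state() {
  root="${1:-}"
  if [ -f "${root}/var/lib/kubelet/config.yaml" ] &&
     [ -f "${root}/etc/kubernetes/kubelet.conf" ]; then
    echo "joined"
  else
    echo "not-joined"
  fi
}

# Usage (on the node):
#   kubelet_join_state    # "not-joined" means the restart loop is expected
```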
# Error 2
[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 --token 2nspem.bg26kezi09o18hbs --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config.yaml' to re-upload it.
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get component configs: could not download the kubelet configuration from ConfigMap "kubelet-config": configmaps "kubelet-config" is forbidden: User "system:bootstrap:2nspem" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher
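This error means the bootstrap token lacks permission to read the kubelet-config ConfigMap.
The usual cause is missing RBAC for the system:bootstrappers group: either the bootstrap ClusterRoleBindings, or the Role and RoleBinding in kube-system that grant read access to kubelet-config.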
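The permission can be tested directly from the control plane by impersonating the bootstrap identity, which avoids a full join attempt per fix. This is a sketch under assumptions: the function names are hypothetical, an admin kubeconfig is required for impersonation, and the second group is the one kubeadm normally assigns to the tokens it creates. The user name follows the `system:bootstrap:<token-id>` form visible in the error message above.

```shell
# Hypothetical helpers. A bootstrap token has the form <token-id>.<secret>;
# the API server authenticates it as user system:bootstrap:<token-id>.
token_id() {
  printf '%s\n' "${1%%.*}"
}

# Ask the API server whether that identity may read the kubelet-config
# ConfigMap. kubectl prints "yes" or "no".
bootstrap_can_read_kubelet_config() {
  kubectl auth can-i get configmap/kubelet-config \
    --namespace kube-system \
    --as "system:bootstrap:$(token_id "$1")" \
    --as-group system:bootstrappers \
    --as-group system:bootstrappers:kubeadm:default-node-token
}

# Usage (on the control plane):
#   bootstrap_can_read_kubelet_config <token>
```

Note that the API server can also answer "forbidden" for an object that does not exist, so a "no" here does not by itself tell whether the ConfigMap is present.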
# Fix on the control plane
# Step 1: check whether the ClusterRoleBindings exist
kubectl get clusterrolebinding kubeadm:kubelet-bootstrap
kubectl get clusterrolebinding kubeadm:node-autoapprove-bootstrap
kubectl get clusterrolebinding kubeadm:node-autoapprove-certificate-rotation
# Step 2: recreate any that are missing
# Allow bootstrap tokens to submit CSRs (system:node-bootstrapper covers only certificatesigningrequests)
kubectl create clusterrolebinding kubeadm:kubelet-bootstrap \
--clusterrole=system:node-bootstrapper \
--group=system:bootstrappers
# Allow automatic approval of CSRs
kubectl create clusterrolebinding kubeadm:node-autoapprove-bootstrap \
--clusterrole=system:certificates.k8s.io:certificatesigningrequests:nodeclient \
--group=system:bootstrappers
kubectl create clusterrolebinding kubeadm:node-autoapprove-certificate-rotation \
--clusterrole=system:certificates.k8s.io:certificatesigningrequests:selfnodeclient \
--group=system:nodes
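# Step 3: add read permission for kubelet-config
Note: the two commands in this step are a workaround, not how kubeadm grants this access. kubeadm uses a Role and RoleBinding named kubeadm:kubelet-config in kube-system. Binding system:node-proxier to bootstrap tokens is not needed for reading kubelet-config, and modifying the built-in system:node-bootstrapper role is best avoided.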
kubectl create clusterrolebinding kubeadm:node-proxier \
--clusterrole=system:node-proxier \
--group=system:bootstrappers 2>/dev/null || true
# Core change: allow bootstrappers to read kubelet-config
kubectl patch clusterrole system:node-bootstrapper \
--type=json \
-p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["configmaps"],"verbs":["get"],"resourceNames":["kubelet-config"]}}]' 2>/dev/null || true
Alternatively, and more directly, restore the default RBAC with kubeadm:
kubeadm init phase bootstrap-token
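This command recreates the bootstrap-token RBAC rules and the cluster-info ConfigMap without affecting existing nodes. It also generates a new bootstrap token, and it does not recreate the kubelet-config ConfigMap or the Role and RoleBinding that guard it.
# Step 4: go back to node03 and rejoin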
kubeadm join 192.168.51.8:6443 \
--token 2nspem.bg26kezi09o18hbs \
--discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
# Error 3
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:kubelet-bootstrap
NAME ROLE AGE
kubeadm:kubelet-bootstrap ClusterRole/system:node-bootstrapper 17d
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:node-autoapprove-bootstrap
NAME ROLE AGE
kubeadm:node-autoapprove-bootstrap ClusterRole/system:certificates.k8s.io:certificatesigningrequests:nodeclient 17d
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:node-autoapprove-certificate-rotation
NAME ROLE AGE
kubeadm:node-autoapprove-certificate-rotation ClusterRole/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient 17d
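All three ClusterRoleBindings exist, so they are not what is missing. And yes, the next command is run on the control plane (master) node: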
[root@k8s-master01 ~]# kubeadm init phase bootstrap-token
Afterwards, go back to node03 and rerun the join.
# Error 4
[root@k8s-master01 ~]# kubeadm init phase bootstrap-token
W0606 15:46:32.365998 192664 version.go:109] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get "https://dl.k8s.io/release/stable-1.txt": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W0606 15:46:32.366081 192664 version.go:110] falling back to the local client version: v1.32.12
[bootstrap-token] Using token: 2d01x8.pjqe9l4xvqu3ritt
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[root@k8s-master01 ~]# kubectl describe clusterrole system:node-bootstrapper
Name: system:node-bootstrapper
Labels: kubernetes.io/bootstrapping=rbac-defaults
Annotations: rbac.authorization.kubernetes.io/autoupdate: true
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
certificatesigningrequests.certificates.k8s.io [] [] [create get list watch]
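The bootstrap-token RBAC has been reapplied. Note, however, that system:node-bootstrapper contains no read permission on configmaps; access to kubelet-config is granted separately, through a Role and RoleBinding in kube-system.
Check whether these exist: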
kubectl get clusterrolebinding kubeadm:get-nodes
and:
kubectl get rolebinding -n kube-system | grep kubelet-config
That said, bootstrap-token has now run, so try rejoining from node03 first; it may already succeed:
[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 \
--token 2nspem.bg26kezi09o18hbs \
--discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
Post the output.
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:get-nodes
NAME ROLE AGE
kubeadm:get-nodes ClusterRole/kubeadm:get-nodes 17d
[root@k8s-master01 ~]# kubectl get rolebinding -n kube-system | grep kubelet-config
[root@k8s-master01 ~]#
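No RoleBinding related to kubelet-config exists in kube-system, and this is the root cause of the forbidden error. First check whether the corresponding Role exists:
kubectl get role kubeadm:kubelet-config -n kube-system
If the Role is missing as well, create it:
kubectl create role kubeadm:kubelet-config \
--namespace=kube-system \
--verb=get \
--resource=configmaps \
--resource-name=kubelet-config
Then create the RoleBinding (kubeadm itself binds the narrower group system:bootstrappers:kubeadm:default-node-token; system:bootstrappers covers all bootstrap tokens):
kubectl create rolebinding kubeadm:kubelet-config \
--namespace=kube-system \
--role=kubeadm:kubelet-config \
--group=system:bootstrappers \
--group=system:nodes
After that, go back to node03 and rerun the join.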
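`kubectl create` fails if the object already exists, which is awkward when a fix is retried several times. A common way around that is to render the object with a client-side dry run and pipe it to `kubectl apply`. The function below is a sketch of that pattern for the Role and RoleBinding discussed above; the function name `ensure_kubelet_config_rbac` is hypothetical, and the groups are the same ones used in the manual commands.

```shell
# Hypothetical helper: create or update the Role and RoleBinding that let
# joining nodes read kube-system/kubelet-config. Safe to run repeatedly.
ensure_kubelet_config_rbac() {
  kubectl create role kubeadm:kubelet-config \
    --namespace=kube-system \
    --verb=get --resource=configmaps --resource-name=kubelet-config \
    --dry-run=client -o yaml | kubectl apply --namespace=kube-system -f -

  kubectl create rolebinding kubeadm:kubelet-config \
    --namespace=kube-system \
    --role=kubeadm:kubelet-config \
    --group=system:bootstrappers --group=system:nodes \
    --dry-run=client -o yaml | kubectl apply --namespace=kube-system -f -
}

# Usage (on the control plane, with an admin kubeconfig):
#   ensure_kubelet_config_rbac
```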
# Error 5
[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 --token 2nspem.bg26kezi09o18hbs --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config.yaml' to re-upload it.
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get component configs: could not download the kubelet configuration from ConfigMap "kubelet-config": configmaps "kubelet-config" not found
To see the stack trace of this error execute with --v=5 or higher
The error has changed: the kubelet-config ConfigMap itself does not exist and has to be uploaded again.
On the master, run:
kubeadm init phase upload-config kubelet
Then confirm that the ConfigMap was created:
kubectl get configmap kubelet-config -n kube-system
Then go back to node03 and rerun the join.
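Errors 2 through 5 all trace back to three objects in kube-system that a joining node depends on: the kubelet-config ConfigMap, plus the Role and RoleBinding that allow reading it. Checking all three on the control plane before retrying the join saves a round trip per missing object. The helper below is a sketch with a hypothetical name, `missing_join_prereqs`; it prints only the objects that are absent.

```shell
# Hypothetical helper: print each object needed for reading kubelet-config
# that is currently missing in kube-system; prints nothing when all exist.
missing_join_prereqs() {
  for obj in configmap/kubelet-config \
             role/kubeadm:kubelet-config \
             rolebinding/kubeadm:kubelet-config; do
    kubectl get "$obj" --namespace kube-system -o name >/dev/null 2>&1 ||
      echo "$obj"
  done
}

# Usage (on the control plane):
#   missing_join_prereqs    # empty output: retry kubeadm join on the node
```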