Rejoining a Failed Kubernetes Node to the Cluster

# Cleaning up stale data on a Kubernetes node

# 1. Evict workloads from the node

# Cordon first so that no new pods are scheduled onto the node
kubectl cordon <node-name>

# Evict all pods (skip DaemonSet pods, delete emptyDir data, force-remove pods not managed by a controller)
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=30
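
To confirm that the drain worked, it can help to list what is still running on the node. The sketch below assumes a three-column listing (namespace, pod name, owner kind); the function name `pods_still_on_node` and the `NODE` variable are illustrative and not part of kubectl.

```shell
# Reads "<namespace> <pod> <owner-kind>" lines on stdin and prints the pods a
# drain should have evicted: everything except DaemonSet pods and static
# (mirror) pods, whose owner kind is Node.
pods_still_on_node() {
  awk '$3 != "DaemonSet" && $3 != "Node" {print $1 "/" $2}'
}

# Usage against a real cluster (NODE is a placeholder for the node name):
# kubectl get pods -A --field-selector spec.nodeName="$NODE" --no-headers \
#   -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
#   | pods_still_on_node
```

Empty output suggests that only DaemonSet and static pods remain on the node.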

# 2. Remove the node from the cluster

# Delete the node object, from the control plane
kubectl delete node <node-name>

# 3. Reset kubeadm on the node (run on the node itself)

kubeadm reset -f

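It cleans up:

local etcd member data (control-plane nodes only; workers have none)
kubelet state under /var/lib/kubelet
certificates, static Pod manifests, and kubeconfig files under /etc/kubernetes

It does not remove the CNI configuration in /etc/cni/net.d/ and does not reset the iptables/IPVS rules left by kube-proxy. kubeadm prints a reminder to clean these up by hand, which the next section covers.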

# 4. Manually clean up leftover data

# Network rules

iptables -F && iptables -X
iptables -t nat -F && iptables -t nat -X
iptables -t mangle -F && iptables -t mangle -X
ipvsadm --clear 2>/dev/null || true
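
One way to check the flush is to count the KUBE-* chains that kube-proxy creates, as they appear in `iptables-save` output. This is a minimal sketch: `count_kube_chains` is a hypothetical helper name, and taking a backup first is an added precaution, not a step from the procedure above.

```shell
# Count chains whose name starts with KUBE- in iptables-save output (stdin).
# Chain declarations in that format look like ":KUBE-SERVICES - [0:0]".
count_kube_chains() {
  grep -c '^:KUBE-' || true
}

# Usage on the node (requires root):
# iptables-save > /root/iptables.before-cleanup   # optional backup before flushing
# iptables-save | count_kube_chains               # 0 means no KUBE-* chains remain
```

Note that `iptables -F` also removes any non-Kubernetes firewall rules on the host, which is why a backup may be worth keeping.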

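# CNI network interfaces

# List leftover virtual interfaces
ip link show

# Delete common CNI leftovers (adjust to the CNI plugin actually in use)
ip link delete cni0       2>/dev/null || true
ip link delete flannel.1  2>/dev/null || true
ip link delete tunl0      2>/dev/null || true
ip link delete vxlan.calico 2>/dev/null || true
# Calico veth interfaces are named cali<id>; ip link does not expand wildcards, so loop over them
for dev in $(ip -o link show | awk -F': ' '{print $2}' | cut -d@ -f1 | grep '^cali'); do
  ip link delete "$dev" 2>/dev/null || true
done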

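# Leftover directories and files

# Unmount leftover pod mounts first (important: otherwise the rm -rf below can delete data inside volumes that are still mounted)
umount $(mount | grep '/var/lib/kubelet' | awk '{print $3}') 2>/dev/null || true

# kubelet data (includes /var/lib/kubelet/pods)
rm -rf /var/lib/kubelet/*

# CNI configuration
rm -rf /etc/cni/net.d/*

# kubeadm configuration
rm -rf /etc/kubernetes/*

# Container runtime leftovers (containerd); -f also removes containers that are still running
crictl rm -f $(crictl ps -aq) 2>/dev/null || true
crictl rmi $(crictl images -q) 2>/dev/null || true

# Local PV data (adjust to the actual paths; this is destructive)
rm -rf /var/lib/rancher/
# etcd data (present only on control-plane nodes)
rm -rf /var/lib/etcd/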
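
When pod volumes are nested, the order of unmounting matters: a parent cannot be unmounted while a child mount is still active. The sketch below lists kubelet mount points deepest first. `kubelet_mounts` is a hypothetical helper name, and it assumes mount points contain no spaces.

```shell
# Read `mount` output on stdin and print mount points under /var/lib/kubelet,
# deepest paths first, so that nested mounts are unmounted before their parents.
kubelet_mounts() {
  awk '$3 ~ "^/var/lib/kubelet" {print $3}' | LC_ALL=C sort -r
}

# Usage on the node (requires root):
# mount | kubelet_mounts | while read -r mp; do umount "$mp"; done
```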

# containerd leftovers

systemctl stop containerd

rm -rf /var/lib/containerd/*

systemctl start containerd

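# 5. Restart services and verify

systemctl daemon-reload
systemctl restart containerd
systemctl restart kubelet   # kubelet fails at this point, which is expected because the node has not joined yet

# Confirm the network interfaces are clean (no output expected)
ip link show | grep -E 'cni|flannel|cali|tunl|vxlan'

# Confirm there are no leftover mounts (no output expected)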
mount | grep kubelet
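
The two checks can also be turned into pass/fail tests, so that a script refuses to continue while leftovers remain. A minimal sketch; the function names are hypothetical, and the interface patterns mirror the grep above.

```shell
# Succeeds only when stdin (`ip -o link show` output) contains no interface
# whose name starts with one of the CNI prefixes.
no_cni_links() {
  ! grep -Eq ': (cni|flannel|cali|tunl|vxlan)'
}

# Succeeds only when stdin (`mount` output) has no mount under /var/lib/kubelet.
no_kubelet_mounts() {
  ! grep -q ' /var/lib/kubelet'
}

# Usage on the node:
# ip -o link show | no_cni_links && mount | no_kubelet_mounts && echo "node is clean"
```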

# 6. Rejoin the cluster

Once the cleanup is finished, run the join again:

kubeadm join 192.168.1.100:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>
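
Before running the join, a quick format check can catch a placeholder or a truncated paste. The sketch below relies on the documented bootstrap token format (six characters, a dot, sixteen characters, all lowercase alphanumerics) and a 64-digit hex SHA-256 hash; `valid_join_args` is a hypothetical helper name.

```shell
# Return 0 when the arguments look like a well-formed bootstrap token and CA hash.
# This checks the format only; it cannot tell whether the token is still valid.
valid_join_args() {
  token=$1
  hash=$2
  printf '%s\n' "$token" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$' &&
    printf '%s\n' "$hash" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Usage (TOKEN and HASH are placeholders for the real values):
# valid_join_args "$TOKEN" "$HASH" || echo "token or hash is malformed" >&2
```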

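# Cleanup scope quick reference

| Type | Command / path | Notes |
| --- | --- | --- |
| iptables rules | `iptables -F` family | Keeps old rules from interfering with the new network |
| CNI interfaces | `ip link delete` | Virtual interfaces such as cni0 and flannel.1 |
| kubelet data | `/var/lib/kubelet/*` | Pod state, volume mount records |
| Kubernetes configuration | `/etc/kubernetes/*` | Certificates, kubeconfig |
| containerd data | `/var/lib/containerd/*` | Images, container layer data |
| CNI configuration | `/etc/cni/net.d/*` | CNI plugin configuration files |

If the node uses local PVs (local-path / hostPath), confirm before cleaning that the data is backed up or the PVCs have been migrated, so that application data is not deleted by mistake.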

# 7. Fixing the errors with AI assistance

# Error 1

[root@k8s-node03 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: activating (auto-restart) (Result: exit-code) since Sat 2026-06-06 15:52:50 CST; 5s ago
       Docs: https://kubernetes.io/docs/
    Process: 15253 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
   Main PID: 15253 (code=exited, status=1/FAILURE)
        CPU: 34ms
[root@k8s-node03 ~]# journalctl -f
Jun 06 15:52:50 k8s-node03 kubelet[15253]: E0606 15:52:50.941185   15253 run.go:72] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
Jun 06 15:52:50 k8s-node03 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 15:52:50 k8s-node03 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 06 15:53:00 k8s-node03 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 129.
Jun 06 15:53:00 k8s-node03 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jun 06 15:53:00 k8s-node03 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jun 06 15:53:01 k8s-node03 kubelet[15261]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jun 06 15:53:01 k8s-node03 kubelet[15261]: E0606 15:53:01.010479   15261 run.go:72] "command failed" err="failed to load kubelet config file, path: /var/lib/kubelet/config.yaml, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory"
Jun 06 15:53:01 k8s-node03 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jun 06 15:53:01 k8s-node03 systemd[1]: kubelet.service: Failed with result 'exit-code'.

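The error is clear: /var/lib/kubelet/config.yaml does not exist. The cleanup (kubeadm reset and the removal of /var/lib/kubelet/*) deleted it, the node has not rejoined yet, and kubelet keeps crash-looping in the meantime.

This is expected. There is no need to create the file by hand; kubeadm join generates it.

Just run the join command:

# First get a fresh join command on the control plane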
kubeadm token create --print-join-command

Then run the printed join command on k8s-node03:

kubeadm join 192.168.x.x:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

Once the join succeeds, /var/lib/kubelet/config.yaml has been written and kubelet starts normally.
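
To tell this harmless pre-join state apart from a real kubelet failure, the log can be checked for the specific message shown above. A minimal sketch; `kubelet_awaiting_join` is a hypothetical helper name.

```shell
# Succeeds when stdin (kubelet log lines) contains the "config.yaml is missing"
# error that is expected between kubeadm reset and kubeadm join.
kubelet_awaiting_join() {
  grep -q 'open /var/lib/kubelet/config.yaml: no such file or directory'
}

# Usage on the node:
# journalctl -u kubelet -n 20 --no-pager | kubelet_awaiting_join \
#   && echo "kubelet is only waiting for kubeadm join"
```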

# Error 2

[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 --token 2nspem.bg26kezi09o18hbs --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config.yaml' to re-upload it.
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get component configs: could not download the kubelet configuration from ConfigMap "kubelet-config": configmaps "kubelet-config" is forbidden: User "system:bootstrap:2nspem" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher

This error means the bootstrap token lacks permission to read the kubelet-config ConfigMap.

A common cause is missing RBAC bindings for the system:bootstrappers group.

# Fix on the control plane

# Step 1: check whether the ClusterRoleBindings exist

kubectl get clusterrolebinding kubeadm:kubelet-bootstrap
kubectl get clusterrolebinding kubeadm:node-autoapprove-bootstrap
kubectl get clusterrolebinding kubeadm:node-autoapprove-certificate-rotation

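# Step 2: recreate any that are missing

# Allow bootstrap tokens to submit CSRs (this role does not grant access to kubelet-config)
kubectl create clusterrolebinding kubeadm:kubelet-bootstrap \
  --clusterrole=system:node-bootstrapper \
  --group=system:bootstrappers

# Allow CSRs to be approved automatically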
kubectl create clusterrolebinding kubeadm:node-autoapprove-bootstrap \
  --clusterrole=system:certificates.k8s.io:certificatesigningrequests:nodeclient \
  --group=system:bootstrappers

kubectl create clusterrolebinding kubeadm:node-autoapprove-certificate-rotation \
  --clusterrole=system:certificates.k8s.io:certificatesigningrequests:selfnodeclient \
  --group=system:nodes

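# Step 3: add read access to kubelet-config

kubectl create clusterrolebinding kubeadm:node-proxier \
  --clusterrole=system:node-proxier \
  --group=system:bootstrappers 2>/dev/null || true

# Key step: allow bootstrappers to read kubelet-config (a workaround; Error 4 below shows the access is normally granted through a Role and RoleBinding in kube-system)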
kubectl patch clusterrole system:node-bootstrapper \
  --type=json \
  -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["configmaps"],"verbs":["get"],"resourceNames":["kubelet-config"]}}]' 2>/dev/null || true

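Or, more directly, restore the default RBAC with kubeadm:

kubeadm init phase bootstrap-token

This command recreates the bootstrap-token RBAC rules and the cluster-info ConfigMap without affecting existing nodes, which makes it the cleanest fix. It also generates a new bootstrap token, and it does not recreate the kubelet-config ConfigMap or the RoleBinding that grants access to it.

# Step 4: go back to node03 and rejoin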

kubeadm join 192.168.51.8:6443 \
  --token 2nspem.bg26kezi09o18hbs \
  --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258

# Error 3

[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:kubelet-bootstrap
NAME                        ROLE                                   AGE
kubeadm:kubelet-bootstrap   ClusterRole/system:node-bootstrapper   17d
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:node-autoapprove-bootstrap
NAME                                 ROLE                                                                           AGE
kubeadm:node-autoapprove-bootstrap   ClusterRole/system:certificates.k8s.io:certificatesigningrequests:nodeclient   17d
[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:node-autoapprove-certificate-rotation
NAME                                            ROLE                                                                               AGE
kubeadm:node-autoapprove-certificate-rotation   ClusterRole/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient   17d

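All three ClusterRoleBindings exist, so the next step is the bootstrap-token phase, run on the control plane (master) node:

[root@k8s-master01 ~]# kubeadm init phase bootstrap-token

Then go back to node03 and rejoin.

# Error 4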

[root@k8s-master01 ~]# kubeadm init phase bootstrap-token
W0606 15:46:32.365998  192664 version.go:109] could not fetch a Kubernetes version from the internet: unable to get URL "https://dl.k8s.io/release/stable-1.txt": Get "https://dl.k8s.io/release/stable-1.txt": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
W0606 15:46:32.366081  192664 version.go:110] falling back to the local client version: v1.32.12
[bootstrap-token] Using token: 2d01x8.pjqe9l4xvqu3ritt
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[root@k8s-master01 ~]# kubectl describe clusterrole system:node-bootstrapper
Name:         system:node-bootstrapper
Labels:       kubernetes.io/bootstrapping=rbac-defaults
Annotations:  rbac.authorization.kubernetes.io/autoupdate: true
PolicyRule:
  Resources                                       Non-Resource URLs  Resource Names  Verbs
  ---------                                       -----------------  --------------  -----
  certificatesigningrequests.certificates.k8s.io  []                 []              [create get list watch]

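The bootstrap-token RBAC has been restored. Note, however, that system:node-bootstrapper contains no read permission on configmaps; reading kubelet-config is authorized separately, through a Role and RoleBinding in the kube-system namespace.

Check whether these exist:

kubectl get clusterrolebinding kubeadm:get-nodes

And:

kubectl get rolebinding -n kube-system | grep kubelet-config

Since bootstrap-token has already run, it is also worth retrying the join on node03 right away; there is a good chance it works now: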

[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 \
  --token 2nspem.bg26kezi09o18hbs \
  --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258

The two checks on the control plane return:

[root@k8s-master01 ~]# kubectl get clusterrolebinding kubeadm:get-nodes
NAME                ROLE                            AGE
kubeadm:get-nodes   ClusterRole/kubeadm:get-nodes   17d
[root@k8s-master01 ~]# kubectl get rolebinding -n kube-system | grep kubelet-config
[root@k8s-master01 ~]#

No RoleBinding for kubelet-config exists in kube-system, and that is the root cause. The missing RoleBinding has to be added by hand:

kubectl create rolebinding kubeadm:kubelet-config \
  --namespace=kube-system \
  --role=kubeadm:kubelet-config \
  --group=system:bootstrappers \
  --group=system:nodes

Before creating it, confirm that the Role it references exists:

kubectl get role kubeadm:kubelet-config -n kube-system

If the Role is missing as well, create it first:

kubectl create role kubeadm:kubelet-config \
  --namespace=kube-system \
  --verb=get \
  --resource=configmaps \
  --resource-name=kubelet-config

Then create the RoleBinding and go back to node03 to rejoin.

# Error 5
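
The same Role and RoleBinding can also be written declaratively, which avoids the ordering question and is safe to re-apply. This is a sketch that mirrors the two `kubectl create` commands above, with the same names and subject groups; the wrapper function `kubelet_config_rbac_manifest` is a hypothetical name used only to keep the manifest in one place.

```shell
# Print a manifest for the kubelet-config Role and RoleBinding in kube-system.
kubelet_config_rbac_manifest() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubeadm:kubelet-config
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["kubelet-config"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubeadm:kubelet-config
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kubeadm:kubelet-config
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:bootstrappers
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
EOF
}

# Usage on the control plane:
# kubelet_config_rbac_manifest | kubectl apply -f -
# Then check the permission as the bootstrap token user from the error message:
# kubectl auth can-i get configmaps/kubelet-config -n kube-system \
#   --as=system:bootstrap:2nspem --as-group=system:bootstrappers
```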

[root@k8s-node03 ~]# kubeadm join 192.168.51.8:6443 --token 2nspem.bg26kezi09o18hbs --discovery-token-ca-cert-hash sha256:08f52fb5689fc4f4e51e8b2aef1832cc5dbf63dda7a57b26956b259834e75258
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config.yaml' to re-upload it.
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get component configs: could not download the kubelet configuration from ConfigMap "kubelet-config": configmaps "kubelet-config" not found
To see the stack trace of this error execute with --v=5 or higher

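The error has changed: the kubelet-config ConfigMap itself does not exist now and has to be uploaded again.

Run on the master:

kubeadm init phase upload-config kubelet

Then confirm that the ConfigMap has been created:

kubectl get configmap kubelet-config -n kube-system

Then go back to node03 and rejoin.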
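
The two join failures in this walkthrough look alike but need different fixes: "forbidden" points at RBAC, while "not found" points at the missing ConfigMap. The sketch below maps each message to the fix used here. `suggest_join_fix` is a hypothetical helper, and it does plain substring matching on the two messages seen above, nothing more.

```shell
# Map a kubeadm join error message (stdin) to the fix used in this walkthrough.
suggest_join_fix() {
  msg=$(cat)
  case $msg in
    *'configmaps "kubelet-config" is forbidden'*)
      echo 'check Role/RoleBinding kubeadm:kubelet-config in kube-system' ;;
    *'configmaps "kubelet-config" not found'*)
      echo 'run: kubeadm init phase upload-config kubelet' ;;
    *)
      echo 'unrecognized error; rerun with --v=5' ;;
  esac
}

# Usage on the node (TOKEN and HASH are placeholders):
# kubeadm join 192.168.51.8:6443 --token "$TOKEN" \
#   --discovery-token-ca-cert-hash "$HASH" 2>&1 | suggest_join_fix
```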

Last updated: 2026/06/07, 07:08:16
