Issue Summary
In a K3s cluster (single- or multi-node), one node is intermittently unresponsive:
- `kubectl` times out or hangs when querying the cluster.
- `kubectl get nodes` shows the affected node as `NotReady`, sometimes flipping back to `Ready`.
- The node is unresponsive via the Kubernetes control plane but possibly still reachable via SSH.
Prerequisites
- SSH access to all nodes.
- `kubectl` access from the control node (or locally using `k3s kubectl`).
- Root/sudo access on the affected node.
- Node hostname or IP for targeting the investigation.
Troubleshooting Steps
1. Verify Node Status
Run from a working node:
```
kubectl get nodes -o wide
```
Expected behavior:
- One node shows `NotReady`, or its status changes between `Ready` and `NotReady`.
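Beyond `get nodes`, the node's condition details often name the cause (memory, disk, or PID pressure). A sketch, assuming a hypothetical node name `worker-1`:

```shell
#!/bin/sh
# Hypothetical node name; substitute your affected node.
node="worker-1"
# The Conditions block (Ready, MemoryPressure, DiskPressure, PIDPressure)
# carries a reason and message explaining why the kubelet reports NotReady.
kubectl describe node "$node" 2>/dev/null | grep -A 8 'Conditions:' \
  || echo "node $node not found or cluster unreachable"
```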
2. SSH Into the Affected Node
From any other node or your local system:
```
ssh <node-ip>
```
If login is slow, this could indicate:
- High CPU/memory
- I/O issues
- Network issues
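A hung node can stall your terminal indefinitely; a connection timeout keeps the check bounded. A sketch, with a placeholder IP address:

```shell
#!/bin/sh
# Placeholder address; substitute the affected node's IP.
node_ip="10.0.0.12"
# ConnectTimeout fails fast instead of hanging if sshd is unresponsive;
# BatchMode avoids stalling on a password prompt in scripts.
ssh -o ConnectTimeout=5 -o BatchMode=yes "root@$node_ip" uptime \
  || echo "ssh to $node_ip failed or timed out"
```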
3. Check System Resources
Once you have SSH access to the affected node, your first step is to check the node’s overall system health. The following commands help you identify if the issue is due to resource exhaustion or system-level bottlenecks.
3a. Check Who Is Logged In and What They're Doing
```
w
```
Explanation:
- The `w` command shows who is logged into the system and what they are doing.
- It provides useful details like system uptime, load average, and the processes each user is running.
- At the top, it shows load averages; if these numbers are high (e.g., above the number of CPU cores), your system might be under heavy load.
Example Output:
```
 15:05:12 up 10 days,  2:41,  2 users,  load average: 5.68, 5.33, 5.27
USER     TTY      FROM         LOGIN@   IDLE   JCPU   PCPU  WHAT
root     pts/0    10.0.0.5     13:02    1:25m  0.01s  0.01s -bash
```

3b. Check Real-Time Resource Usage
```
top
```
Explanation:
- Shows real-time CPU, memory, and process activity.
- Helps identify processes consuming excessive CPU or memory.
- The first few lines provide overall stats, while the lower section shows detailed info for each running process.
What to look for:
- %CPU or %MEM near 100 for a single process.
- Load average at the top - this should ideally be less than the total number of CPU cores.
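`top` is interactive; for a one-shot snapshot of the heaviest processes (e.g., to paste into a ticket), `ps` with sorting works as well. A sketch using the procps `--sort` syntax:

```shell
#!/bin/sh
# Keep the header line plus the top 5 processes.
limit=6
echo "== top by CPU =="
ps aux --sort=-%cpu | head -n "$limit"
echo "== top by memory =="
ps aux --sort=-%mem | head -n "$limit"
```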
3c. Check Memory Usage
```
free -m
```
Explanation:
- Displays the amount of used and available memory in megabytes.
- Shows RAM and swap usage.
What to check:
- If `available` memory is very low (under 100 MB), the node may be out of memory.
- If `swap` usage is high, this may indicate memory pressure.
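The `available` threshold can be checked mechanically. A sketch that reads `MemAvailable` from `/proc/meminfo` (the same figure `free -m` reports as available) and applies the 100 MB rule of thumb:

```shell
#!/bin/sh
# MemAvailable is reported in kB; convert to MB to match `free -m`.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
avail_mb=$((avail_kb / 1024))
if [ "$avail_mb" -lt 100 ]; then
  echo "WARNING: only ${avail_mb} MB available - node may be out of memory"
else
  echo "memory ok: ${avail_mb} MB available"
fi
```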
Example Output:
```
              total        used        free      shared  buff/cache   available
Mem:           2000        1800          50          10         150         100
Swap:          1024         900         124
```

3d. Check Disk Usage
```
df -h
```
Explanation:
- Shows disk space usage for all mounted filesystems in a human-readable format.
- Critical for checking whether `/`, `/var`, or `/var/lib/containerd` is full.
What to look for:
- If any mount point is at 100%, it can prevent Kubernetes and containerd from functioning properly.
- K3s stores a lot of container data under `/var/lib/rancher` or `/var/lib/containerd`.
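The full-disk case can be caught with a quick threshold check. A sketch using POSIX `df -P` output, checking `/var/lib/rancher` and falling back to `/` when that path is absent:

```shell
#!/bin/sh
# Check usage of the filesystem holding K3s data; fall back to /.
path=/var/lib/rancher
[ -d "$path" ] || path=/
# -P forces one-line-per-filesystem POSIX output; column 5 is Use%.
pct=$(df -P "$path" | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "$pct" -ge 90 ]; then
  echo "WARNING: $path filesystem at ${pct}% - clean up before it hits 100%"
else
  echo "disk ok: $path filesystem at ${pct}%"
fi
```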
Example Output:
```
Filesystem      Size  Used  Avail Use% Mounted on
/dev/sda1        40G   38G   500M  99% /
tmpfs           500M     0   500M   0% /dev/shm
```

3e. Check System Uptime and Load
```
uptime
```
Explanation:
- A quick summary of how long the system has been running and current CPU load averages.
- Load averages are shown for the last 1, 5, and 15 minutes.
How to interpret load average:
- If the number is higher than the total number of CPU cores, your system is overloaded.
- Example: A 4-core machine should ideally have a load average under 4.0.
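The core-count comparison can be scripted rather than eyeballed. A sketch comparing the 1-minute load average from `/proc/loadavg` against `nproc`:

```shell
#!/bin/sh
load=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
cores=$(nproc)                          # number of CPU cores
# awk does the floating-point comparison that plain sh cannot.
if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "overloaded: load $load exceeds $cores cores"
else
  echo "ok: load $load within $cores cores"
fi
```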
Example Output:
```
 15:20:32 up 10 days,  3:12,  2 users,  load average: 6.11, 6.55, 6.80
```

4. Check K3s Service Status and Logs
K3s service:
```
sudo service k3s status
```
Check K3s logs:
```
sudo journalctl -u k3s-agent -n 100
```
(Use `-u k3s` instead of `-u k3s-agent` on server nodes.) Look for:
- Frequent restarts
- Segfaults
- Network/disk timeout logs
- Out-of-memory events
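Those patterns can be pulled out of a longer log window in one pass. A sketch, assuming the `k3s-agent` unit (use `k3s` on server nodes):

```shell
#!/bin/sh
unit=k3s-agent   # use "k3s" on server nodes
pattern='restart|segfault|timeout|out of memory|oom'
if command -v journalctl >/dev/null 2>&1; then
  sudo journalctl -u "$unit" --since "1 hour ago" --no-pager 2>/dev/null \
    | grep -iE "$pattern" || echo "no matching events in the last hour"
else
  echo "journalctl not found"
fi
```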
5. Restart Services (If Needed)
If resource usage is normal but kubelet/k3s-agent logs show issues:
```
sudo service k3s stop
sudo service k3s start
```
Wait a few seconds and re-check node status from another machine:
```
kubectl get nodes
```

6. Review Kernel/System Logs
On the affected node:
```
dmesg | tail -n 50
sudo journalctl -xe
```
Look for:
- OOM (Out of Memory) killer messages
- Kernel panics or segfaults
- Disk or network I/O errors
7. Drain and Reboot the Node (If Safe)
This process is typically done for maintenance, troubleshooting, or upgrades. It’s important to ensure no critical workloads are impacted before performing these steps.
7a. Drain the Node
```
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```
- This command evicts all pods from the specified node so that it can be safely rebooted or updated.
- `--ignore-daemonsets`: Prevents errors by not trying to evict DaemonSet-managed pods, which usually run on all nodes (like logging or monitoring agents).
- `--delete-emptydir-data`: Deletes data in pods that use `emptyDir` volumes (which are ephemeral), since these volumes will be lost anyway during the node restart.

Note
This command will cordon the node (mark it unschedulable) automatically.
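Before rebooting, it is worth confirming the drain actually emptied the node; only DaemonSet-managed (and static) pods should remain. A sketch with a hypothetical node name:

```shell
#!/bin/sh
node="worker-1"   # hypothetical; substitute your node name
# List whatever is still scheduled on the node after the drain.
kubectl get pods --all-namespaces \
  --field-selector "spec.nodeName=$node" 2>/dev/null \
  || echo "cluster unreachable or kubectl not found"
```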
7b. Reboot the Node
```
sudo reboot
```
Explanation:
- This command restarts the node (server/VM) at the OS level.
- It is used for applying OS updates, resolving stuck resources, or resetting node state.
7c. Uncordon the Node
```
kubectl uncordon <node-name>
```
- After the node comes back online, this command marks it as schedulable again, allowing pods to be placed back on it by Kubernetes.
- Without this, the node will remain “cordoned” (unschedulable), and no new pods will be scheduled on it.
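After uncordoning, confirm the node is both `Ready` and schedulable before calling the incident resolved. A sketch with a hypothetical node name:

```shell
#!/bin/sh
node="worker-1"   # hypothetical; substitute your node name
# Wait up to 2 minutes for the kubelet to report Ready, then confirm
# the SchedulingDisabled marker is gone from the STATUS column.
kubectl wait --for=condition=Ready "node/$node" --timeout=120s 2>/dev/null \
  && kubectl get node "$node" \
  || echo "node $node not Ready yet - re-check logs on the node"
```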