This guide provides a step-by-step process to diagnose and resolve high memory issues causing NextGen Gateway pods to crash in a Kubernetes environment. It includes commands to check the pod status, identify memory-related issues, and implement solutions to stabilize the pod.
Verifying Memory Usage
To verify memory usage in Kubernetes pods, ensure that the metrics server is enabled in your Kubernetes cluster. The "kubectl top" command retrieves point-in-time snapshots of resource utilization for pods or nodes in the cluster.
Check Pod Memory Usage
Run the following command to check the memory usage of the pod:
$ kubectl top pods
NAME                        CPU(cores)   MEMORY(bytes)
nextgen-gw-0                48m          1375Mi
nextgen-gw-redis-master-0   11m          11Mi
Check Container Memory Usage
Run the following command to check the memory usage of each container in the pods:
$ kubectl top pods --containers
POD                         NAME           CPU(cores)   MEMORY(bytes)
nextgen-gw-0                nativebridge   0m           6Mi
nextgen-gw-0                postgres       5m           83Mi
nextgen-gw-0                vprobe         46m          633Mi
nextgen-gw-redis-master-0   redis          13m          11Mi
Check Node Memory Usage
Run the following command to check the memory usage of the node:
$ kubectl top nodes
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
nextgen-gateway   189m         9%     3969Mi          49%
Understanding Pod Crashes Due to High Memory Usage
The NextGen Gateway pod in a Kubernetes cluster crashes due to high memory usage.
Possible Causes
When a pod exceeds its allocated memory, Kubernetes automatically kills the process to protect the node’s stability, resulting in an OOMKilled (Out of Memory Killed) error. This is particularly critical for the NextGen Gateway, as it may affect the stability and monitoring capabilities of the OpsRamp platform.
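Past OOMKilled terminations can also be surfaced without reading full event logs. The sketch below is an illustration, not part of the product tooling: the filter_oomkilled helper name is an assumption, and the kubectl invocation in the comment assumes your current context points at the gateway cluster.

```shell
# filter_oomkilled: reads "container<TAB>last-termination-reason" lines on
# stdin and prints only the containers whose last termination was OOMKilled.
filter_oomkilled() {
  awk -F'\t' '$2 == "OOMKilled" {print $1}'
}

# Typical use (assumes kubectl is configured for the gateway cluster):
#   kubectl get pods -o jsonpath='{range .items[*].status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
#     | filter_oomkilled
```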
Troubleshooting Steps
Follow these steps to diagnose and resolve memory issues for the NextGen Gateway pod:
- Check the status of the Kubernetes objects to determine whether the pods are running.
- Gather detailed information about the pod by running the following command. This will provide the status, restart count, and the reason for any previous restarts:
  Example:
  kubectl describe pod <pod_name>
  kubectl describe pod nextgen-gw-0
- Examine memory-related termination reasons in the pod's event logs.
  Sample Log Output:
  vprobe:
    Container ID:   containerd://40c8585cf88dc7d0dd4e43560dc631ef559b0c92e6d5d429719a384aaea77777
    Image:          us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe:17.0.0
    Image ID:       us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe@sha256:8de1a98c3c14307fa4882c7e7422a1a4e4d507d2bbc454b53f905062b665e9d2
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 29 Jan 2024 12:01:30 +0530
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 29 Jan 2024 12:00:42 +0530
      Finished:     Mon, 29 Jan 2024 12:01:29 +0530
    Ready:          True
    Restart Count:  1
- Confirm the memory issue by checking the exit code.
An exit code of 137 indicates that the container was killed with SIGKILL, which in this context means the pod crashed due to a memory issue (OOMKilled).
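The exit-code check can be pulled out of the describe output with a small filter. This is a sketch that assumes the standard kubectl describe layout; last_exit_code is a hypothetical helper name, not an existing command.

```shell
# last_exit_code: print the first "Exit Code" value found in
# `kubectl describe pod` output read from stdin.
last_exit_code() {
  awk -F': *' '/Exit Code/ {gsub(/ /, "", $2); print $2; exit}'
}

# Typical use:
#   kubectl describe pod nextgen-gw-0 | last_exit_code
```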
Resolution for Memory Issues
To resolve the memory issue and prevent further pod crashes, take the following actions:
- Decrease the load on the NextGen Gateway by limiting the number of metrics being processed.
- Adjust the memory limits for the NextGen Gateway pod, ensuring it has sufficient memory to handle the required load without crashing. For detailed instructions on modifying the memory limits, refer to the Update Memory Limits for NextGen Gateway section.
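As an illustration of what an adjusted limit looks like, the fragment below shows a standard Kubernetes container resources stanza. The 1Gi/3Gi figures are placeholders, not recommendations; follow the Update Memory Limits for NextGen Gateway section for the supported values and procedure.

```yaml
# Illustrative only: example resources stanza for a gateway container.
# The request/limit values here are assumptions, not sizing guidance.
resources:
  requests:
    memory: "1Gi"   # guaranteed baseline for scheduling
  limits:
    memory: "3Gi"   # ceiling; exceeding this triggers OOMKilled (exit code 137)
```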