Surviving OOM in Kubernetes: Java Applications

Yonahdissen
4 min read · Feb 20, 2024


In the realm of Java applications, the OutOfMemoryError (OOM) issue is a challenge, occurring when the application exhausts its allocated memory. Traditionally, when dealing with OOM in a standard Java setup, the process involves triggering a heap dump—a snapshot of the application’s memory at a specific point in time. This diagnostic tool helps developers pinpoint memory-related issues by providing insights into the state of the application’s memory usage.

In Kubernetes, when a Java app faces an OOM, the platform kicks in with a restart to keep things running smoothly. Unlike traditional setups, where OOM could lead to a complete shutdown, Kubernetes automatically bounces back, ensuring continuous availability. This auto-restart feature adds a layer of complexity when trying to capture heap dumps, requiring a way to catch the heap dump before the pod is deleted.

Approach 1:

Trigger a heap dump before the OOM happens.

This can be done in multiple ways, one of them being Prometheus metrics.

You can choose a threshold, e.g. 90% of the JVM heap, and when that threshold is crossed, trigger a heap dump via a webhook or a tool such as Robusta.
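As a rough sketch of that threshold alert, assuming the application exposes Micrometer/JMX-exporter style metrics named jvm_memory_used_bytes and jvm_memory_max_bytes and that the prometheus-operator CRDs are available, a rule could look like this (the rule name, labels and the pod label are illustrative):

# Illustrative PrometheusRule; metric names assume a Micrometer/JMX exporter
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: jvm-heap-alerts
spec:
  groups:
    - name: jvm-heap
      rules:
        - alert: JvmHeapNearLimit
          expr: |
            sum(jvm_memory_used_bytes{area="heap"}) by (pod)
              / sum(jvm_memory_max_bytes{area="heap"}) by (pod) > 0.9
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "JVM heap above 90% on {{ $labels.pod }}"

The firing alert can then be routed through Alertmanager to the webhook (or Robusta automation) that actually triggers the dump.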

Cons:

  • Requires developing an external webhook or installing Robusta in your cluster
  • Requires having jmap in your application image or allowing ephemeral containers, neither of which is a common option in production

Approach 2:

Create a heap dump when the pod gets an OOM.

But wait, didn’t we just say the pod will get restarted when the JVM hits its max memory?

That’s where some JVM parameters come in handy. By setting the following flag in JAVA_OPTS we can have the JVM write a heap dump the moment the OutOfMemoryError is thrown, before the container exits.

JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError"
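In Kubernetes this is usually wired in through the container’s environment. A minimal sketch, assuming the image’s entrypoint reads JAVA_OPTS (a convention of many base images; JAVA_TOOL_OPTIONS is an alternative that the JVM itself picks up):

# Illustrative container snippet from a Deployment; the image name is a placeholder
containers:
  - name: app
    image: my-registry/my-java-app:latest
    env:
      - name: JAVA_OPTS
        value: "-XX:+HeapDumpOnOutOfMemoryError"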

Great! Now we have a heap dump that is created inside the pod, but that isn’t much help to us as it will get deleted when the pod restarts.

To persist our heap dump for later analysis, we’ll take a look at a couple of options.

Sidecar:

For every pod containing a Java application we can have a sidecar container that shares a volume with the application container. The job of the sidecar would be to take the heap dump and upload it to wherever we would like to persist our heap dumps.

In our JAVA_OPTS we would add something like this to put the heap dump in the shared volume:

-XX:HeapDumpPath=/etc/shared-path/${pod_name}.hprof

The architecture would look something like this:

Sidecar architecture
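A rough sketch of such a pod, assuming an emptyDir volume shared at /etc/shared-path and a hypothetical uploader image for the sidecar:

# Illustrative pod spec: the app and the sidecar share an emptyDir volume;
# image names and the uploader itself are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: my-java-app
spec:
  volumes:
    - name: heap-dumps
      emptyDir: {}
  containers:
    - name: app
      image: my-registry/my-java-app:latest
      env:
        - name: JAVA_OPTS
          value: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/etc/shared-path/dump.hprof"
      volumeMounts:
        - name: heap-dumps
          mountPath: /etc/shared-path
    - name: dump-uploader
      image: my-registry/dump-uploader:latest
      volumeMounts:
        - name: heap-dumps
          mountPath: /etc/shared-path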

Cons:

  • Code maintenance - every pod will now need a sidecar and a mount shared between the two containers (we can do this with a Helm library)
  • CPU and memory usage - each sidecar, even if idle, has some footprint, and when running many pods this quickly adds up

These two cons lead us to a different approach.

Network volume (e.g. NFS):

A single NFS volume can be created and mounted on all the cluster nodes. This lets us mount that host path into the relevant pods and direct the heap dumps to it.

-XX:HeapDumpPath=/etc/efs-mount/${pod_name}.hprof
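A minimal sketch of the pod side, assuming the NFS/EFS share is already mounted on every node at /mnt/efs (paths and image names are placeholders):

# Illustrative snippet from the app pod spec: the node-level NFS mount is
# exposed to the container at /etc/efs-mount via a hostPath volume
volumes:
  - name: heap-dumps
    hostPath:
      path: /mnt/efs
      type: Directory
containers:
  - name: app
    image: my-registry/my-java-app:latest
    volumeMounts:
      - name: heap-dumps
        mountPath: /etc/efs-mount

An NFS-backed PersistentVolume with ReadWriteMany access would be an equally valid way to wire in the same mount.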

Now all that’s left is a separate pod that runs constantly, monitors the NFS volume, and copies any new dumps to our destination (e.g. an S3 bucket).

The architecture would now look like this:

Network Volume architecture

The monitoring pod can be a small Python app like this:

import boto3
import os
import logging
import time

DUMP_FOLDER = '/etc/efs-mount'
BUCKET_PATH = 'heap-dumps'
BUCKET_NAME = 'my-app-heap-dumps'


def main():
    # List whatever landed on the shared NFS mount since the last run
    files_list = os.listdir(DUMP_FOLDER)
    s3 = boto3.resource('s3')

    if len(files_list) == 0:
        logging.info("No heap dumps to upload")
    else:
        for file in files_list:
            full_file_path = '{}/{}'.format(DUMP_FOLDER, file)
            logging.info("Uploading {} to {}".format(file, BUCKET_NAME))
            # Upload to s3://<BUCKET_NAME>/<BUCKET_PATH>/<file>, then delete
            # the local copy so the NFS volume stays small
            s3.meta.client.upload_file(full_file_path, BUCKET_NAME,
                                       '{}/{}'.format(BUCKET_PATH, file))
            os.remove(full_file_path)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Poll the shared volume every 5 minutes
    while True:
        main()
        time.sleep(300)

This solution is cheaper, as there is only one idle resource (the monitoring pod) and the mounted storage is essentially free (e.g. EFS), since you never store very much and never for more than a few minutes.

Caveats:

  • For authenticating against cloud storage (e.g. an S3 bucket) I recommend using something like kube2iam.
  • Name the heap dumps with the pod name like in the example. The tricky part is that the ${pod_name} expansion needs to happen in the pod’s entrypoint and not in the manifest args (a sketch follows after this list).
  • This does not address the OOM kill that comes from reaching the pod’s memory limits (as opposed to the JVM heap limit).
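As a sketch of that naming caveat, assuming the pod name is exposed through the downward API and the container is started via a shell entrypoint (the jar path and image are placeholders):

# Illustrative container snippet: POD_NAME comes from the downward API and is
# expanded by the shell at startup, so every dump carries the pod's name
containers:
  - name: app
    image: my-registry/my-java-app:latest
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    command: ["sh", "-c"]
    args:
      - >-
        exec java -XX:+HeapDumpOnOutOfMemoryError
        -XX:HeapDumpPath=/etc/efs-mount/${POD_NAME}.hprof
        -jar /app/app.jar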

In summary, there are multiple ways of catching JVM OOMs; just choose the one that fits your environment best.
