“We think we have a memory leak.” Words no developer wants to hear.
Memory leaks can be some of the most difficult bugs to solve in software development. They lie in wait during testing, only to rear their ugly heads in a highly active production environment. Eating through memory, they force customers to employ workarounds such as nightly restarts to keep their systems stable. They are unwelcome guests at the party. So how do we kick them out?
The engineering team at Itential recently ran into just such an issue. A high-profile customer was experiencing memory problems when running their automations at full tilt: memory usage would grow until there was no more memory to be allocated and the system would crash. Thankfully the leak was gradual, but it still forced the customer to put the aforementioned nightly restarts in place. We had some information from the customer and our quality engineers, but it was still a mystery what was causing the leak. We needed to isolate the problem, reproduce it reliably, and gather metrics to show the problem was solved.
Isolate the Problem
The first thing we needed to answer was, “What do we think is causing the problem?” The customer thought it might have something to do with the high volume of API calls they were making. Well, that is a start, but the Itential Automation Platform (IAP) has many different API calls. It could be any one of hundreds. Thankfully our quality engineers had run some tests and were able to make the leak occur by continuously calling one of our status APIs. In particular, `http://localhost:3000/status` would do the trick when called hundreds of times in a row. So, what does this call do?
Basically, this status API reaches out to certain services in the IAP ecosystem, queries their health, and returns the aggregated health status of those services. What does that look like behind the scenes? When the API is called, the route is handled by the Express web server and a call is made into the IAP core system. The core then makes calls out to the monitored services. These are network calls, so some context data must be saved while they are in flight. Once the calls return, the context is cleaned up and the status is passed back to whoever requested it. Or at least that is what should happen.
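To make that flow concrete, here is a minimal sketch of the happy path described above, assuming an Express route that stashes per-request context in a map while the health checks are in flight. Apart from `handleMap`, the names and structure here (such as `checkServiceHealth` and the list of services) are illustrative, not the actual IAP source.

    // Minimal sketch of the status flow -- illustrative only, not IAP code.
    const express = require('express');
    const crypto = require('crypto');

    const app = express();

    // Context for in-flight calls to monitored services, keyed by a generated handle.
    const handleMap = new Map();

    // Stand-in for the real network call out to a monitored service.
    async function checkServiceHealth(service) {
      return { service, healthy: true };
    }

    app.get('/status', async (req, res) => {
      const handle = crypto.randomUUID();
      handleMap.set(handle, { startedAt: Date.now() });
      try {
        const results = await Promise.all(['core', 'jobs', 'adapters'].map((s) => checkServiceHealth(s)));
        res.json({ status: 'ok', results });
      } finally {
        // Once the calls return, the context should be cleaned up.
        handleMap.delete(handle);
      }
    });

    app.listen(3000);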
Knowing all this, we thought we had a good idea of what the problem might be and began creating a method to reliably reproduce the issue. There was one caveat, though, that one of our quality engineers put forth: they claimed the memory leak would occur even without the network calls from core. Interesting. Time to do some testing of our own.
Reproduce the Problem
Creating a method to reproduce the problem turned out to be fairly straightforward. We wrote a little bash script that would hammer the API on our local environments.
curl -X GET "http://127.0.0.1:3000/status?test=[1-2500]" -H 'Content-Type: application/json' -H 'cache-control: no-cache'
We monitored the memory usage and watched it climb steadily while the script ran. Once the script completed, we called the API again while the system was not under load, and the memory was never reclaimed. In this case, that is good news: it looked like we had reproduced the leak locally.
But what about the quality engineer’s claim that the leak occurred even if the call did not go across the network? Well, now we had a mechanism in place to test that claim. To do this, we essentially returned from the API call before the call into the core system was made, thereby avoiding the network calls and context management. Rerunning the same test as before, we saw no memory leak.
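In terms of the sketch above, the temporary change for this test amounted to something like the following hypothetical handler, which responds before any call into core is made, so no network calls happen and no context is saved:

    // Hypothetical truncated handler used only for the control test.
    app.get('/status', (req, res) => {
      // Short-circuit before calling into core: no network calls, no entry in handleMap.
      res.json({ status: 'ok', note: 'truncated for leak testing' });
    });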
This was good news! With the same call we could now create two different states in the system: one desired, the other most definitely undesired.
Gather Metrics
So how do we gather metrics on this memory leak? Thankfully there are plenty of tools available today. The Chrome web browser is one that fits the job: it can inspect NodeJS processes, which allows you to gather metrics, debug code, and, more importantly for our needs right now, analyze memory and the garbage collector.
To do this, we started by running IAP in debug mode with the following command:
node --inspect server.js
This starts our NodeJS process with a debugger listening for connections. Next, we need to use Chrome to attach to this debugger.
We open Chrome and enter the following in the URL bar:
chrome://inspect
We are presented with the following UI:
Under the “Remote Target” section we see our server.js process waiting to be inspected. After clicking inspect, Chrome attaches to the process and presents us with many ways to analyze it. One of those has to do with memory. We click the Memory tab and see the following options:
Here we select “Allocation instrumentation on timeline” and check the box “Record allocation stacks”. This will allow us to watch our memory allocations over time and record where they are being made. With this tool in hand, we can now run our two tests and get a snapshot of memory for each test.
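As a side note, if attaching a browser is inconvenient (on a locked-down server, for example), recent Node.js versions can also write a heap snapshot to disk from code via the built-in `v8` module, and the resulting file can be loaded into the same Memory tab. We did not need that here, but it captures the same kind of data.

    // Alternative capture method, assuming a Node.js version that ships v8.writeHeapSnapshot().
    const v8 = require('v8');

    // Writes a .heapsnapshot file to the working directory and returns its file name.
    const filename = v8.writeHeapSnapshot();
    console.log(`Heap snapshot written to ${filename}`);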
1. Truncated Status API call (no network context)
Here things look pretty normal so far. There are a lot of object allocations, but no one object seems to be running away with memory. On to the second test.
2. Full status API call (with network context)
Things look a lot different in this second snapshot compared to the first. We have over twice as many object allocations as before (36,642 vs. 16,423), and the objects at the top of the list have changed. What are these new objects at the top of the pile? Clicking on one of the entries, as I have in the picture above (000bb56d), you can see in the Retainers window at the bottom that it is an object being retained by the “handleMap”. The “handleMap” holds the context data when making calls over the network, which is exactly what we do in the status call. That is most likely our culprit.
After looking at some of the history around the code that handles that map, we did indeed discover that the cleanup of unused objects in the map had been lost during a refactor. This allowed the map to grow unbounded and eventually crash the system. We restored the logic that cleaned up unused handles, reran our tests, compared the memory snapshots, and found that things looked like the image in test case 1 again. With this data, we were confident we had a solution to the leak.
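Reduced to a sketch (again with hypothetical names rather than the actual IAP source), the difference between the leaky code and the repaired code was essentially this:

    // Illustrative reconstruction of the bug and the fix -- not the real IAP code.
    const handleMap = new Map();

    // Shape of the code after the refactor: the context entry is added but never
    // removed, so every status call leaves a little more behind in the map.
    async function getStatusLeaky(handle, healthCalls) {
      handleMap.set(handle, { startedAt: Date.now() });
      return Promise.all(healthCalls);
    }

    // Restored behavior: the entry is deleted once the calls settle, even on error.
    async function getStatusFixed(handle, healthCalls) {
      handleMap.set(handle, { startedAt: Date.now() });
      try {
        return await Promise.all(healthCalls);
      } finally {
        handleMap.delete(handle);
      }
    }

Wrapping the cleanup in a `finally` block keeps the map bounded even when one of the health checks fails.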
The solution was tested in our User Acceptance Testing (UAT) environment and found to be effective: memory stayed flat over a long period of testing. The fix was then applied to the customer’s environment, and again memory was stable. We had found the cause of the leak and kicked out the unwelcome guest.
The ability to isolate the problem, reproduce it, and gather metrics is crucial in finding memory leaks and debugging in general. Following those steps allows you to systematically track down the cause of an issue and verify whether or not your solution works.