Debugging Nginx Memory Spikes on Production Servers

Lance Dockins lance at wordkeeper.com
Thu Sep 21 03:37:10 UTC 2023


Thank you, Maxim.

I've been doing some testing since I reached out earlier, and I'm not sure whether I'm looking at a memory leak in Nginx/NJS or at some quirk in how memory stats are reported by Nginx. What I do know is that my testing looks like a memory leak: under the right conditions, I've seen what appears to be a single Nginx worker process run away with its memory use until my OOM monitor terminates it (which also seems to have some connection to memory use and file I/O). While trying to use buffers for large file reads in NJS, I started noticing strange memory behavior in even basic file operations.

To keep a long story short, I use NJS to control some elements of Nginx, and it seems like any form of file I/O in NJS causes NJS to leak memory. I'm not using many Nginx modules to begin with, but to reduce the potential for third-party module problems, I recompiled Nginx with nothing but Nginx and NJS. I'm using Nginx 1.23.4 and NJS 0.8.1, but I've seen the same behavior with earlier versions of both.

I've tried several different tests and I see the same thing in all variations: any form of repeated file I/O "seems" to be leaking memory. Here is some sample code that I used in one test.

In the http block, I've imported a test.js script that I then use to set a variable with js_set:
js_set $test test.test;

At the top of the server block, after the minimum set of needed server definitions (server_name, etc.):
if ($test = 1) { return 200; }
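
Put together, the relevant parts of the configuration look roughly like this (the listen port and server_name are just placeholders):

    http {
        js_import test.js;          # module becomes available as "test"
        js_set $test test.test;     # test() runs when $test is first evaluated

        server {
            listen 80;
            server_name example.com;

            if ($test = 1) {
                return 200;
            }

            # ... rest of the server block ...
        }
    }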

In the test.js file:
function test(r) {
    let i = 0;
    while (i < 500) {
        i++;
        r.log(njs.memoryStats.size);
    }
    return 1;
}

export default {test};

Checking the memory use reported in the info logs after this shows the following.

Start of loop:
2023/09/20 21:42:15 [info] 1394272#1394272: *113 js: 32120
2023/09/20 21:42:15 [info] 1394272#1394272: *113 js: 40312

End of loop:
2023/09/20 21:42:15 [info] 1394272#1394272: *113 js: 499064
2023/09/20 21:42:15 [info] 1394272#1394272: *113 js: 499064

If you increase the loop count, it just keeps growing. Here's the end of the loop with 10,000 iterations:
2023/09/20 21:57:04 [info] 1404965#1404965: *4 js: 4676984
2023/09/20 21:57:04 [info] 1404965#1404965: *4 js: 4676984

The moment that I move the r.log call out of the loop, the start/end memory use is about the same as the start-of-loop numbers above, so this seems to correlate with the amount of data being written to the file. Given that Nginx log writes are supposed to be buffered according to the Nginx docs, I would expect the maximum memory used during log writes to cap out at some much lower value. We're not specifying a buffer size, so the default of 64k should apply here, yet by the end of the test loop above we're sitting at either 0.5 MB or 4.6 MB, depending on which loop size (500 vs. 10,000 iterations) we're looking at.
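
To be concrete, this is the variant I mean by moving the logging out of the loop:

    function test(r) {
        let i = 0;
        while (i < 500) {
            i++;
        }
        // With the single r.log call after the loop, the reported size stays
        // close to the start-of-loop figures above.
        r.log(njs.memoryStats.size);
        return 1;
    }

    export default {test};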

The problem is that I'm actually trying to sort out a memory issue that I think has to do with large file reads rather than writes. Since I'm seeing this kind of high memory use just from writing to log files while testing, it looks as though the problem affects both file reads and file writes, and I can't tell whether buffered file reads actually use less memory than reading the entire file into memory. A buffered read "should" use less total memory, but since the end memory stats look the same either way in any test I run, I can't verify it.
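
For reference, here is the sort of comparison I've been trying to make. This is only a sketch: it assumes the fs.promises.open() / FileHandle.read() API and the global Buffer object that recent njs releases document (with a Node-style { bytesRead } result from read()), and /tmp/bigfile is just a placeholder path.

    import fs from 'fs';

    // Whole-file read: the entire file contents end up in memory at once.
    function readWhole(r) {
        const data = fs.readFileSync('/tmp/bigfile');
        r.log('whole read: ' + data.length + ' bytes, njs size: ' + njs.memoryStats.size);
    }

    // Chunked read: reuse one fixed-size buffer so that, in theory, the
    // working set stays near the buffer size rather than the file size.
    async function readChunked(r) {
        const fh = await fs.promises.open('/tmp/bigfile', 'r');
        const buf = Buffer.alloc(64 * 1024);
        let total = 0;
        while (true) {
            const { bytesRead } = await fh.read(buf, 0, buf.length, null);
            if (bytesRead === 0) {
                break;
            }
            total += bytesRead;
        }
        await fh.close();
        r.log('chunked read: ' + total + ' bytes, njs size: ' + njs.memoryStats.size);
    }

    export default {readWhole, readChunked};

In my actual config this kind of read would be driven from a js_content handler rather than js_set, since it's async, but the handler wiring isn't really the point here.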

I've seen the exact same memory behavior with fs.appendFileSync. So regardless of whether I use r.log, r.error, or fs.appendFileSync to write to a file that isn't a default Nginx log file, I get output that suggests a memory leak, so it isn't specific to log file writes.
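
For completeness, the fs.appendFileSync variant of the test looks roughly like this (the file path is just a placeholder):

    import fs from 'fs';

    function test(r) {
        let i = 0;
        while (i < 500) {
            i++;
            // Append to an arbitrary file instead of the nginx error log.
            fs.appendFileSync('/tmp/njs-memtest.log', njs.memoryStats.size + '\n');
        }
        return 1;
    }

    export default {test};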

I realize that these test cases aren't necessarily realistic, since large batches of file writes (or just large file writes) from NJS are likely to be far less common than large file reads. Either way, whether it's a large file read that isn't confining its memory footprint to the buffer it's assigned or file writes doing the same, it seems like a problem.

So I guess my question at the moment is whether the endless growth reported by njs.memoryStats.size after file writes is some sort of false positive tied to quirks in how memory use is reported, or whether it indicates a real memory leak. Any insight would be appreciated.

—
Lance Dockins

> On Wednesday, Sep 20, 2023 at 2:07 PM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Wed, Sep 20, 2023 at 11:55:39AM -0500, Lance Dockins wrote:
>
> > Are there any best practices or processes for debugging sudden memory
> > spikes in Nginx on production servers? We have a few very high-traffic
> > servers that are encountering events where the Nginx process memory
> > suddenly spikes from around 300mb to 12gb of memory before being shut down
> > by an out-of-memory termination script. We don't have Nginx compiled with
> > debug mode and even if we did, I'm not sure that we could enable that
> > without overly taxing the server due to the constant high traffic load that
> > the server is under. Since it's a server with public websites on it, I
> > don't know that we could filter the debug log to a single IP either.
> >
> > Access, error, and info logs all seem to be pretty normal. Internal
> > monitoring of the Nginx process doesn't suggest that there are major
> > connection spikes either. Theoretically, it is possible that there is just
> > a very large sudden burst of traffic coming in that is hitting our rate
> > limits very hard and bumping the memory that Nginx is using until the OOM
> > termination process closes Nginx (which would prevent Nginx from logging
> > the traffic). We just don't have a good way to see where the memory in
> > Nginx is being allocated when these sorts of spikes occur and are looking
> > for any good insight into how to go about debugging that sort of thing on a
> > production server.
> >
> > Any insights into how to go about troubleshooting it?
>
> In no particular order:
>
> - Make sure you are monitoring connection and request numbers as
> reported by the stub_status module as well as memory usage.
>
> - Check 3rd party modules you are using, if there are any - try
> disabling them.
>
> - If you are using subrequests, such as with SSI, make sure these
> won't generate enormous number of subrequests.
>
> - Check your configuration for buffer sizes and connection limits,
> and make sure that your server can handle maximum memory
> allocation without invoking the OOM Killer, that is:
> worker_processes * worker_connections * (total amount of various
> buffers as allocated per connection). If not, consider reducing
> various parts of the equation.
>
> Hope this helps.
>
> --
> Maxim Dounin
> http://mdounin.ru/
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> https://mailman.nginx.org/mailman/listinfo/nginx