High CPU usage with RHEL9 and nginx at 20Gbps

Mon Nov 21 15:07:13 UTC 2022

Hello!

On Mon, Nov 21, 2022 at 01:03:25PM +0100, Marcin Wanat wrote:

> Hi,
> 
> i have a problem with worker high CPU usage on recent Rocky Linux 9 with
> nginx setup.
> 
> I have 10 identical servers, each with Mellanox ConnectX-5 dual 25Gbps
> NIC, 6x Samsung PM983 NVMe drives, 512GB RAM and Epyc 7402 CPU.
> 
> 9 of them are using RockyLinux 8.7 (some with stock, some with
> 5.18.10-1.el8.elrepo.x86_64 kernel) with nginx and passing ~20Gbps of
> SSL/http2 traffic using 24workers and reuseport config option. Each
> worker have about 10-12% CPU usage and nginx worker strace looks like
> this (normal):
> 
> strace: Process 73930 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  72.59    0.018036           1      9329       195 write
>  15.61    0.003879           9       400           io_submit
>   5.70    0.001416           2       681           epoll_wait
>   2.93    0.000729           1       569        88 read
>   1.81    0.000449           1       394           io_getevents
>   0.81    0.000202           3        51           sendto
>   0.52    0.000129           1       106           fcntl
>   0.02    0.000005           5         1           shutdown
>   0.01    0.000002           1         2           epoll_ctl
>   0.00    0.000000           0         5           getpid
>   0.00    0.000000           0         1           recvfrom
>   0.00    0.000000           0         1           setsockopt
>   0.00    0.000000           0         1           accept4
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.024847           2     11541       283 total
> 
> 
> On 10th server we are now testing Rocky Linux 9 configuration and each
> worker is using about 60-80% of CPU with following strace from nginx
> worker:
> 
> strace: Process 1966 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  88.44    0.461237        1130       408           io_submit
>  10.68    0.055674           2     18658       147 write
>   0.25    0.001283           1       655       191 read
>   0.20    0.001033           2       435           epoll_wait
>   0.18    0.000914           5       156           fcntl
>   0.08    0.000420           1       306           futex
>   0.07    0.000378           5        65           sendto
>   0.07    0.000348           1       271           io_getevents
>   0.02    0.000085           9         9           openat
>   0.01    0.000037          12         3           accept4
>   0.01    0.000034           8         4           close
>   0.00    0.000023           2         9           newfstatat
>   0.00    0.000021           0        23           getpid
>   0.00    0.000015           2         6           setsockopt
>   0.00    0.000008           1         6           epoll_ctl
>   0.00    0.000006           2         3           recvfrom
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.521516          24     21017       338 total
> 
> 
> As you can see, it is spending a lot of time in io_submit. It has 1130
> usec/call while RL8 servers have only 9 usec/call in io_submit.
> 
> Only thing that has changed is RL8->RL9. All servers have identical
> hardware. All servers are serving almost identical amount of traffic
> ~19-20Gbps from identical files with identical number of connections.
> Iostat on all servers shows about 50% util on each drive, similar
> rKB/s, r/s and rareq-sz so disk IO should not be a problem.
> 
> nginx-1.22.1 compiled with:
> ./configure --prefix=/usr/local/nginx --with-http_mp4_module
> --with-http_secure_link_
> module  --with-http_stub_status_module
> --with-http_ssl_module --with-http_v2_module --with-pcre
> --with-file-aio --with-threads --with-cc-opt=' -DTCP_FASTOPEN=23'
> --with-http_sub_module
> Important parts of nginx.conf:
>     worker_processes  24;
>     worker_cpu_affinity auto;
>     worker_rlimit_nofile 81920;
>     events {
>         worker_connections   2000;
>         use epoll;
>     }
> 
>     sendfile off;
>     aio on;
>     directio 4096;
>     directio_alignment 4k;
>     tcp_nodelay      on;
> 
> Any ideas what could be causing the problem?

A lot of time in io_submit() might indicate that AIO read requests 
are blocking for some reason.

Number of the io_getevents() calls also seems to indirectly 
confirm this assumption: note that there were 394 io_getevents() 
calls for 400 io_submit() calls, and on RL9 there are only 271 
io_getevents() calls for 408 io_submit() calls, so most of the 
io_getevents() calls returned more than one event.

This used to be the case when directio is not used, though in your 
configuration directio seems to be correctly enabled.  Some 
changes in the kernel and/or OS configuration might be responsible 
for similar results though.  It might be non-trivial to find out 
what exactly caused this though.

First of all, I would recommend to check if file systems on 
different servers are identical.

Also it might make sense to play with output_buffers to make sure 
it's not the too small buffer size which degrades things, with 
directio_alignment to see if there are any potential alignment 
changes, and to play with worker_aio_requests to see if there is 
enough room for outstanding AIO requests.

-- 
Maxim Dounin
http://mdounin.ru/