[PATCH] Added asm ISB as asm pause for ngx_cpu_pause() for aarch64
Maxim Dounin
mdounin at mdounin.ru
Wed Dec 13 00:16:46 UTC 2023
Hello!
On Mon, Dec 11, 2023 at 05:09:17PM -0600, Julio Suarez wrote:
> Hi Maxim,
>
> > Nitpicking: Added ISB as ngx_cpu_pause() for aarch64.
>
> Yes, we can make that change.
>
> > Could you please clarify what you mean by "a bug"? An empty
> > ngx_cpu_pause() is certainly not a bug, it's just the lack of a more
> > optimal solution for the particular architecture.
>
> Agreed, not a bug. I'm on a team that focuses on performance, so
> sub-optimal performance is a "bug" to us. This is not a functional bug.
> Replacing the word "bug" with "sub-optimal" is more appropriate.
>
> When a delay like the PAUSE instruction that x86 has is added, there is
> a 2-5% increase in the number of requests/sec Arm CPUs can achieve
> under high load.
>
> Yes, the test setup is very similar to what's described here (note:
> the particular instances in the blog aren't what I tested):
>
> https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3
>
> Also, we tested on Nginx Open Source (without JWT), not Nginx-Plus as
> in the blog.
The only nginx configs I see there (or, rather, in the linked
articles) use neither shared memory zones nor thread pools.
That is, ngx_cpu_pause() is not used in these configurations at
all.
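For reference, here is a condensed sketch of how nginx actually
reaches ngx_cpu_pause() (modeled on ngx_shmtx_lock() from
src/core/ngx_shmtx.c; simplified, not the exact code):

    for ( ;; ) {

        /* pre-lock check: "*mtx->lock" is volatile, so it is
           re-read from memory on every iteration */

        if (*mtx->lock == 0 && ngx_atomic_cmp_set(mtx->lock, 0, ngx_pid)) {
            return;
        }

        if (ngx_ncpu > 1) {

            for (n = 1; n < mtx->spin; n <<= 1) {

                for (i = 0; i < n; i++) {
                    ngx_cpu_pause();
                }

                if (*mtx->lock == 0
                    && ngx_atomic_cmp_set(mtx->lock, 0, ngx_pid))
                {
                    return;
                }
            }
        }

        ngx_sched_yield();
    }

Without shared memory zones or thread pools in the configuration,
code like this is never entered.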
If you think it was actually used in your tests, could you please
provide the nginx configuration you used for testing?
If the nginx configs you used do not actually contain shared memory
zones or thread pools, the performance changes you are seeing,
even if they are statistically significant (see below about
ministat(1)), might be the result of code alignment changes.
The MySQL bug mentioned below uses -falign-{functions,jumps}=64
to minimize the effects of code alignment changes
(https://bugs.mysql.com/bug.php?id=100664); it might be worth
doing the same.
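For example (a sketch, assuming GCC; the flags themselves are the
ones from the MySQL bug report), these could be passed through
nginx's --with-cc-opt:

    ./configure --with-cc-opt="-falign-functions=64 -falign-jumps=64"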
> We tested for the max RPS of a 512B file that can be pulled through a
> reverse proxy. We select the number of upstreams to be large (8 to be
> exact); they are also high in core count (16+ CPUs). The load generator
> node is also large (64 CPUs). This ensures the bottleneck is at the
> reverse proxy. We test small files because large files make the test
> network bound, while smaller files make the test CPU bound.
>
> I tested both ISB and YIELD (will talk about YIELD further below).
>
> Results of these tests are something like this:
>
> ISB uplift from no delay across 3 runs:
>
> - 2 CPU: 1.03 - 1.22%
>
> - 4 CPU: 2.70 - 10.75% (I'm treating the 10.75% here as an outlier;
> dropping it gives ~5% on the high end of the range, hence why I'm
> just saying ~2-5% in the change log - I don't want to overstate the
> perf improvement)
>
> - 8 CPU: 1.10 - 2.33%
>
>
> YIELD uplift from no delay across 3 runs:
>
> - 2 CPU: 0 - 0.51%
>
> - 4 CPU: 0 - 1.41%
>
> - 8 CPU: 1.05 - 2.31%
>
> ISB produced the highest uplift, particularly for a 4 CPU reverse
> proxy, hence I submitted with ISB. Still, we see benefit with YIELD too.
>
> Variation comes from tearing down cloud infrastructure and redeploying.
> Results can vary depending on where you land in the data center. I'm
> intentionally leaving out exactly which HW/cloud I used in this data,
> but I can say we see similar uplift across a variety of Arm systems.
Could you please share some raw numbers from different tests? It
would be interesting to see ministat(1) results.
> With respect to using YIELD, and other projects that alternatively
> use ISB:
>
>
> With respect to ISB vs YIELD: yes, as documented, YIELD is the
> conceptually right thing to use. However, in practice it's a NOP, which
> produces a shorter delay than ISB; hence ISB appears to work better.
> Also, YIELD is intended for SMT systems (uncommon on Arm), and hence
> it's going to be a NOP on any current Arm system you'll find in the
> cloud. That said, YIELD produces uplift in RPS as well, because even a
> small delay is better than no delay. I'm 100% good with using YIELD if
> you want to stay true to what is currently documented. I was going for
> max perf uplift, which is also why some other projects are using
> ISB. Whether it's YIELD or ISB, a revisit with WFET would be in order in
> the more distant future. For today, YIELD or ISB would work better than
> nothing (as it currently is). If YIELD is more acceptable, then I can do
> YIELD.
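For context, the two variants under discussion would look roughly
like this in src/os/unix/ngx_atomic.h (a sketch following the
existing x86 style, where ngx_cpu_pause() is defined as
__asm__ ("pause")):

    #if ( __aarch64__ )

    /* ISB: stalls the frontend until in-flight instructions retire,
       giving a longer delay; the variant proposed in this patch */
    #define ngx_cpu_pause()  __asm__ ("isb")

    /* YIELD: the architecturally documented spin-wait hint, but a
       NOP on current non-SMT Arm cores */
    /* #define ngx_cpu_pause()  __asm__ ("yield") */

    #endif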
>
> Projects that previously used YIELD and switched to ISB after noting a
> performance improvement (I don't think these projects shared data
> anywhere; we just have to take their word):
>
> MongoDB:
> https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61
>
> MySQL:
> https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108
>
> Jemalloc:
> https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436
Thanks for the links.
For the record, here are relevant commits / pull requests:
https://github.com/wiredtiger/wiredtiger/pull/6080
https://github.com/mongodb/mongo/commit/6979525674af67405984c58585766dd4d0c3f2a8
https://bugs.mysql.com/bug.php?id=100664
https://github.com/mysql/mysql-server/commit/f2a4ed5b65a6c03ee1bea60b8c3bb4db64dbed10
https://github.com/jemalloc/jemalloc/pull/2205
https://github.com/jemalloc/jemalloc/commit/89fe8ee6bf7a23556350d883a310c0224a171879
At least the MySQL bug seems to have some interesting details.
> > Could you please clarify the reasons for the "memory" clobber here?
>
> Putting in the memory clobber for ISB is redundant because ISB is a
> barrier itself, but it's probably the GCC-appropriate thing to do. I
> also like it as a hint for someone not familiar with ISB. ISB pauses
> the frontend (fetch-decode) to allow the CPU backend (execute-retire)
> to finish whatever operations are in flight. It's possible that some of
> those operations are writes to memory. Hence we should tell the
> compiler "this instruction may update memory".
I don't think this interpretation is correct - memory is updated
by other instructions, not ISB. And it's the compiler that emitted
these instructions, so it knows perfectly well that they will
eventually update memory.
Note that the ngx_cpu_pause() call does not need to be a barrier,
neither a hardware barrier nor a compiler barrier. Instead, nginx
relies on using volatile variables for locks, so these are always
loaded from memory by the compiler on pre-lock checks (and the
updated value will be seen if it's changed by other CPUs), and on
the proper barrier semantics of ngx_atomic_cmp_set().
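To illustrate (a simplified sketch, not the exact nginx
declarations): with the lock variable typed volatile, the compiler
has to re-read it on each check regardless of what clobbers
ngx_cpu_pause() declares:

    typedef volatile ngx_atomic_uint_t  ngx_atomic_t;

    /* *lock is loaded from memory on every iteration even if
       ngx_cpu_pause() expands to an asm without any clobbers */

    while (*lock) {
        ngx_cpu_pause();
    }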
The "memory" clobber essentially tells compiler that it cannot
optimize stores and loads across the call, and must reload
anything from memory after the call. While this might be expected
in other uses (for example, cpu_relax() in Linux is expected to be
a compiler barrier), this is not something needed in
ngx_cpu_pause().
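In other words, the difference is between something like the
following two definitions (a sketch; per the above, the second
form is sufficient for nginx):

    /* with a clobber: also a compiler barrier, forcing GCC to
       spill and reload values around every spin iteration */
    #define ngx_cpu_pause()  __asm__ __volatile__ ("isb" ::: "memory")

    /* without: the CPU still gets the delay, and the compiler may
       keep unrelated values in registers across the call */
    #define ngx_cpu_pause()  __asm__ ("isb")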
[...]
--
Maxim Dounin
http://mdounin.ru/