<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi Maxim,</p>

    <pre class="moz-quote-pre" wrap=""><font color="#1a5fb4">Nitpicking:

Added ISB as ngx_cpu_pause() for aarch64.</font></pre>

    <p></p>

    <p>Yes, we can make that change.</p>

    <pre class="moz-quote-pre" wrap=""><font color="#1a5fb4">

Could you please clarify what do you mean by "a bug"?  An empty 

ngx_cpu_pause() is certainly not a bug, it's just a lack of a more 

optimal solution for the particular architecture.

</font></pre>

    <p></p>

    <p>Agree, not a bug. I'm in a team that focuses on performance, so

      sub-optimal performance is a "bug" to us. This is not a functional

      bug. Replacing the word bug with "sub-optimal" is more

      appropriate.<br>

    </p>

    <pre class="moz-quote-pre" wrap=""><font color="#1a5fb4">When a delay like the PAUSE that is there for x86 is added,

there is a 2-5% increase in the number of requests/sec Arm CPUs

can achieve under high load.</font></pre>

    <p></p>

    <p>Yes, the test setup is very similar to what's described here

      (note, those particular instances in the blog isn't what I

      tested):</p>

    <p><a class="moz-txt-link-freetext" href="https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3">https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3</a></p>

    <p>Also, we tested on Nginx Open Source (without JWT), not

      Nginx-Plus like in the blog.<br>

    </p>

    <p><br>

      We tested for the max RPS of a 512B file that can be pulled

      through a reverse proxy. We select the number of upstreams to be

      large (8 to be exact), they are also high in core count (16+ CPU).

      The load generator node is also large (64 CPUs). This ensures the

      bottleneck is at the reverse proxy. We test small files because

      large files make the test network bounded, while smaller files

      make the test CPU bounded. </p>

    <p>I tested both ISB and YIELD (will talk about YIELD further

      below).<br>

    </p>

    <p>Results of these tests are something like this:<br>

    </p>

    <p>ISB uplift from no delay across 3 runs:</p>

    <p>- 2 CPU: 1.03 - 1.22%</p>

    <p>- 4 CPU: 2.70 - 10.75% (I'm treating the 10.75% here as an

      outlier, dropping that 10.75% gets ~5% on the high end of the

      range, hence why I'm just saying ~2-5% in change log, I don't want

      to overstate the perf improvement)<br>

    </p>

    <p>- 8 CPU: 1.1 -2.33%<br>

    </p>

    <p><br>

    </p>

    <p>YIELD uplift from no delay across 3 runs:</p>

    <p>- 2 CPU: 0 - 0.51%</p>

    <p>- 4 CPU: 0 - 1.41%</p>

    <p>- 8 CPU: 1.05 - 2.31%<br>

    </p>

    <p></p>

    <p>ISB produced the highest uplift, particularly for a 4 CPU reverse

      proxy. Hence why I submitted with ISB. Still, we see benefit with

      YIELD too.</p>

    <p>Variation comes from tearing down cloud infrastructure and

      redeploying. Results can vary depending on where you land in the

      data center. I'm intentionally leaving out exactly which HW/cloud

      I used in this data, but I can say we see similar uplift across a

      variety of Arm systems.<br>

    </p>

    <p><br>

      <font color="#1a5fb4">With respect to using YIELD and other

        projects that use alternatively use ISB:</font></p>

    <p><br>

    </p>

    <p>With respect to ISB Vs YIELD. Yes, as documented, YIELD is the

      conceptually right thing to use. However, in practice, it's a NOP

      which produces a shorter delay than ISB. Hence why ISB appears to

      work better. Also, YIELD is intended for SMT systems (uncommon on

      Arm), and hence, it's going to be a NOP for any current Arm system

      you'll find in the cloud. That said, YIELD produces uplift in RPS

      as well because even a small delay is better than no delay. I'm

      100% good with using YIELD if you want to stay true to what is

      currently documented. I was going for max perf uplift which is

      also why some other projects are also using ISB. Whether it's

      YIELD or ISB, a revisit with WFET would be in order in the more

      distant future. For today, YIELD or ISB would work better than

      nothing (as it currently is). If YIELD is more acceptable, then I

      can do YIELD.<br>

      <br>

      Projects that previously used YIELD and switched to ISB after

      noting performance improvement (I don't think these projects

      shared data anywhere, we just have to take their word):<br>

      <br>

      MongoDB:<br>

<a class="moz-txt-link-freetext" href="https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61">https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61</a><br>

      <br>

      MySQL: <br>

<a class="moz-txt-link-freetext" href="https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108">https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108</a><br>

      <br>

      Jemalloc:<br>

<a class="moz-txt-link-freetext" href="https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436">https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436</a><br>

    </p>

    <p><br>

    </p>

    <pre class="moz-quote-pre" wrap=""><font color="#1a5fb4">Could you please clarify reasons for the "memory" clobber here?</font></pre>

    <p></p>

    <p>Putting in the memory clobber for ISB is redundant because ISB is

      a barrier itself, but it's probably the GCC appropriate thing to

      do. I also like it as a hint for someone not familiar with ISB.

      ISB will pause the frontend (fetch-decode) to allow the CPU

      backend (execute-retire) to finish whatever operations are in

      flight. It's possible that some of those operations are writes to

      memory. Hence why we should tell the compiler "this instruction

      may update memory".</p>

    <p><br>

    </p>

    <p>__________________________________________________________________<br>

    </p>

    <pre class="moz-quote-pre" wrap="">Hello!

Thank you for the patch.

Some comments and questions below.

On Wed, Dec 06, 2023 at 10:06:57AM -0600, <a

    class="moz-txt-link-abbreviated moz-txt-link-freetext"

    href="mailto:julio.suarez@foss.arm.com">julio.suarez@foss.arm.com</a> wrote:

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap=""># HG changeset patch

# User Julio Suarez <a class="moz-txt-link-rfc2396E"

      href="mailto:julio.suarez@arm.com"><julio.suarez@arm.com></a>

# Date 1701877879 21600

#      Wed Dec 06 09:51:19 2023 -0600

# Node ID 53d289b8676fc678ca90e02ece174300a3631f79

# Parent  f366007dd23a6ce8e8427c1b3042781b618a2ade

Added asm ISB as asm pause for ngx_cpu_pause() for aarch64

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap="">Nitpicking:

Added ISB as ngx_cpu_pause() for aarch64.

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap="">For aarch64 (A.K.A. Arm64), ngx_cpu_pause() evaluates to empty by the

GCC preprocessor. This results in an empty for loop in the lock

check code in ngx_rwlock.c/ngx_spinlock.c (a bug).

Thus, on Arm CPUs, there is no wait between checks of a lock.

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap="">Could you please clarify what do you mean by "a bug"?  An empty 

ngx_cpu_pause() is certainly not a bug, it's just a lack of a more 

optimal solution for the particular architecture.

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap="">When a delay like the PAUSE that is there for x86 is added,

there is a 2-5% increase in the number of requests/sec Arm CPUs

can achieve under high load.

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap="">Could you please provide some details on how did you get these 

numbers?

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap="">Currently, options for a spin lock delay element

are YIELD and ISB. YIELD is assembled into a NOP for most Arm CPUs.

YIELD is for Arm implementations with SMT.

Most Arm implementations are single core - single thread (non-SMT).

Thus, this commit uses ISB (Instruction Barrier Sync) since it

will provide a slightly longer delay than a NOP.

Other projects that implement spin locks have used ISB as

the delay element which has yielded performance gains as well.

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap="">Looking through various open source projects, I'm seeing the 

following uses on arm64 CPUs:

FreeBSD, sys/arm64/include/cpu.h:

#define cpu_spinwait()          __asm __volatile("yield" ::: "memory")

FreeBSD, lib/libthr/arch/aarch64/include/pthread_md.h:

#define CPU_SPINWAIT

Linux, arch/arm64/include/asm/vdso/processor.h:

static inline void cpu_relax(void)

{

        asm volatile("yield" ::: "memory");

}

The only popular project I was able to find which uses ISB is 

Rust:

<a class="moz-txt-link-freetext"

href="https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807">https://github.com/rust-lang/rust/commit/c064b6560b7ce0adeb9bbf5d7dcf12b1acb0c807</a>

>From the commit log it looks like it mostly focuses on the delay 

introduced by the instruction, ignoring other effects. In 

particular, YIELD is expected to be more friendly for various 

emulation environments, see Linux commit here:

<a class="moz-txt-link-freetext"

href="https://github.com/torvalds/linux/commit/1baa82f48030f38d1895301f1ec93acbcb3d15db">https://github.com/torvalds/linux/commit/1baa82f48030f38d1895301f1ec93acbcb3d15db</a>

Overall, the YIELD instruction seems to be better suited and 

specifically designed for the task in question, as per the 

documentation

(<a class="moz-txt-link-freetext"

href="https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/YIELD--YIELD">https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/YIELD--YIELD</a>-):

: YIELD is a hint instruction. Software with a multithreading 

: capability can use a YIELD instruction to indicate to the PE that 

: it is performing a task, for example a spin-lock, that could be 

: swapped out to improve overall system performance. The PE can use 

: this hint to suspend and resume multiple software threads if it 

: supports the capability.

Please also note that ngx_cpu_pause() is only used on 

multiprocessor systems: the code checks if (ngx_ncpu > 1) before 

using it.

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap="">Last, ISB is not an architecturally defined solution

for short spin lock delays.

A candidate for a short delay solution is the WFET

instruction (Wait For Event with Timeout).

However, there aren't any Arm implementations in the market that

support this instruction yet. When that occurs,

WFET can be tested as a replacement for ISB. Until then,

ISB will do.

diff -r f366007dd23a -r 53d289b8676f src/os/unix/ngx_atomic.h

--- a/src/os/unix/ngx_atomic.h  Tue Nov 14 15:26:02 2023 +0400

+++ b/src/os/unix/ngx_atomic.h  Wed Dec 06 09:51:19 2023 -0600

@@ -66,6 +66,8 @@

 #if ( __i386__ || __i386 || __amd64__ || __amd64 )

 #define ngx_cpu_pause()             __asm__ ("pause")

+#elif ( __aarch64__ )

+#define ngx_cpu_pause()             __asm__ ("isb" ::: "memory")

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap="">Could you please clarify reasons for the "memory" clobber here?

</pre>

    <blockquote type="cite" style="color: #007cff;">

      <pre class="moz-quote-pre" wrap=""> #else

 #define ngx_cpu_pause()

 #endif

</pre>

    </blockquote>

    <pre class="moz-quote-pre" wrap=""><div class="moz-txt-sig">-- 

Maxim Dounin

<a class="moz-txt-link-freetext" href="http://mdounin.ru/">http://mdounin.ru/</a></div></pre>

    <p></p>

  </body>

</html>