<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi,</p>

    <p><br>

    </p>

    <p>This is very discussion helpful.</p>

    <p><br>

    </p>

    <p>1.<br>

    </p>

    <p>Yes, double checked configuration (what I'm running isn't exactly

      what's in that link). No shared memory zones or thread pools

      enabled. Sounds like a change in configuration is needed to test

      this.</p>

    <p>Would enabling proxy_cache_path be sufficient for this, or should

      this be done another way?</p>

    <p><br>

    </p>

    <p>When proxy_cache_path is enabled, I see calls to ngx_shmtx_lock

      & ngx_shmtx_unlock in the profile. The assembly annotations

      are also showing isb being executed (when I put in the ISB). I

      could try testing like this with both ISB & YIELD. Looking for

      guidance if you think it's worth a try. Overall, I'd like to sort

      out if the fact that there is no ngx_cpu_pause on aarch64 is sub

      optimal. The missing ngx_cpu_pause means there is no wait and

      subsequently, there is also no back off mechanism because the

      empty for loop is optimized away.<br>

    </p>

    <p><br>

    </p>

    <p>2.<br>

    </p>

    <p>For code alignment question, I tried <span

      style="white-space: pre-wrap">-falign-{functions,jumps}=64. ministat say's no diff.</span></p>

    <pre><span style="white-space: pre-wrap">x Baseline

+ BaselinewAlign

+----------------------------------------------------------------------+

|                           xx*                                        |

|+             x   + + x+   *x*   ++ x+   ++*+   x  x        +   x    x|

|                     |_______M______A_______________|                 |

|                  |_____________AM____________|                       |

+----------------------------------------------------------------------+

    N           Min           Max        Median           Avg        Stddev

x  15        129548        131751        130154        130442     622.46629

+  15        129000        131376        130306        130273     551.93064

No difference proven at 95.0% confidence

</span></pre>

    <p><span style="white-space: pre-wrap">

</span></p>

    <p><span style="white-space: pre-wrap">3.</span></p>

    <p><span style="white-space: pre-wrap">ministat for comparing blank ngx_cpu_pause() to ISB & YIELD (no memory clobber). </span></p>

    <p><span style="white-space: pre-wrap">Ministat say's significant difference. I have see it where ISB returns like ~10% +/- ~2%, however, I'm going to discount that as cloud variation/noise. A "lucky run".</span></p>

    <p><span style="white-space: pre-wrap">That said, it sounds like this is some kind of side effect of adding this into the binary as you mentioned previously. This diff oddly consistent though, or at least oddly consistent dumb luck.

</span></p>

    <p><span style="white-space: pre-wrap">

</span></p>

    <pre><span style="white-space: pre-wrap">x Baseline

+ ISB

* YIELD

+--------------------------------------------------------------------------------+

|          xxx                           * +    +           +                    |

|x   +  x  xxx    x    **  *xx ***  *  x ****  *+ + *  +    *        +          +|

|     |______M____A___________|                                                  |

|                                  |______________MA_______________|             |

|                           |_________A__M_______|                               |

+--------------------------------------------------------------------------------+

    N           Min           Max        Median           Avg        Stddev

x  15        129548        131751        130154        130442     622.46629

+  15     129778.64     133639.52      132108.5     132135.41     844.66065

Difference at 95.0% confidence

        1693.41 +/- 554.832

        1.29821% +/- 0.425348%

        (Student's t, pooled s = 741.929)

*  15        130679        132621        131596     131486.47     540.21198

Difference at 95.0% confidence

        1044.47 +/- 435.826

        0.800713% +/- 0.334115%

        (Student's t, pooled s = 582.792)

</span></pre>

    <p><span style="white-space: pre-wrap">

</span></p>

    <p><span style="white-space: pre-wrap">4.</span></p>

    <p><span style="white-space: pre-wrap">ACK on ISB memory clobber points. Makes sense.

</span></p>

    <p><span style="white-space: pre-wrap">

</span></p>

    <p></p>

    <div class="moz-cite-prefix">On 12/12/23 18:16, Maxim Dounin wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:ZXj37gmV79Ygfo50@mdounin.ru">

      <pre class="moz-quote-pre" wrap="">Hello!

On Mon, Dec 11, 2023 at 05:09:17PM -0600, Julio Suarez wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Hi Maxim,

</pre>

        <blockquote type="cite">

          <pre class="moz-quote-pre" wrap="">Nitpicking: Added ISB as ngx_cpu_pause() for aarch64.

</pre>

        </blockquote>

        <pre class="moz-quote-pre" wrap="">

Yes, we can make that change.

</pre>

        <blockquote type="cite">

          <pre class="moz-quote-pre" wrap="">Could you please clarify what do you mean by "a bug"? An empty 

ngx_cpu_pause() is certainly not a bug, it's just a lack of a more 

optimal solution for the particular architecture.

</pre>

        </blockquote>

        <pre class="moz-quote-pre" wrap="">

Agree, not a bug. I'm in a team that focuses on performance, so 

sub-optimal performance is a "bug" to us. This is not a functional bug. 

Replacing the word bug with "sub-optimal" is more appropriate.

When a delay like the PAUSE that is there for x86 is added, there is a 

2-5% increase in the number of requests/sec Arm CPUs can achieve under 

high load.

Yes, the test setup is very similar to what's described here (note, 

those particular instances in the blog isn't what I tested):

<a class="moz-txt-link-freetext" href="https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3">https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/nginx-performance-on-graviton-3</a>

Also, we tested on Nginx Open Source (without JWT), not Nginx-Plus like 

in the blog.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

The only nginx configs I see there (or, rather, in linked 

articles) do not use neither shared memory zones nor thread pools.  

That is, ngx_cpu_pause() is not used in these configurations at 

all.

If you think it was actually used in your tests, could you please 

provide nginx configuration you've used for testing?

If nginx configs you've used do not actually contain shared memory 

zones or thread pools, the performance changes you are seeing, 

even if these are statistically significant (see below about 

ministat(1)), might be the result of code alignment changes.  

MySQL bug mentioned below uses -falign-{functions,jumps}=64

to minimize effects of the code alignment changes 

(<a class="moz-txt-link-freetext" href="https://bugs.mysql.com/bug.php?id=100664">https://bugs.mysql.com/bug.php?id=100664</a>), it might worth to do 

the same.

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">We tested for the max RPS of a 512B file that can be pulled through a 

reverse proxy. We select the number of upstreams to be large (8 to be 

exact), they are also high in core count (16+ CPU). The load generator 

node is also large (64 CPUs). This ensures the bottleneck is at the 

reverse proxy. We test small files because large files make the test 

network bounded, while smaller files make the test CPU bounded.

I tested both ISB and YIELD (will talk about YIELD further below).

Results of these tests are something like this:

ISB uplift from no delay across 3 runs:

- 2 CPU: 1.03 - 1.22%

- 4 CPU: 2.70 - 10.75% (I'm treating the 10.75% here as an outlier, 

dropping that 10.75% gets ~5% on the high end of the range, hence why 

I'm just saying ~2-5% in change log, I don't want to overstate the perf 

improvement)

- 8 CPU: 1.1 -2.33%

YIELD uplift from no delay across 3 runs:

- 2 CPU: 0 - 0.51%

- 4 CPU: 0 - 1.41%

- 8 CPU: 1.05 - 2.31%

ISB produced the highest uplift, particularly for a 4 CPU reverse proxy. 

Hence why I submitted with ISB. Still, we see benefit with YIELD too.

Variation comes from tearing down cloud infrastructure and redeploying. 

Results can vary depending on where you land in the data center. I'm 

intentionally leaving out exactly which HW/cloud I used in this data, 

but I can say we see similar uplift across a variety of Arm systems.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Could you please share some raw numbers from different tests?  It 

would be interesting to see ministat(1) results.

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">With respect to using YIELD and other projects that use alternatively 

use ISB:

With respect to ISB Vs YIELD. Yes, as documented, YIELD is the 

conceptually right thing to use. However, in practice, it's a NOP which 

produces a shorter delay than ISB. Hence why ISB appears to work better. 

Also, YIELD is intended for SMT systems (uncommon on Arm), and hence, 

it's going to be a NOP for any current Arm system you'll find in the 

cloud. That said, YIELD produces uplift in RPS as well because even a 

small delay is better than no delay. I'm 100% good with using YIELD if 

you want to stay true to what is currently documented. I was going for 

max perf uplift which is also why some other projects are also using 

ISB. Whether it's YIELD or ISB, a revisit with WFET would be in order in 

the more distant future. For today, YIELD or ISB would work better than 

nothing (as it currently is). If YIELD is more acceptable, then I can do 

YIELD.

Projects that previously used YIELD and switched to ISB after noting 

performance improvement (I don't think these projects shared data 

anywhere, we just have to take their word):

MongoDB:

<a class="moz-txt-link-freetext" href="https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61">https://github.com/mongodb/mongo/blob/b7a92e4194cca52665e01d81dd7f9b037b59b362/src/mongo/platform/pause.h#L61</a>

MySQL:

<a class="moz-txt-link-freetext" href="https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108">https://github.com/mysql/mysql-server/blob/87307d4ddd88405117e3f1e51323836d57ab1f57/storage/innobase/include/ut0ut.h#L108</a>

Jemalloc:

<a class="moz-txt-link-freetext" href="https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436">https://github.com/jemalloc/jemalloc/blob/e4817c8d89a2a413e835c4adeab5c5c4412f9235/configure.ac#L436</a>

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

Thanks for the links.

For the record, here are relevant commits / pull requests:

<a class="moz-txt-link-freetext" href="https://github.com/wiredtiger/wiredtiger/pull/6080">https://github.com/wiredtiger/wiredtiger/pull/6080</a>

<a class="moz-txt-link-freetext" href="https://github.com/mongodb/mongo/commit/6979525674af67405984c58585766dd4d0c3f2a8">https://github.com/mongodb/mongo/commit/6979525674af67405984c58585766dd4d0c3f2a8</a>

<a class="moz-txt-link-freetext" href="https://bugs.mysql.com/bug.php?id=100664">https://bugs.mysql.com/bug.php?id=100664</a>

<a class="moz-txt-link-freetext" href="https://github.com/mysql/mysql-server/commit/f2a4ed5b65a6c03ee1bea60b8c3bb4db64dbed10">https://github.com/mysql/mysql-server/commit/f2a4ed5b65a6c03ee1bea60b8c3bb4db64dbed10</a>

<a class="moz-txt-link-freetext" href="https://github.com/jemalloc/jemalloc/pull/2205">https://github.com/jemalloc/jemalloc/pull/2205</a>

<a class="moz-txt-link-freetext" href="https://github.com/jemalloc/jemalloc/commit/89fe8ee6bf7a23556350d883a310c0224a171879">https://github.com/jemalloc/jemalloc/commit/89fe8ee6bf7a23556350d883a310c0224a171879</a>

At least MySQL bug seems to have some interesting details.

</pre>

      <blockquote type="cite">

        <blockquote type="cite">

          <pre class="moz-quote-pre" wrap="">Could you please clarify reasons for the "memory" clobber here?

</pre>

        </blockquote>

        <pre class="moz-quote-pre" wrap="">

Putting in the memory clobber for ISB is redundant because ISB is a 

barrier itself, but it's probably the GCC appropriate thing to do. I 

also like it as a hint for someone not familiar with ISB. ISB will pause 

the frontend (fetch-decode) to allow the CPU backend (execute-retire) to 

finish whatever operations are in flight. It's possible that some of 

those operations are writes to memory. Hence why we should tell the 

compiler "this instruction may update memory".

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

I don't think this interpretation is correct - memory is updated 

by other instructions, not ISB.  And it's the compiler who emitted 

these instructions, so it perfectly knows that these will 

eventually update memory.

Note that the ngx_cpu_pause() call does not need to be a barrier, 

neither a hardware barrier nor a compiler barrier.  Instead, nginx 

relies on using volatile variables for locks, so these are always 

loaded from memory by the compiler on pre-lock checks (and will 

test the updated value if it's changed by other CPUs), and proper 

barrier semantics of ngx_atomic_cmp_set().

The "memory" clobber essentially tells compiler that it cannot 

optimize stores and loads across the call, and must reload 

anything from memory after the call.  While this might be expected 

in other uses (for example, cpu_relax() in Linux is expected to be 

a compiler barrier), this is not something needed in 

ngx_cpu_pause().

[...]

</pre>

    </blockquote>

  </body>

</html>