[PATCH] Added asm ISB as asm pause for ngx_cpu_pause() for aarch64

Thu Dec 14 04:53:49 UTC 2023

Hello!

On Wed, Dec 13, 2023 at 04:16:15PM -0600, Julio Suarez wrote:

> 1.
> 
> Yes, double checked configuration (what I'm running isn't exactly what's 
> in that link). No shared memory zones or thread pools enabled. Sounds 
> like a change in configuration is needed to test this.
> 
> Would enabling proxy_cache_path be sufficient for this, or should this 
> be done another way?
> 
> When proxy_cache_path is enabled, I see calls to ngx_shmtx_lock & 
> ngx_shmtx_unlock in the profile. The assembly annotations are also 
> showing isb being executed (when I put in the ISB). I could try testing 
> like this with both ISB & YIELD. Looking for guidance if you think it's 
> worth a try. Overall, I'd like to sort out if the fact that there is no 
> ngx_cpu_pause on aarch64 is sub optimal. The missing ngx_cpu_pause means 
> there is no wait and subsequently, there is also no back off mechanism 
> because the empty for loop is optimized away.

In general I think it would be non-trivial to construct a workload 
which will be able to demonstrate a difference, if there is one at 
all, especially on platforms with posix semaphores available.  And 
that's the reason for my initial question about how you got the 
numbers.

The proxy_cache_path alone is certainly not enough.  At least you 
have to actually enable caching with the proxy_cache directive.  
And most likely you'll have to play with the number of nginx 
worker processes and the workload to achieve at least some level 
of lock contention.
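
For reference, the code path that such contention would exercise is 
the spin phase of the lock.  Below is a simplified sketch of a 
spin-with-back-off loop in the spirit of ngx_shmtx_lock(); it is not 
the actual nginx code.  The names sketch_cpu_pause() and 
sketch_spin_trylock() are hypothetical, C11 atomics stand in for 
nginx's own atomic operations, and the ISB variant is simply the one 
proposed in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for ngx_cpu_pause().  ISB is the aarch64
 * variant proposed in this thread; the last fallback is only a
 * compiler barrier and introduces no hardware delay. */
#if defined(__aarch64__)
#define sketch_cpu_pause()  __asm__ volatile ("isb" ::: "memory")
#elif defined(__x86_64__) || defined(__i386__)
#define sketch_cpu_pause()  __asm__ volatile ("pause" ::: "memory")
#else
#define sketch_cpu_pause()  __asm__ volatile ("" ::: "memory")
#endif

/* One lock attempt: cheap load first, then compare-and-swap. */
static int
sketch_try(atomic_uint *lock, unsigned owner)
{
    unsigned  expected = 0;

    return atomic_load(lock) == 0
           && atomic_compare_exchange_strong(lock, &expected, owner);
}

/*
 * Try to take the lock, pausing 1, 2, 4, ... times between attempts
 * for as long as the pause count stays below "spin".  Returns 1 on
 * success, 0 if the caller should stop spinning and sleep instead
 * (for example on a semaphore, or via sched_yield()).
 */
static int
sketch_spin_trylock(atomic_uint *lock, unsigned owner, unsigned spin)
{
    unsigned  i, n;

    assert(lock != NULL && owner != 0);

    if (sketch_try(lock, owner)) {
        return 1;
    }

    for (n = 1; n < spin; n <<= 1) {

        for (i = 0; i < n; i++) {
            sketch_cpu_pause();
        }

        if (sketch_try(lock, owner)) {
            return 1;
        }
    }

    return 0;
}
```

With an empty pause macro the inner loop has no side effects and the 
compiler is free to drop it, so the back-off disappears.  Whether 
that matters in practice can only be seen when several workers 
actually compete for the lock.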

Further, some effects simply cannot be seen from performance tests 
alone.  For example, consider two different instructions which 
introduce exactly the same delay, one of them due to an explicitly 
requested processor pause, and the other due to a calculation which 
takes the same time.  There will be no performance difference 
between the two - still, there will be a difference in the power 
consumed by the CPU.
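
To illustrate, here is a minimal sketch of two such delay loops.  
The function names are hypothetical, and the loops are not 
calibrated to take the same time; that would have to be done 
separately for a given CPU.  YIELD is used as the aarch64 hint here, 
ISB could be substituted.

```c
#include <stdint.h>

/* Delay by issuing an explicit processor hint n times.  The
 * "memory" clobber keeps the compiler from removing the loop. */
static uint64_t
delay_with_hint(uint64_t n)
{
    uint64_t  i;

    for (i = 0; i < n; i++) {
#if defined(__aarch64__)
        __asm__ volatile ("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
        __asm__ volatile ("pause" ::: "memory");
#else
        __asm__ volatile ("" ::: "memory");
#endif
    }

    return i;
}

/* Delay by doing throwaway arithmetic n times.  The volatile sink
 * forces each iteration to be executed. */
static uint64_t
delay_with_work(uint64_t n)
{
    uint64_t           i;
    volatile uint64_t  sink = 0;

    for (i = 0; i < n; i++) {
        sink = sink * 6364136223846793005ULL + 1442695040888963407ULL;
    }

    return i;
}
```

If the iteration counts are tuned so that both functions take the 
same wall-clock time, a throughput benchmark has no way to tell them 
apart.  Only a power measurement could show the difference.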

> 2.
> 
> For code alignment question, I tried -falign-{functions,jumps}=64. 
> ministat says no diff.
> 
> x Baseline
> + BaselinewAlign
> +----------------------------------------------------------------------+
> |                           xx*                                        |
> |+             x   + + x+   *x*   ++ x+   ++*+   x  x        +   x    x|
> |                     |_______M______A_______________|                 |
> |                  |_____________AM____________|                       |
> +----------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  15        129548        131751        130154        130442     622.46629
> +  15        129000        131376        130306        130273     551.93064
> No difference proven at 95.0% confidence

This might indicate you've measured some other effect, and not the 
alignment.  Also, it might be worth checking in the compiled result 
that the alignment is actually applied.

(Note that the text/plain part of your message contains garbled 
text, I've restored the above quote manually from the text/html 
part.  It might be worth switching to plain text in your mail client 
for further messages here.)
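
One way to check this from inside a test build is to look at 
function addresses at runtime.  This is only a sketch: the helper 
names and sketch_probe() are hypothetical, and converting a function 
pointer to uintptr_t is not guaranteed by the C standard, though it 
works with gcc and clang on common platforms.  Looking at symbol 
addresses with nm or objdump is another option.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of align (a power of two). */
static int
is_aligned_to(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    return (addr & (align - 1)) == 0;
}

/* Returns 1 if the function's entry point is aligned to "align". */
static int
function_is_aligned(void (*fn)(void), uintptr_t align)
{
    return is_aligned_to((uintptr_t) fn, align);
}

/* Hypothetical function whose placement is being inspected. */
static void
sketch_probe(void)
{
}
```

With -falign-functions=64 in effect, function_is_aligned(sketch_probe, 
64) would be expected to return 1.  Without the option it may or may 
not, depending on where the function happens to land.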

> 3.
> 
> ministat for comparing blank ngx_cpu_pause() to ISB & YIELD (no memory 
> clobber).
> 
> Ministat says significant difference. I have seen it where ISB returns 
> like ~10% +/- ~2%, however, I'm going to discount that as cloud 
> variation/noise. A "lucky run".
> 
> That said, it sounds like this is some kind of side effect of adding 
> this into the binary as you mentioned previously. This diff is oddly 
> consistent though, or at least oddly consistent dumb luck.
> 
> x Baseline
> + ISB
> * YIELD
> +--------------------------------------------------------------------------------+
> |          xxx                           * +    +           +                    |
> |x   +  x  xxx    x    **  *xx ***  *  x ****  *+ + *  +    *        +          +|
> |     |______M____A___________|                                                  |
> |                                  |______________MA_______________|             |
> |                           |_________A__M_______|                               |
> +--------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  15        129548        131751        130154        130442     622.46629
> +  15     129778.64     133639.52      132108.5     132135.41     844.66065
> Difference at 95.0% confidence
>         1693.41 +/- 554.832
>         1.29821% +/- 0.425348%
>         (Student's t, pooled s = 741.929)
> *  15        130679        132621        131596     131486.47     540.21198
> Difference at 95.0% confidence
>         1044.47 +/- 435.826
>         0.800713% +/- 0.334115%
>         (Student's t, pooled s = 582.792)

That's without any caching being used, that is, basically just a 
result of slightly different compilation, correct?

This might be seen as a reference point for how much a slightly 
different compilation can affect performance.  We've previously seen 
cases of 2-3% performance improvement observed as a result of a 
nop change, and these results seem to be in line.

Tuning compilation to ensure there is no difference here might be 
the way to go.

-- 
Maxim Dounin
http://mdounin.ru/