[PATCH] Added asm ISB as asm pause for ngx_cpu_pause() for aarch64
Maxim Dounin
mdounin at mdounin.ru
Thu Dec 14 04:53:49 UTC 2023
Hello!
On Wed, Dec 13, 2023 at 04:16:15PM -0600, Julio Suarez wrote:
> 1.
>
> Yes, double checked configuration (what I'm running isn't exactly what's
> in that link). No shared memory zones or thread pools enabled. Sounds
> like a change in configuration is needed to test this.
>
> Would enabling proxy_cache_path be sufficient for this, or should this
> be done another way?
>
> When proxy_cache_path is enabled, I see calls to ngx_shmtx_lock &
> ngx_shmtx_unlock in the profile. The assembly annotations are also
> showing isb being executed (when I put in the ISB). I could try testing
> like this with both ISB & YIELD. Looking for guidance if you think it's
> worth a try. Overall, I'd like to sort out if the fact that there is no
> ngx_cpu_pause on aarch64 is suboptimal. The missing ngx_cpu_pause means
> there is no wait and subsequently, there is also no back off mechanism
> because the empty for loop is optimized away.
In general, I think it would be non-trivial to construct a workload
that demonstrates a difference, if there is one at all,
especially on platforms with POSIX semaphores available. That
is the reason for my initial question about how you got
the numbers.
The proxy_cache_path alone is certainly not enough. At least you
have to actually enable caching with the proxy_cache directive.
And most likely you'll have to play with the number of nginx
worker processes and the workload to achieve at least some level
of lock contention.
Further, some effects simply cannot be seen from performance
tests alone. For example, consider two different
instructions which introduce exactly the same delay, one of
them due to an explicitly requested processor pause, and the other
due to a calculation which takes the same time. There will be
no performance difference between the two, yet there will be a
difference in the power consumed by the CPU.
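To illustrate what is being discussed, here is a minimal sketch of a
spin-then-back-off lock where the wait is an explicit pause hint rather
than an empty loop the compiler can drop. The names cpu_pause(),
spin_mutex_t, spin_trylock(), spin_lock_bounded() and spin_unlock() are
hypothetical and are not the nginx definitions; the "memory" clobber and
the choice of ISB on aarch64 are assumptions made for the example only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a CPU pause hint; NOT the nginx macro. */
#if defined(__aarch64__)
#define cpu_pause()  __asm__ volatile ("isb" ::: "memory")
#elif defined(__i386__) || defined(__x86_64__)
#define cpu_pause()  __asm__ volatile ("pause" ::: "memory")
#else
#define cpu_pause()  ((void) 0)   /* no hint: the wait loop may vanish */
#endif

typedef struct {
    atomic_uint  lock;            /* 0 = free, otherwise the owner's id */
} spin_mutex_t;

/* Cheap relaxed read first, then a compare-and-swap to claim the lock. */
static bool
spin_trylock(spin_mutex_t *m, unsigned id)
{
    unsigned  expected = 0;

    return atomic_load_explicit(&m->lock, memory_order_relaxed) == 0
           && atomic_compare_exchange_strong(&m->lock, &expected, id);
}

/*
 * Try to take the lock, pausing 1, 2, 4, ... times between attempts
 * while the pause count stays below `spin`.  Returns false if the lock
 * is still held after the last round; a real implementation would
 * then block on a semaphore or yield instead of giving up.
 */
static bool
spin_lock_bounded(spin_mutex_t *m, unsigned id, unsigned spin)
{
    unsigned  i, n;

    assert(id != 0);              /* 0 is reserved for "free" */

    if (spin_trylock(m, id)) {
        return true;
    }

    for (n = 1; n < spin; n <<= 1) {
        for (i = 0; i < n; i++) {
            cpu_pause();
        }

        if (spin_trylock(m, id)) {
            return true;
        }
    }

    return false;
}

/* Release the lock only if it is held by `id`. */
static void
spin_unlock(spin_mutex_t *m, unsigned id)
{
    unsigned  expected = id;

    atomic_compare_exchange_strong(&m->lock, &expected, 0);
}
```

With cpu_pause() defined as empty, the inner loop has no side effects and
the compiler is free to remove it, so the back-off disappears. With a
pause hint the delay stays, and whether that shows up as throughput,
power, or neither depends on the workload and the level of contention.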
> 2.
>
> For code alignment question, I tried -falign-{functions,jumps}=64.
> ministat says no diff.
>
> x Baseline
> + BaselinewAlign
> +----------------------------------------------------------------------+
> | xx* |
> |+ x + + x+ *x* ++ x+ ++*+ x x + x x|
> | |_______M______A_______________| |
> | |_____________AM____________| |
> +----------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  15        129548        131751        130154        130442     622.46629
> +  15        129000        131376        130306        130273     551.93064
> No difference proven at 95.0% confidence
This might indicate you've measured some other effect, and not the
alignment. Also, it might be worth checking in the compiled result
that the alignment is actually applied.
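One possible way to sanity-check function alignment is sketched below,
assuming GCC or Clang. The names hot_function() and func_misalignment()
are hypothetical; the aligned attribute here only stands in for what
-falign-functions=64 is expected to do across the whole binary. Looking
at symbol addresses in the compiled binary (for example with nm or a
disassembler) gives the same information without changing the code, and
is the only option for jump targets, which this sketch does not cover.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function, forced to a 64-byte boundary for the example. */
__attribute__((aligned(64), noinline))
static int
hot_function(int x)
{
    return x * 2 + 1;
}

/*
 * Return the offset of a function's entry point from the previous
 * `align`-byte boundary; 0 means the function is aligned as requested.
 * Converting a function pointer to an integer is implementation-defined,
 * but it yields the entry address on common x86-64 and aarch64 ABIs.
 */
static unsigned
func_misalignment(void (*fn)(void), unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */

    return (unsigned) ((uintptr_t) fn & (align - 1));
}
```

For example, func_misalignment((void (*)(void)) hot_function, 64) is
expected to return 0 here; a non-zero value for functions in a build
that used -falign-functions=64 would suggest the option did not take
effect.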
(Note that the text/plain part of your message contains garbled text;
I've restored the above quote manually from the text/html part.
It might be worth switching to plain text in your mail client for
further messages here.)
> 3.
>
> ministat for comparing blank ngx_cpu_pause() to ISB & YIELD (no memory
> clobber).
>
> Ministat says significant difference. I have seen it where ISB returns
> like ~10% +/- ~2%; however, I'm going to discount that as cloud
> variation/noise. A "lucky run".
>
> That said, it sounds like this is some kind of side effect of adding
> this into the binary as you mentioned previously. This diff is oddly
> consistent though, or at least oddly consistent dumb luck.
>
> x Baseline
> + ISB
> * YIELD
> +--------------------------------------------------------------------------------+
> | xxx * + + + |
> |x + x xxx x ** *xx *** * x **** *+ + * + * + +|
> | |______M____A___________| |
> | |______________MA_______________| |
> | |_________A__M_______| |
> +--------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  15        129548        131751        130154        130442     622.46629
> +  15     129778.64     133639.52      132108.5     132135.41     844.66065
> Difference at 95.0% confidence
>       1693.41 +/- 554.832
>       1.29821% +/- 0.425348%
>       (Student's t, pooled s = 741.929)
> *  15        130679        132621        131596     131486.47     540.21198
> Difference at 95.0% confidence
>       1044.47 +/- 435.826
>       0.800713% +/- 0.334115%
>       (Student's t, pooled s = 582.792)
That's without any caching being used, that is, basically just a
result of slightly different compilation, correct?
This might be seen as a reference point of how slightly different
compilation can affect performance. We've previously seen
cases of 2-3% performance improvement observed as a result of a
nop change, and these results seem to be in line.
Tuning compilation to ensure there is no difference here might be
the way to go.
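For reference, the intervals in the quoted ministat output can be
reproduced from the summary rows alone. The sketch below assumes the
usual pooled-variance Student's t comparison; the names sample_t,
newton_sqrt(), pooled_stddev() and diff_halfwidth() are hypothetical,
and the critical value (2.048 for 28 degrees of freedom at 95%) has to
be supplied by the caller. The square root is computed by hand only to
keep the example free of a libm dependency.

```c
#include <assert.h>

/* Summary of one ministat data set: N, Avg and Stddev columns. */
typedef struct {
    unsigned  n;
    double    avg;
    double    stddev;
} sample_t;

/* Square root by Newton's iteration, to avoid linking against libm. */
static double
newton_sqrt(double v)
{
    double  x;
    int     i;

    assert(v >= 0.0);

    if (v == 0.0) {
        return 0.0;
    }

    x = (v > 1.0) ? v : 1.0;

    for (i = 0; i < 100; i++) {
        x = 0.5 * (x + v / x);
    }

    return x;
}

/* Pooled standard deviation of two samples. */
static double
pooled_stddev(const sample_t *a, const sample_t *b)
{
    double  va, vb;

    assert(a->n + b->n > 2);

    va = (a->n - 1) * a->stddev * a->stddev;
    vb = (b->n - 1) * b->stddev * b->stddev;

    return newton_sqrt((va + vb) / (a->n + b->n - 2));
}

/*
 * Half-width of the confidence interval for the difference of means,
 * given the Student's t critical value for n_a + n_b - 2 degrees of
 * freedom.
 */
static double
diff_halfwidth(const sample_t *a, const sample_t *b, double t_crit)
{
    return t_crit * pooled_stddev(a, b)
           * newton_sqrt(1.0 / a->n + 1.0 / b->n);
}
```

With the Baseline row (N 15, Avg 130442, Stddev 622.46629) and the ISB
row (N 15, Avg 132135.41, Stddev 844.66065), this gives a pooled s of
about 741.93 and a half-width of about 554.8 around the 1693.41
difference, in line with the quoted output. It says nothing about
whether the difference comes from the pause instruction or from the
compilation being slightly different.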
--
Maxim Dounin
http://mdounin.ru/