[PATCH] Optimal performance when use http non-persistent connection

Shaokun Zhang zhangshaokun at hisilicon.com
Mon Dec 2 03:22:33 UTC 2019


Hi,

Apologies for the late reply.

On 2019/12/2 9:31, sunrui wrote:
> 
> 
> -----Original Message-----
> From: Maxim Dounin [mailto:mdounin at mdounin.ru]
> Sent: November 21, 2019 23:22
> To: nginx-devel at nginx.org
> Cc: sunrui <sunrui26 at huawei.com>
> Subject: Re: [PATCH] Optimal performance when use http non-persistent connection
> 
> Hello!
> 
> On Thu, Nov 21, 2019 at 07:22:16PM +0800, Shaokun Zhang wrote:
> 
>> Hi Maxim,
>>
>> On 2019/11/20 22:29, Maxim Dounin wrote:
>>> Hello!
>>>
>>> On Mon, Nov 11, 2019 at 03:07:02AM +0000, Zhangshaokun wrote:
>>>
>>>> # HG changeset patch
>>>> # User Rui Sun <sunrui26 at huawei.com<mailto:sunrui26 at huawei.com>>
>>>> # Date 1572848389 -28800
>>>> #      Mon Nov 04 14:19:49 2019 +0800
>>>> # Branch local
>>>> # Node ID a5ae6e9e99f747fcb45082bac8795622938184f1
>>>> # Parent  89adf49fe76ada86d84e2af8f5cee9ca8c3dca19
>>>> Optimal performance when use http non-persistent connection
>>>>
>>>> diff -r 89adf49fe76a -r a5ae6e9e99f7 src/core/ngx_cycle.c
>>>> --- a/src/core/ngx_cycle.c        Mon Oct 21 20:22:30 2019 +0300
>>>> +++ b/src/core/ngx_cycle.c     Mon Nov 04 14:19:49 2019 +0800
>>>> @@ -35,6 +35,40 @@
>>>> /* STUB */
>>>>
>>>>
>>>> +void
>>>> +ngx_change_pid_core(ngx_cycle_t *cycle) {
>>>> +    ngx_pid_t           setpid;
>>>> +    ngx_cpuset_t        *setaffinity=NULL;
>>>> +    setpid = ngx_getpid();
>>>> +    {
>>>> +#if (NGX_HAVE_CPU_AFFINITY)
>>>> +        ngx_core_conf_t  *ccf;
>>>> +
>>>> +        ccf = (ngx_core_conf_t *) ngx_get_conf(cycle->conf_ctx, ngx_core_module);
>>>> +
>>>> +        if (ccf->cpu_affinity == NULL) {
>>>> +            setaffinity = NULL;
>>>> +        }
>>>> +
>>>> +        if (ccf->cpu_affinity_auto) {
>>>> +           setaffinity = NULL;
>>>> +        }
>>>> +
>>>> +        setaffinity = &ccf->cpu_affinity[0];
>>>> +
>>>> +#else
>>>> +
>>>> +        setaffinity = NULL;
>>>> +
>>>> +#endif
>>>> +    }
>>>> +
>>>> +    if (setaffinity)
>>>> +           // set new mask
>>>> +        sched_setaffinity(setpid, sizeof(ngx_cpuset_t), setaffinity);
>>>> +}
>>>> +
>>>> ngx_cycle_t *
>>>> ngx_init_cycle(ngx_cycle_t *old_cycle)
>>>> {
>>>> @@ -278,6 +312,8 @@
>>>>          return NULL;
>>>>      }
>>>>
>>>> +    ngx_change_pid_core(cycle);
>>>> +
>>>>      if (ngx_test_config && !ngx_quiet_mode) {
>>>>          ngx_log_stderr(0, "the configuration file %s syntax is ok",
>>>>                         cycle->conf_file.data);
>>>>
>>>
>>> Sorry, but it is not clear what you are trying to achieve with this 
>>> patch.  You may want to provide more details.
>>>
>>
>> We tested nginx on Kunpeng 920, which has 2 chips and 2 NUMA nodes per
>> chip, using 32 cores spread over 2 different NUMA nodes. When nginx
>> starts, the core the master process runs on is undefined. When the
>> master's core and the workers' cores are on the same chip, the
>> performance with non-persistent connections is about 171,700 reqs/s,
>> but when the master's core and the workers' cores are on different
>> chips, it is only about 129,600 reqs/s. Now, when nginx starts, we
>> migrate the master process according to the first worker process's CPU
>> affinity. The results are as follows:
>>
>>                                                              | default | optimized
>>   master and workers on the same chip when nginx starts      |  171699 |    176020
>>   master and workers on different chips when nginx starts    |  129639 |    180637
> 
> Ok, so you are trying to bind the master process to the same core the first worker process runs on.  Presumably, this can be beneficial from a performance point of view in configurations with a small number of worker processes, as various structures allocated by the master process after parsing the configuration will be allocated from the same NUMA region the worker process runs on.
> Correct?
> 

Yes, that's correct.

> So the following questions are:
> 
> 0. What units of measurement do the numbers use?  Connections per second?  What are the error margins?
> 

The unit is reqs/s; connections run for several seconds and we calculate the average value. The error margin is 3%.

> 1. How did you test it?  Given that many configuration structures are allocated by the master process during configuration parsing, the numbers look strange.  I would expect performance with the master and worker processes on different chips to be smaller than that on the same chip, even with the patch applied.
> Well, with error margins we'll probably see there is no difference between 176020 and 180637, but this brings another question: where does the difference between 129639 and 180637 come from?  Listening sockets created by the kernel on the same chip?  So this probably means we shouldn't bind worker processes in general, but rather create listening sockets on the same chip instead?  Note this is not the same, especially with reuseport, not to mention this cannot be done at all when we inherit listening sockets from previous configurations.
> 

With the error margins taken into account, we consider there to be no difference between 176020 and 180637.  The difference between 129639 and 180637 is caused by cross-chip traffic between the listening sockets and the worker processes.
We bind the worker processes because we sometimes run several nginx instances at the same time, and binding lets the worker processes match the network initialization, which improves performance.  If we don't bind the workers, the patch has no effect.
When reuseport is enabled, the loss does not exist, with or without the patch; it does not affect the result.
When we inherit listening sockets from a previous configuration, the patch again has no effect unless we bind the worker processes; if we do bind them, we should bind them to the same chip as the listening sockets inherited from the previous configuration.
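
To make the two situations concrete, they roughly correspond to configuration fragments like the following (an illustrative sketch only, not our exact test configuration; the affinity masks and listen port are made up):

    # workers pinned to cores of one chip (mask values are examples)
    worker_processes     2;
    worker_cpu_affinity  0001 0010;

    events {
        worker_connections  1024;
    }

    http {
        server {
            # with "reuseport" each worker gets its own listening socket,
            # so the cross-chip loss described above does not appear
            listen  80 reuseport;
        }
    }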

> 2. What happens when there are multiple worker processes?  Will this change still be beneficial, or negative, or neutral?  Don't you think the case you are trying to optimize is too narrow to care about?
> 

Do you mean several nginx instances, or one master with several worker processes?  If you mean one master with several worker processes: we tested with 32 cores and one worker process per core, so our test scenario is 1 master with 32 worker processes.
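
For reference, the worker-related part of a setup like ours could look roughly like this (a sketch, not the exact configuration we used; the real setup presumably listed 32 explicit per-core masks, and keepalive_timeout 0 is just one way to force non-persistent connections):

    worker_processes     32;
    # each worker bound to its own core; "auto" is shown here only to keep
    # the sketch short instead of 32 explicit CPU masks
    worker_cpu_affinity  auto;

    http {
        # disable keep-alive so every request uses a new connection
        keepalive_timeout  0;
    }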

> 3. In nginx, there are platform-independent functions
> ngx_get_cpu_affinity() and ngx_setaffinity() to work with CPU affinity.  Why are you not using them in your patch?
> Additionally, why are you not trying to bind the master process to a particular CPU with "worker_cpu_affinity auto;"?

ngx_get_cpu_affinity() and ngx_setaffinity() depend on the variable named "ngx_cycle", and we assign a value to it only after ngx_init_cycle() has finished, so I can't use them here.
The loss only appears in the situation where the workers are concentrated on one chip while the master and the workers are on different chips; it is caused by the cross-chip latency.  When the master and workers are randomly distributed, the result should be similar no matter which chip the master is on, and the migration is not needed.
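
That said, if the mask is taken from the ccf of the cycle currently being initialized (as the patch already does), the raw sched_setaffinity() call could probably be replaced with the portable wrapper.  A rough sketch, assuming ngx_setaffinity(mask, log) may be called from the master process at this point (illustrative only, not the submitted patch):

    #include <ngx_config.h>
    #include <ngx_core.h>

    /* Illustrative variant of ngx_change_pid_core(): bind the master process
     * to the first worker's configured mask using the portable
     * ngx_setaffinity() wrapper instead of calling sched_setaffinity()
     * directly.  The mask is read from the cycle being built, so the global
     * ngx_cycle is not needed. */

    void
    ngx_change_pid_core(ngx_cycle_t *cycle)
    {
    #if (NGX_HAVE_CPU_AFFINITY)
        ngx_core_conf_t  *ccf;

        ccf = (ngx_core_conf_t *) ngx_get_conf(cycle->conf_ctx, ngx_core_module);

        if (ccf->cpu_affinity == NULL || ccf->cpu_affinity_auto) {
            /* no explicit mask configured: leave the master where it is */
            return;
        }

        /* ngx_setaffinity() applies the given mask to the calling process */
        ngx_setaffinity(&ccf->cpu_affinity[0], cycle->log);
    #else
        (void) cycle;
    #endif
    }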

Thanks,

> 
> --
> Maxim Dounin
> http://mdounin.ru/
> 


