high "Load Average"

Tue Mar 16 18:53:28 MSK 2010

Stefan Parvu Wrote:
-------------------------------------------------------
> I meant nicstat. Works on Linux, Solaris.
> Probable worth of considering porting this to FreeBSD too.

here are latest results from netstat and nicstat tools

netstat -ant | grep ESTABLISHED | egrep ':80\>' | wc -l
1157

We can say 1157 HTTP sessions are established (and may be kept alive)

nicstat

    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:02       lo    0.33    0.33   290.8   290.8    1.16    1.16  0.00   0.00
10:24:02     eth0    0.29    0.79  1183.8  1448.1    0.25    0.56  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:03       lo   153.7   153.7   338.6   338.6   464.7   464.7  0.00   0.00
10:24:03     eth0  1070.9   768.7  3599.4  3987.9   304.7   197.4  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:04       lo   65.22   65.22   191.9   191.9   348.1   348.1  0.00   0.00
10:24:04     eth0   932.5   691.3  3250.8  3437.7   293.7   205.9  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:05       lo   522.5   522.5   687.5   687.5   778.2   778.2  0.00   0.00
10:24:05     eth0  1075.6   988.8  3758.0  4308.4   293.1   235.0  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:06       lo   235.0   235.0   551.9   551.9   436.0   436.0  0.00   0.00
10:24:06     eth0   855.4   857.6  3476.3  3761.2   252.0   233.5  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:07       lo   135.3   135.3   351.9   351.9   393.6   393.6  0.00   0.00
10:24:07     eth0  1021.0  1025.4  3970.6  4591.4   263.3   228.7  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:08       lo   168.1   168.1   492.3   492.3   349.6   349.6  0.00   0.00
10:24:08     eth0   793.6   864.8  3340.2  3875.5   243.3   228.5  0.00   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
10:24:09       lo   32.88   32.88   78.97   78.97   426.4   426.4  0.00   0.00
10:24:09     eth0   551.1   462.1  2416.2  2461.1   233.6   192.3  0.00   0.00

> Good is to have latest patches applied etc. To me
> this is a good candidate for  DTrace (Solaris) or SystemTap (linux).
getting acquainted with SystemTap, looks like not "5 minutes to learn" system

Cliff Wells Wrote:
-------------------------------------------------------
I've arranged answers regarding to questions grouped by "theme"

> Well, all it means is that the bottleneck lies outside PHP.   
> I wouldn't read too much else into it.   
> It just means that 99% of your slowness is elsewhere.

> I cannot possibly fathom how the length of T doesn't matter.   
> To me that is probably the single most important data point to investigate.
> The entire response time directly depends on T and you say it "doesn't matter"?

> Why do you think machine A is slow?   
> Machine A waits for machine B and you blame machine A.   
> I don't see how you arrive at this conclusion.
> To me it seems it could be either machine (or the connection between the machines) 
> and more testing needs to be done to arrive at such a conclusion.   

Combined answer on all these questions. Looks like we are talking about not the same thing in here. You are talking about response time for network request issued by user. It really consists of several stages of processing in here. Like
1. HTTP request handled by nginx
2. data processing by PHP (provided via CGI interface with php-fpm in my case)
3. external DB request processing (aka T in our discussion)
4. response to user
and in this case I totally agree with you - time T does matter and directly influences response time.

What I am talking about is a little bit different. In peak hours response time degrades significantly, but is still more or less acceptable, but what is unacceptable is that machine A slows down and replies for external actions (like SSH login, VPN connection) very slowly. For example, I sometimes even can't establish VPN connection to it due to timeouts. (there is openvpn server running on it). That's why I am talking about "slow machine A" and blame it. That's why I am worried about "uninterruptible sleep" processes and thinking about scheduling lag

> I'd be curious to see the output of iotop on your
> MySQL server as well.

here it is:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
23144 be/4 mysql       0.00 B/s   19.10 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
  880 be/4 mysql       0.00 B/s   11.46 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 1247 be/4 mysql       0.00 B/s   11.46 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 1287 be/4 mysql       0.00 B/s   53.49 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 1305 be/4 mysql       0.00 B/s   22.92 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 7929 be/4 mysql       0.00 B/s   15.28 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
30493 be/4 mysql       0.00 B/s   15.28 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
10222 be/4 mysql       0.00 B/s    7.64 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init [3]
    2 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % 
    3 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % 

and another sample

Total DISK READ: 0.00 B/s | Total DISK WRITE: 107.53 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  709 be/3 root        0.00 B/s   55.62 K/s  0.00 % 45.85 % 
 4129 be/4 mysql       0.00 B/s    3.71 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 4147 be/4 mysql       0.00 B/s    3.71 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 4233 be/4 mysql       0.00 B/s    7.42 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 4273 be/4 mysql       0.00 B/s    7.42 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 6587 be/4 mysql       0.00 B/s   14.83 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
15008 be/4 mysql       0.00 B/s    3.71 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock
 7449 be/4 mysql       0.00 B/s   11.12 K/s  0.00 %  0.00 % mysqld --basedir=/usr~/lib/mysql/mysql.sock

looks like nothing special for me. MySQL is running, not much I/O

>> May be it is something related to task scheduling? Is big number of sleeping process impacts performance and/or slows down scheduler?

> Not since kernel 2.6, AFAIK.   How many processes are we talking about here?
nginx worker_processes  4
php-pfm max_children 1000 total

and system processes of course, but nothing much

> Maybe.   Or maybe you've simply disguised the
> problem by throwing more processes at it.   How many PHP processes are you
> running?   Can you provide your php-fpm parameters?   Also, what's an
> approximation of your requests per second during these peak times?

2 pools with 500 php-fpm children each and something like 1000-2000 concurrent HTTP sessions

> As an aside, one thing that you mentioned earlier that I was wondering
> about: what is writing to the local disk at 3.27MB/s (from iotop output)?

Total DISK READ: 0.00 B/s | Total DISK WRITE: 3.56 M/s
  TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
32722 be/4 nginx       0.00 B/s  207.89 K/s  0.00 %  0.00 % nginx: worker process
22705 be/4 nobody      0.00 B/s  107.30 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp
22235 be/4 nobody      0.00 B/s   83.83 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp
21203 be/4 nobody      0.00 B/s   57.00 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp
32719 be/4 nginx       0.00 B/s   50.30 K/s  0.00 % 55.19 % nginx: worker process
32718 be/4 nginx       0.00 B/s   10.06 K/s  0.00 %  0.00 % nginx: worker process
21195 be/4 nobody      0.00 B/s    3.35 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp
21327 be/4 nobody      0.00 B/s    3.35 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp
22189 be/4 nobody      0.00 B/s    3.35 K/s  0.00 %  0.00 % php-cgi --fpm --fpm-config /etc/php-fp

php-cgi and nginx are writing. php is writing some processed data and logs. Anyway, is 3.5 M/s is not a big deal for SCSI disk, isn't it?

By the way here: http://www.mysqlperformanceblog.com/2006/12/04/using-loadavg-for-performance-optimization/ is couple of words about slowing down systems with active network IO.

Best regards,
Sessna

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,63176,64417#msg-64417