Hi all,
The Wikimedia Foundation has been running nginx-1.9.3 patched for
multi-certificate support for a few weeks now without incident,
handling all production TLS traffic for inbound requests to Wikipedia
and the Foundation's other associated projects.
We initially used the older March variant of Filipe's patches (
http://mailman.nginx.org/pipermail/nginx-devel/2015-March/006734.html
), and last week we switched to using the April 27 variant (
http://mailman.nginx.org/pipermail/nginx-devel/2015-April/006863.html
), which is the last known public variant I'm aware of.
These were in turn based on kyprizel's patch (
http://mailman.nginx.org/pipermail/nginx-devel/2015-March/006668.html
), which was based on Rob's patch from nearly two years ago (
http://mailman.nginx.org/pipermail/nginx-devel/2013-October/004376.html
). It has a long and colorful history at this point :)
We've forward-ported Filipe's Apr 27 variant onto Debian's 1.9.3-1
package. Most of the porting was trivial (offsets / whitespace /
etc). There were a couple of slightly more substantial issues around
the newer OCSP Stapling valid-timestamp checking, and the porting of
the general multi-cert work to the newer stream modules. The
ported/updated variant of the patches we're running is available here
in our repo:
https://github.com/wikimedia/operations-software-nginx/blob/wmf-1.9.3-1/deb…
Our configuration uses a pair of otherwise-identical RSA and ECDSA
keys and an external OCSP ssl_stapling_file (certs are from
GlobalSign, chain/OCSP info is identical in the pair). Our typical
relevant config fragment in the server section looks like this:
------------
ssl_certificate /etc/ssl/localcerts/ecc-uni.wikimedia.org.chained.crt;
ssl_certificate_key /etc/ssl/private/ecc-uni.wikimedia.org.key;
ssl_certificate /etc/ssl/localcerts/uni.wikimedia.org.chained.crt;
ssl_certificate_key /etc/ssl/private/uni.wikimedia.org.key;
ssl_stapling on;
ssl_stapling_file /var/cache/ocsp/unified.ocsp;
-------------
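For anyone wanting a similar setup: the external stapling file referenced
above can be produced with something along these lines (a hypothetical
sketch only; the issuer path and responder URL are placeholders, not our
actual values), refreshed periodically from cron or similar:
------------
# Hypothetical sketch: fetch a DER-encoded OCSP response for the
# certificate and write it to the file named by ssl_stapling_file.
# The issuer path and responder URL below are placeholders.
openssl ocsp -noverify \
    -issuer /etc/ssl/localcerts/issuer-chain.pem \
    -cert /etc/ssl/localcerts/uni.wikimedia.org.chained.crt \
    -url http://ocsp.example-ca.invalid/ \
    -respout /var/cache/ocsp/unified.ocsp
------------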
Obviously, we'd rather get this work (or something similar) upstreamed
so that we don't have to maintain local patches indefinitely, and so
that everyone else can use it easily too. I'm assuming it wasn't merged
in the past because there were other issues blocking the merge that
simply weren't relevant to our particular configuration, or because of
matters of cleanliness or implementation detail.
I'd be happy to work with whoever is interested on resolving that and
getting this patchset into a mergeable state. Does anyone know what the
outstanding issues were/are? Some of the past list traffic on this is
a bit fragmented.
Thanks,
-- Brandon
Hello!
A couple of years ago, I reported the following bug:
http://mailman.nginx.org/pipermail/nginx-devel/2013-October/004442.html
Responses with an empty body and a "Content-Encoding: gzip" header used to cause requests to hang.
A fix was made, but now it seems that such requests simply fail.
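For context, a minimal reproduction looks roughly like this (a sketch; the
location name and backend address are made up), with a backend that answers
200 with a "Content-Encoding: gzip" header and a zero-length body, and a
client that does not send "Accept-Encoding: gzip":
------------
# Reproduction sketch: the backend at the placeholder address returns
# an empty body with "Content-Encoding: gzip"; gunzip then has to
# decompress it for the non-gzip-capable client.
location /empty-gzip/ {
    gunzip on;
    proxy_pass http://127.0.0.1:8081;
}
------------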
Reviewing the code, it appears that the following happens:
- An empty last buffer arrives at the gunzip module's body filter.
- The gunzip module's ngx_http_gunzip_filter_add_data() calculates an input buffer size of 0, which is later fed to zlib's inflate() along with the parameter Z_FINISH.
- inflate() returns Z_BUF_ERROR. This causes error handling to shut down the request and the connection, and the client gets an empty response.
I'm not sure what a proper fix would be, but I can suggest the following (a rough sketch follows after the list):
1. In ngx_http_gunzip_header_filter(), check the content length and don't create a gunzip ctx if it is 0.
2. In ngx_http_gunzip_body_filter(), check whether gunzip has started ("!ctx->started"). If it hasn't and the input buffer is the last one, simply pass it on to the next filter. This handles the case where the response uses chunked encoding.
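To make the suggestions concrete, here is a rough, untested sketch of the kind of checks I mean (field and function names follow the stock gunzip module; this is illustration only, not a proper patch):
------------
/* (1) In ngx_http_gunzip_header_filter(): skip installing a gunzip
 *     ctx when the response body is known to be empty. */
if (r->headers_out.content_length_n == 0) {
    return ngx_http_next_header_filter(r);
}

/* (2) In ngx_http_gunzip_body_filter(): if decompression has not
 *     started and the incoming chain only carries an empty last
 *     buffer, pass it straight to the next filter (covers the
 *     chunked case with no Content-Length). */
if (!ctx->started && in != NULL
    && in->buf->last_buf && in->buf->pos == in->buf->last)
{
    return ngx_http_next_body_filter(r, in);
}
------------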
It would be great to hear the development team's opinion.
Best regards,
Aviram
There was a thread on the nginx mailing list last week, regarding upstream keepalive connections being placed in an invalid state due to a partially-transmitted request body. With regard to that discussion, I’m submitting two patches for your review.
The first adds a test case to nginx-tests demonstrating the problem as of nginx 1.9.7. Most of the change involves extending the mock origin to consume a request body, and verify the method transmitted. Currently, nginx will reuse the upstream connection for a subsequent request and (from the point of view of an upstream client) insert some or all of a request line and headers into the previous request's body. The result is typically a 400 Bad Request error due to a malformed request.
The second patch fixes this bug using the method suggested by Maxim, i.e. close the upstream connection when a response is received before the request body is completely sent. This is the behaviour suggested in RFC 2616 section 8.2.2. The relevant Trac issue is #669.
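For reference, the scenario involves upstream keepalive connections configured along these lines (a generic sketch, not the exact configuration from the original thread; names and addresses are placeholders):
------------
# Generic upstream keepalive setup of the kind affected.
upstream backend {
    server 127.0.0.1:8081;
    keepalive 8;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
------------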
details: http://hg.nginx.org/nginx/rev/7e241b36819d
branches:
changeset: 6308:7e241b36819d
user: Ruslan Ermilov <ru(a)nginx.com>
date: Mon Nov 30 12:04:29 2015 +0300
description:
Configure: improved workaround for system perl on OS X.
The workaround from baf2816d556d stopped working because the order of
"-arch x86_64" and "-arch i386" has changed.
diffstat:
auto/lib/perl/conf | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diffs (12 lines):
diff -r 2c8874c22073 -r 7e241b36819d auto/lib/perl/conf
--- a/auto/lib/perl/conf Mon Nov 30 19:01:53 2015 +0300
+++ b/auto/lib/perl/conf Mon Nov 30 12:04:29 2015 +0300
@@ -57,7 +57,7 @@ if test -n "$NGX_PERL_VER"; then
if [ "$NGX_SYSTEM" = "Darwin" ]; then
# OS X system perl wants to link universal binaries
ngx_perl_ldopts=`echo $ngx_perl_ldopts \
- | sed -e 's/-arch x86_64 -arch i386//'`
+ | sed -e 's/-arch i386//' -e 's/-arch x86_64//'`
fi
CORE_LINK="$CORE_LINK $ngx_perl_ldopts"
Hello,
On a system with a load of about 500-600 URIs/sec I see some unexpected
behaviour when using the "aio threads" option in the configuration.
System setup:
The system runs on RHEL 6.6 with 3 workers running nginx 1.9.6 built
with thread support. Content is cached and populated by a proxied
upstream. The cache location is a tmpfs file system with more than
enough space at all times. The proxy buffer size is 8k. The output
buffers are at the default (no config item, so 2 x 32k). Keepalive
timeout is 75s. Sendfile is enabled.
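For reference, the relevant parts of the configuration look roughly like
this (a reconstruction from the description above; the cache path, zone
name and sizes are placeholders):
------------
# Reconstruction of the described setup; path, zone name and sizes
# are placeholders, the directives themselves match the description.
worker_processes 3;

http {
    proxy_cache_path /dev/shm/cache levels=1:2 keys_zone=cache:100m;  # tmpfs
    sendfile on;
    keepalive_timeout 75s;
    aio threads;

    server {
        location / {
            proxy_pass http://192.0.2.1;
            proxy_cache cache;
            proxy_buffer_size 8k;
            # output_buffers left at the default (2 32k)
        }
    }
}
------------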
Seen behaviour:
On the WAF in front of this system I see occasional hangs on resources
(mainly larger files like js, jpeg, ...). The WAF log shows that the
WAF waits for the transfer to complete until nginx closes the
connection at the keepalive timeout of 75s. In the nginx access.log I
see the entry served from cache (upstream server '-') with the correct
content length. In the tcpdump I see that the response to this call
contains a Content-Length header with the correct length and a server
date header more than 1 minute older than the tcpdump timestamp (all
servers are NTP-synced). The served jpeg is half-way through its cache
lifetime at that time, and previously served entries from cache show no
incomplete transfers. In the tcpdump the jpeg file starts to differ
from the original after 32168 bytes and is missing 8192 bytes, after
which the remaining content (identical to the original) is served. From
the tcpdump I can extract the file, which is missing the 8192 bytes.
We also have a dump where this same behaviour was seen during the
proxied call. The upstream call is started to get a jpeg from the
origin. After a few packets the data is sent on to the WAF. The
complete upstream file is retrieved (the tcpdump confirms that the jpeg
is complete and correctly retrieved), but not all of the data is sent
on the socket towards the WAF.
If I change the setup to "aio on" or "aio off" this behaviour is not
seen. This is the only change in the configuration between the tests.
It looks like this behaviour only affects bigger files. I have not seen
this effect on small files or proxied responses.
Does anyone have the same experience with this option? And what is the
best way to proceed in tracing this?
Regards,
B.
Hi all,
I couldn't find anything in the mailing list about this issue; surely
we are not the only ones?
When activating reuseport I am seeing all requests being served by a
single nginx process. All others are just idling (SIGALRM interruption
of epoll_wait / epoll_wait timeout, according to strace).
Process 442 attached - interrupt to quit
epoll_wait(60, 8225010, 512, 4294967295) = -1 EINTR (Interrupted system
call)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0xe) = -1 EINTR (Interrupted system call)
epoll_wait(60, 8225010, 512, 4294967295) = -1 EINTR (Interrupted system
call)
--- SIGALRM (Alarm clock) @ 0 (0) ---
This only occurs with reuseport, as soon as it is disabled the load is
correctly distributed again.
Configuration:
worker_processes 12; # 2x8 cores on server
multiple server blocks on different IPs and ports with reuseport (see the sketch below).
Linux kernel: 3.18.20
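For completeness, each server block has a listen directive of roughly this
shape (address, port and server_name are placeholders; each ip:port pair
carries reuseport on exactly one listen directive):
------------
# Placeholder example of the listen lines in use.
server {
    listen 192.0.2.10:80 reuseport;
    server_name example.invalid;
}
------------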
The server NIC has interrupts spread over all cores:
# sudo ethtool -S eth0 |grep rx | grep pack
rx_packets: 11244443305
rx_queue_0_packets: 1381842455
rx_queue_1_packets: 1373383493
rx_queue_2_packets: 1490287703
rx_queue_3_packets: 1440591930
rx_queue_4_packets: 1378550073
rx_queue_5_packets: 1373473609
rx_queue_6_packets: 1437806438
We have also experimented with disabling iptables and anything else on the
server that could be interfering. I have also loaded the same setup onto
three other fresh servers with the same kernel (same OS image), but with
different NIC cards (with and without multiple rx queues), with no change
in behaviour.
This has me stumped. Ideas?
Regards,
Mathew