OK, I have an incredibly weird nginx connection issue.
I have a cluster of boxes that are responsible for terminating SSL requests and passing them to a local haproxy instance for further routing. I have corosync/pacemaker set up to manage the IP addresses and fail over instances if there’s an issue.
This server has been running fine for a long time, but we recently had to reboot because of the GHOST stuff. Before we did that, we did an apt-get upgrade to get to the latest Debian Wheezy packages, including a new nginx (1.6.2), openssl, kernel, and just about
After that happened, we started seeing connection issues to the nginx that does SSL termination. When it was happening, about 50% of our requests were timing out (iOS/Android clients). I was testing manually using curl while it was happening, and I saw huge fluctuations in the time it took to connect. A lot of connections timed out completely, and others took 1s, 3s, 15s, 30s, etc…
While this was happening to nginx, haproxy on the same box was unaffected. I tested that by curling it every second from a nearby box, logging the results, and checking them. So, it seemed to be just SSL with nginx.
Now that our peak load is down, it’s not as big an issue, but we are still seeing connection issues when I curl. There are fewer of them, and the slow connects are typically more like 1-3s. Since we’ve had some time to experiment, I’ve gathered more information that makes no sense to me.
Almost all the traffic was set up to go to the address managed by corosync. When I set up my curl tests to run every second against that address, I see the timeouts. So, I tried something. I bound the main IP address of the NIC to nginx, reloaded, and redid the same test, but pointed curl at the main IP address. As soon as I did that, my curl tests never saw a single issue: the connect phase never took more than 2ms, and there were no timeouts.
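For anyone who wants to reproduce this kind of check without curl, here is a rough Python sketch of a once-per-second probe that times the TCP connect and the TLS handshake separately. It is not the exact test I ran (that was plain curl); the function names and the 192.0.2.x addresses are placeholders I made up for illustration, and certificate verification is turned off because the probe targets a bare IP.

```python
import socket
import ssl
import time


def time_connect(ip, port=443, timeout=5.0, use_tls=True):
    """Return (tcp_seconds, tls_seconds).

    tcp_seconds is None if the TCP connect failed or timed out.
    tls_seconds is None if TLS was skipped or the handshake failed.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
    except OSError:
        # Covers refused connections and timeouts.
        return None, None
    tcp_seconds = time.monotonic() - start

    tls_seconds = None
    try:
        if use_tls:
            ctx = ssl.create_default_context()
            # Assumption: we probe by IP, so hostname/cert checks are disabled.
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            start = time.monotonic()
            # wrap_socket performs the handshake before returning.
            with ctx.wrap_socket(sock):
                tls_seconds = time.monotonic() - start
    except OSError:
        # ssl.SSLError and socket timeouts are both OSError subclasses.
        pass
    finally:
        sock.close()
    return tcp_seconds, tls_seconds


def probe_loop(ips, count=60, interval=1.0, port=443):
    """Probe every IP once per interval and print one line per sample."""
    for _ in range(count):
        for ip in ips:
            tcp, tls = time_connect(ip, port=port)
            tcp_txt = "FAILED" if tcp is None else "%.1fms" % (tcp * 1000)
            tls_txt = "n/a" if tls is None else "%.1fms" % (tls * 1000)
            print("%s %s tcp=%s tls=%s" % (time.strftime("%H:%M:%S"), ip, tcp_txt, tls_txt))
        time.sleep(interval)
```

Calling something like `probe_loop(["192.0.2.10", "192.0.2.11"])` with the corosync-managed address and the main NIC address side by side would show whether the delay is in the TCP connect or in the SSL handshake, and whether it follows the IP or the traffic.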
So, I started thinking it was the corosync IP, and I switched all our traffic to the main NIC IP address that had just tested fine. Once the normal traffic levels had moved over to the main NIC address, I started seeing curl timeouts on it, now that it had traffic. I then started curling the corosync IP that used to be primary, and now IT has no connection issues.
So, I have connection issues to nginx, but only on the IP address that takes the traffic. nginx on a different IP on the same NIC is fine. haproxy on the same NIC is fine.
What the heck? Struggling to think of anything I could tweak. This doesn’t make sense, but I have triple checked my info, and it’s legit.