Optimizing NGINX TLS Time To First Byte (TTTFB)

Thu Dec 19 23:55:05 UTC 2013

On Thu, Dec 19, 2013 at 2:51 AM, Anton Yuzhaninov <citrin at citrin.ru> wrote:

> On 12/19/13 04:50, Alex wrote:
>
>> I remember reading (I believe it was in your (excellent) book! ;)) that
>> upon packet loss, the full TLS record has to be retransmitted. Not cool
>> if the TLS record is large and fragmented. So that's indeed a good
>> reason to keep TLS records small and preferably within the size of a TCP
>> segment.
>>
>
> Why is a TCP retransmit of the single lost packet not enough (in the kernel
> TCP stack, which is unaware of TLS records)?
> The kernel on the receiver side should wait for the lost packet to be
> retransmitted and return data to the application in the same order it was sent.
>

Yep, there is no need to retransmit the record, just the lost packet. The
entire record is buffered on the client until all of its packets are
available; only then is the MAC verified and the contents decrypted and
passed to the application.
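
To make the buffering effect concrete, here is a minimal Python sketch of how much plaintext the TLS layer can release for a given amount of in-order TCP data. The 40-byte per-record overhead and the 1460-byte MSS are illustrative assumptions (real overhead depends on the cipher suite), and the function names are hypothetical.

```python
import math


def segments_per_record(record_wire_bytes: int, mss: int) -> int:
    """Number of TCP segments needed to carry one TLS record."""
    return math.ceil(record_wire_bytes / mss)


def deliverable_plaintext(in_order_bytes: int, record_payload: int,
                          overhead: int) -> int:
    """Plaintext the TLS layer can hand to the application.

    Only complete records can be MAC-verified and decrypted, so any
    partially received record contributes nothing yet.
    """
    wire = record_payload + overhead
    complete_records = in_order_bytes // wire
    return complete_records * record_payload


MSS = 1460        # assumed Ethernet-style MSS
OVERHEAD = 40     # assumed header + MAC + padding per record

# A 16 KB record needs 12 segments; all 12 must arrive before any
# byte of it is usable.
print(segments_per_record(16384 + OVERHEAD, MSS))

# After ten in-order segments (14600 bytes):
print(deliverable_plaintext(10 * MSS, 16384, OVERHEAD))  # 16 KB records
print(deliverable_plaintext(10 * MSS, 1400, OVERHEAD))   # ~1 segment records
```

With these assumptions the 16 KB record yields nothing after ten segments, while packet-sized records have already released 14000 bytes. A lost packet therefore stalls only the record it belongs to, but a large record makes that stall cover far more data.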

On Wed, Dec 18, 2013 at 4:50 PM, Alex <alex at zeitgeist.se> wrote:

> On 2013-12-19 01:04, Ilya Grigorik wrote:
>
>
> > FWIW, for these exact reasons the Google frontend servers have been using
> > TLS record = TCP segment for a few years now... So there is good precedent
> > for using this as a default.
>
> Yeah, about that. Google's implementation looks very nice. I keep
> looking at it in Wireshark and wonder if there is a way that I could
> replicate their implementation with my limited knowledge. It probably
> requires tuning of the underlying application as well? Google uses a
> 1470-byte frame size (14 bytes header plus 1456 bytes payload), with
> the TLS record fixed at ~ 1411 bytes. Not sure whether an MTU of 1470 /
> MSS of 1430 is of any benefit for TLS communication.
>
> They optimized the stack to almost always _exactly_ fit a TLS record
> into the available space of a TCP segment. If I look at one of my sites,
> https://www.zeitgeist.se, with standard MTU/MSS, and the TLS record size
> fixed to 1370 bytes + overhead, Nginx would happily use the remaining
> space in the TCP segment and add part of a second TLS record to it, the
> rest of which then spills into a second TCP segment. I played around
> with TCP_CORK (tcp_nopush), but it didn't seem to make any difference.
>

Right, I ran into the same issue when testing it on this end. The very
first record goes into the first packet, and then some extra (30~50) bytes
of the following record are packed into it. From there on, most records
span two packets. The difference with the GFEs is that they flush the
packet on each record boundary.
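
The drift described above can be reproduced with a small Python model of a byte stream cut into fixed-size segments. This is a simplification under stated assumptions: a 1460-byte MSS, 1370 bytes of payload plus an assumed 40 bytes of record overhead, and no TCP options, Nagle, or partial writes. The function name and the `flush_per_record` flag are hypothetical, not an nginx or OpenSSL setting.

```python
def spanning_records(record_wire_bytes: int, mss: int, n_records: int,
                     flush_per_record: bool) -> int:
    """Count records whose bytes land in more than one TCP segment."""
    spanning = 0
    for i in range(n_records):
        if flush_per_record:
            # Each record starts a fresh segment.
            first_seg, last_seg = 0, (record_wire_bytes - 1) // mss
        else:
            # Records are written back to back into one byte stream.
            start = i * record_wire_bytes
            end = start + record_wire_bytes - 1
            first_seg, last_seg = start // mss, end // mss
        if first_seg != last_seg:
            spanning += 1
    return spanning


MSS = 1460
RECORD = 1370 + 40   # payload + assumed overhead = 1410 bytes on the wire

# Back-to-back writes: the first segment has 50 bytes left over, so the
# second record starts there and every following record straddles a boundary.
print(spanning_records(RECORD, MSS, 10, flush_per_record=False))  # 9

# Flushing on each record boundary keeps every record in one segment.
print(spanning_records(RECORD, MSS, 10, flush_per_record=True))   # 0
```

In this model the 50 leftover bytes in the first segment match the "30~50 bytes" seen in the capture, and flushing per record trades a slightly short segment for records that never straddle a packet boundary.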

Perhaps some nginx gurus can help with this one? :-)

> > That said, small records do incur overhead due to extra framing, plus
> > more CPU cycles (more MACs and framing processing). So, in some instances,
> > if you're delivering large streams (e.g. video), you may want to use larger
> > records... Exposing record size as a configurable option would address this.
>
> Absolutely. Earlier I said Google uses a 1470-byte frame size, but that
> is not true, for example, when it comes to streaming from YouTube. There
> they use the standard MTU, and also large, fragmenting TLS records.

Actually, it should be even smarter: connection starts with small record
sizes to get fast time to first frame (exact same concerns as TTFB for
HTML), and then record size is increased as connection opens up. Not sure
if that's been officially rolled out 100%, but I do know that this was the
plan. The benefit here is there is no application tweaking required. I'd
love to see this in nginx as well.
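
A minimal Python sketch of that "start small, then grow" idea follows. It is not Google's or nginx's implementation: the class name, the byte threshold, and the idle reset are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecordSizer:
    """Pick a TLS record size per write: small early on, large later."""
    small: int = 1400                    # roughly one TCP segment (assumed)
    large: int = 16384                   # maximum TLS record payload
    boost_after_bytes: int = 1_000_000   # hypothetical switch-over point
    idle_reset_seconds: float = 1.0      # hypothetical idle timeout
    sent: int = 0
    last_send: float = 0.0

    def next_record_size(self, now: float) -> int:
        # After an idle period the congestion window may have collapsed,
        # so fall back to small records again.
        if self.sent and now - self.last_send > self.idle_reset_seconds:
            self.sent = 0
        return self.large if self.sent >= self.boost_after_bytes else self.small

    def on_sent(self, nbytes: int, now: float) -> None:
        self.sent += nbytes
        self.last_send = now


sizer = RecordSizer(boost_after_bytes=3000)  # tiny threshold for the demo
t = 0.0
for _ in range(4):
    size = sizer.next_record_size(t)
    print(t, size)
    sizer.on_sent(size, t)
    t += 0.1
```

The appeal of this design is that the application needs no tuning: short responses get packet-sized records for a fast first byte, and long streams move to large records once the connection has warmed up.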

On Thu, Dec 19, 2013 at 5:15 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> > In theory, I agree with you, but in practice even while trying to play with
> > this on my own server it appears to be more tricky than that: to ~reliably
> > avoid the CWND overflow I have to set the record size <10k. There are also
> > differences in how the CWND is increased (byte based vs packet based)
> > across different platforms, and other edge cases I'm surely overlooking.
> > Also, while this addresses the CWND overflow during slowstart, smaller
> > records offer additional benefits as they help minimize impact of
> > reordering and packet loss (not eliminate, but reduce its negative impact
> > in some cases).
>
> The problem is that there are even more edge cases with packet-sized
> records.  Also, in practice with packet-sized records there seems
> to be a significant difference in throughput.  In my limited testing
> packet-sized records resulted in a 2x slowdown on large responses.
> Of course the overhead may be somewhat reduced by applying smaller
> records deeper in the code, but a) even in theory, there is some
> overhead, and b) it doesn't look like a trivial task when using
> OpenSSL.  Additionally, there may be weird "Nagle vs. delayed ack"
> related effects on fast connections; this needs additional
> investigation.

> As of now, I tend to think that 4k (or 8k on systems with IW10)
> buffer size is optimal for latency-sensitive workloads.
>

If we assume that new systems are using IW10 (which I think is reasonable),
then an 8K default is a good / simple middle-ground.
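
The arithmetic behind the 4k / 8k numbers can be sketched in Python using the initial-window formulas from RFC 3390 and RFC 6928. The 1460-byte MSS and the 40-byte per-record overhead are assumptions, and the check ignores anything else competing for the first flight (handshake tail, response headers), which is one reason the practical limit quoted above ends up below 10k.

```python
def initial_window_rfc3390(mss: int) -> int:
    """Older initial congestion window, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_rfc6928(mss: int) -> int:
    """IW10 initial congestion window, in bytes."""
    return min(10 * mss, max(2 * mss, 14600))


def fits_first_flight(record_payload: int, overhead: int, iw_bytes: int) -> bool:
    """True if one whole record fits in the initial window."""
    return record_payload + overhead <= iw_bytes


MSS = 1460       # assumed
OVERHEAD = 40    # assumed per-record overhead

old_iw = initial_window_rfc3390(MSS)   # 4380 bytes
iw10 = initial_window_rfc6928(MSS)     # 14600 bytes

for payload in (4096, 8192, 16384):
    print(payload,
          fits_first_flight(payload, OVERHEAD, old_iw),
          fits_first_flight(payload, OVERHEAD, iw10))
```

Under these assumptions a 4k record fits the older window, an 8k record fits only with IW10, and a 16k record fits neither, so the client would have to wait at least one extra round trip before it can decrypt the first record.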

Alternatively, what are your thoughts on making this adjustment
dynamically? Start the connection with small record size, then bump it to
higher limit? In theory, that would also avoid the extra config flag.