Maybe you have heard of QUIC, a new transport protocol that combines functionalities of TCP and TLS, plus some new features like parallel handling of streams and latency minimization through 0-RTT. The work started at Google 5 years ago, as part of their effort to speed up the web. The IETF picked up the early work from Google, and is working through its standardization. There is a great deal of innovation happening there, and it is made possible by a series of design choices, such as running over UDP to cross NAT and firewalls, complete encryption of the transport headers to prevent interference from the middle of the network, and implementation in user space. It is all very cool, but there is a potential danger. Implementations can innovate and deploy new congestion control algorithms pretty much at will. That could trigger a race to the bottom, and potentially break the Internet, unless network providers start deploying adequate Active Queue Management algorithms.
Before 1988, the Internet was occasionally victim of congestion events. I was working at INRIA in France at the time, and some links in our early TCP-IP network were quite slow, X.25 virtual circuits that maxed out at 9600 bps. It worked some of the time, and then there would be a congestion event. Some node would start sending too fast, queues would build up, delays would increase and trigger retransmissions which would contribute to more queues, and very soon nothing useful could be done. This was eventually fixed in 1988, with the work of Lixia Zhang on adaptive timers and of Van Jacobson on congestion control algorithms – the algorithm that came to be known as “TCP Reno”.
The Internet grew a lot since 1988. The link speed quickly reached megabits per second, and then gigabits. Some people may complain that their connection is too slow, but we have not experienced anything like the congestion collapses of the early era. The traditional stance is that the Internet only remained stable because most of the traffic uses TCP and uses congestion control algorithms designed to back-off in the presence of congestion. The design of the TCP stacks tends to be very conservative. There were some updates in the 90’s to accommodate larger transmission speeds and more efficient retransmissions, but the congestion control logic remained very much the same. The only major change happened in the Linux kernel in 2006, with the replacement of “New Reno” by “Cubic” – 18 years after Van Jacobson’s design. Cubic does in fact keep a lot of the design principles of Reno, probing slowly for increased capacity and backing off promptly in case of congestion. This slow rate of change might soon change.
Change was slow in large part because the TCP stacks were shipped as part of the operating systems, and there never were very many operating systems. The operating systems developers are by nature conservative. If you are working on Windows, Linux or FreeBSD, you don’t want to ship a code update that might cause Internet congestion. Anything you do will have to pass multiple reviews, and deviation from standards would require a lot of justification. In fact, that conservative stance was pretty much one of the motivations for developing QUIC as an “application level” protocol. It runs over UDP, and the code is shipped as a library, compiled with the application. Google does not have to wait that Microsoft updates the TCP stack in Windows: it can ship a new version of the QUIC library in a update to the Chrome browser. Instead of getting an operating system update every year or so, the application can be updated every month, maybe even every day. The operating system developers acted as gatekeepers of transport innovation. Shipping the code with application bypasses these gatekeepers.
On one hand, removing the gatekeepers has immediate benefits, because enables development of new transport features. QUIC can for example deploy 0-RTT transmission, so that the first web request can be sent from client to server without waiting for complete establishment of the connection. It mitigates head of queue blocking by managing simultaneous transmission of multiple streams. Future updates may bring partial reliability or forward error correction. This increased deployment agility is most welcome. On the other hand, that same agility enables immediate deployment of “alternative” congestion control algorithms. Some of that may be good, such as designing algorithms that react better than New Reno or Cubic to the specific behavior of wireless links. But there is a risk that developers will abuse this newfound agility and engage in a race towards ever more aggressive behaviors. If everybody tunes their algorithms to be a little faster than the competition, the net result could well be a return of congestion collapses, just like we saw in the pre-1988 Internet.
In fact, Google already started shipping QUIC with a new congestion control algorithm called BBR. BBR operates by tracking the end to end delay and trying to maintain is small, backing off when an increase in sending rate creates a disproportionate increase in queuing delay. It does deliver high bandwidth and low latency, which is great. But it also delivers something else: domination over the competitors. The graph above is taken from a presentation by Geoff Huston at the RIPE 76 conference. It shows what happens when a 10 Gbps link is shared between two connections, one managed using the Cubic algorithm that ships with Linux and Windows, and the other with BBR. In Geoff’s words, “that was brutal”: very soon after the BBR connection starts, the throughput of the Cubic connection drops to almost zero. I don’t have actual measurements, but it may well be that if two users are trying to access the web from the same home, the one accessing Google servers using Chrome will get most of bandwidth, while the other accessing some other server will be left with a very slow connection.
Let’s be clear and not rush to assign bad intentions to my colleagues at Google. There is no proof that the dramatic effect shown in Geoff’s slide was deliberate. It may well be the unintended side effect of the design priority, achieving low latency. It may well be that the design will be fixed to achieve better behavior, since after all the code can be readily updated. Geoff Huston’s detailed analysis of BBR shows that there are indeed some efforts in that direction. But still, the optics are not good.
Faced with results like that, the competitors are going to react. They cannot really accept to be pushed to a slow lane. The most expedient development could be to just use QUIC, since the IETF standardization has well progressed, and of course ship it with their own copy of BBR. But there is the latent temptation to do even better, to “move to an even faster lane”. Maybe they will just change a couple of parameter values in the BBR algorithm to make it “faster”. Maybe they will invent their very own congestion control algorithm. Remember, there are no gatekeepers any more. If you can ship the same software to your app and to your servers, you can compete.
That’s not an entirely new problem. Some applications have been accused to do that for a long time. For example, some video conference applications are said to send video just as fast as they could, because they are “real-time”. (In fact, congestion control for video conferences was demonstrated in the early 90’s.) But having wide spread deployment of new congestion control algorithms changes the game. In fact, it exposes a gap in the current Internet architecture. The gatekeepers in the operating systems pretty much guaranteed that network managers could rely on end-to-end congestion control to guarantee stability. Well, not anymore.
The solution is probably a widespread deployment of Active Queue Management. There are many AQM algorithms. Random Early Detection and Explicit Congestion Notification were proposed years ago. More recent algorithms were developed in the IETF AQM Working Group to address the “buffer bloat” problem, such as FQ-CODEL or PIE. My hope is that we will see a virtuous cycle: AQM algorithms get deployed at bottlenecks and enforce that each class of traffic gets a fair share. If network providers do that, there will be much less incentive to deploy “unfair” congestion control algorithms in applications. On the contrary, the algorithms will evolve to stay within the allocated fair share, while keeping queues as short as possible. That would be much better than collapsing the Internet!
That’s interesting. I’ve been flabbergasted years ago by the lack of congestion control between computers on the same network. Worse, it took twenty years for french (i don’t know about others) telecom companies to get the right quality of service for phone over adsl. It feels very much like 20th century technology to me.
While I agree with your thesis, about the problem!
I am very bothered by your descriptions of the timelines and who were involved. Other processes are required long before something hits the ietf and recent attempts to file the serial numbers off in favor of corporate “innovation” rather bug me, so:
1) “More recent algorithms were developed in the IETF AQM Working Group to address the “buffer bloat” problem, such as FQ-CODEL or PIE. ”
fq_codel sprang from an outside the ietf effort (bufferbloat.net) founded by myself and jim gettys in 2011. In may of 2012 (after many other innovations in linux such as BQL (which made FQ and AQM technology possible), multiple other fixes in the stack, the publication of Van Jacobson’s and Kathie nichols’s paper on codel (which we turned from ns2 to linux in a week, and arrived in linux mainline the week later)… and two weeks later .- fq_codel incorporated the best of all our research. It took 6 hours to write, and while there have been many minor tweaks along the way, it then took 6 years to standardize in the IETF while achieving a near total deployment in linux today, and is now in freebsd.
The ietf AQM working group was founded only because of VJ and Kathie’s breakthrough AQM design. It was a hard fight to even get fair queuing as part of the charter.
2) QUIC’s real history started with a renegade engineer (Jim Roskind, father of QUIC) that gathered a small team inside google to re-examine tcp in the context of web traffic around 2011 – 3 years before you claim it happened. See the commit logs. They re-evaluated 30 years of discarded tcp ideas, and retried them, much in the manner of how edison would try 3000 ideas to get one. Month in month out, they built release after release, wrote the code, deployed it, made changes to the code and protocol on a monthly basis. They faced enormous barriers by folk that thought we should just fix tcp, or laughed at each new idea and said that couldn’t (or shouldn’t) be done.
They just went ahead and did it.
Every time they made a “breaking change” in the protocol they bumped the version number. Sometimes it was crypto, sometimes frames, sometimes congestion control, etc.
It went through *20* *deployed* revisions before it made the ietf.
Looking at the wire spec https://docs.google.com/document/d/1WJvyZflAO2pq77yOLbp9NsGjC1CHetAXV8I0fQe-B_U/edit
You can see the long list of recent versions:
Q009: added priority as the first 4 bytes on spdy streams.
Q010: renumber the various frame types
Q011: shrunk the fnv128 hash on NULL encrypted packets from 16 bytes to 12 bytes.
Q012: optimize the ack frame format to reduce the size and better handle ranges of nacks, which should make truncated acks virtually impossible. Also adding an explicit flag for truncated acks and moving the ack outside of the connection close frame.
Q013: Compressed headers for *all* data streams are serialized into a reserved stream. This ensures serialized handling of headers, independent of stream cancellation notification.
Q014: Added WINDOW_UPDATE and BLOCKED frames, no behavioral change.
Q015: Removes the accumulated_number_of_lost_packets field from the TCP and inter arrival congestion feedback frames and adds an explicit list of recovered packets to the ack frame.
Q016: Breaks out the sent_info field from the ACK frame into a new STOP_WAITING frame.
Changed GUID to Connection ID
Q017: Adds stream level flow control
Q018: Added a PING frame
Q019: Adds session/connection level flow control
Q020: Allow endpoints to set different stream/session flow control windows
Q021: Crypto and headers streams are flow controlled (at stream level)
Q023: Ack frames include packet timestamps
Q024: HTTP/2-style header compression
Q025: HTTP/2-style header keys. Removal of error_details from the RST_STREAM frame.
Q026: Token binding, adds expected leaf cert (XLCT) tag to client hello
Q027: Adds a nonce to the server hello
Q029: Server and client honor QUIC_STREAM_NO_ERROR on early response
Q030: Add server side support of certificate transparency.
Q031: Adds a SHA256 hash of the serialized client hello messages to crypto proof.
Q032: FEC related fields are removed from wire format.
Q033: Adds an optional diversification nonce to packet headers, and eliminates the 2 byte and 4 byte connection ID length public flags.
Q034: Removes entropy and private flags and changes the ack frame from nack ranges to ack ranges and removes truncated acks.
Q035: Allows each endpoint to independently set maximum number of supported incoming streams using the MIDS (“Maximum Incoming Dynamic Streams”) tag instead of the older MSPC (“Maximum Streams Per Connection”) tag.
Q036: Adds support for inducing head-of-line blocking between streams via the new FHOL tag in the handshake.
Jim Roskind is a hero.
…
Your article also conflates QUIC with BBR. BBR is a congestion control algorithm. QUIC is a protocol that can use any congestion control algorithm.
Thanks for the insights. It is a good read, something I was really looking forward to indeed. Keep writing more of such masterpieces.