The Pi-hole and DNS privacy

People who install it love the Pi-hole. The Pi-hole is DNS server software, typically running on a Raspberry Pi, that filters the DNS requests coming out of a local network and can, for example, drop connections to advertisers. Once you install it, web browsing becomes notably snappier, not to mention much more private. But there is a catch. If you also configure your device or your browser to connect to an encrypted service, for example using DNS over HTTPS (DoH) to route DNS requests to Google, Cloudflare or Quad9, the DNS requests will go straight to that server, bypassing the Pi-hole and its filtering.

We have known of the tension between encrypted DNS and DNS filtering for some time. Encrypted DNS is specifically designed to prevent intermediaries from meddling with the DNS traffic. Paul Vixie, for example, has been arguing for a long time that DoH is shifting power from network managers to big sites, and is in effect building an unbreakable pipeline for serving ads. Of course, the more people deploy Pi-hole style filtering, the more advertisers will push for solutions like DoH that bypass these filters.

DNS filtering improves privacy in the same way that ad blockers do. Consider that when loading a page, the advertisement systems run an auction to determine which advertiser will win the right to place an impression on that page. In the process, your browser history is captured not just by the advertiser that wins the auction, but by all advertisers that participate in it. I personally believe that this trafficking in private data should be illegal, but as of now in the USA it is not. So, yes, any sensible Internet user should be blocking ads. But blocking ads is not enough. The DNS traffic still flows in clear text, and the upstream intermediaries still collect DNS metadata. They can and probably will sell your browser history to advertisers. They can also implement their own filtering, censoring sites for political or commercial reasons. Only encryption can prevent that.

So, what next? First, it is fairly obvious that systems like the Pi-hole should be using DNS encryption to fetch DNS data. Instead of merely passing the filtered DNS requests to the default resolver and relying on its goodwill, they should forward them to a trusted service using encrypted DNS, be it DoH, DNS over TLS (DoT) or DNS over QUIC (DoQ). It would also be a good idea to let the Pi-hole act as an encrypted DNS server, so that browsers could connect directly to it rather than going all the way to Google or Cloudflare, although the discovery protocols to enable just that are still being standardized in the IETF ADD working group. But systems like the Pi-hole only protect people when they are browsing at home, in the network in which they have installed the filters.
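
The filtering step itself is simple; what changes is where the allowed queries go next. The Python sketch below illustrates the idea under stated assumptions: the blocklist entries, domain names and function names are all hypothetical, and the encrypted upstream is reduced to a placeholder.

```python
# Sketch of the filtering step a Pi-hole style resolver applies before
# forwarding over an encrypted upstream (DoT/DoH/DoQ). The blocklist
# entries and domain names are hypothetical examples.

BLOCKLIST = {"ads.example", "tracker.example"}

def should_block(qname: str) -> bool:
    """Block a query if the name or any parent domain is blocklisted."""
    labels = qname.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in BLOCKLIST:
            return True
    return False

def forward_encrypted(qname: str) -> str:
    # Placeholder: a real implementation would send the query over
    # DoT, DoH or DoQ to a trusted resolver here.
    return "192.0.2.1"

def resolve(qname: str) -> str:
    if should_block(qname):
        return "0.0.0.0"  # sinkhole answer, as the Pi-hole does
    return forward_encrypted(qname)
```

The point of the sketch is that filtering and encryption are orthogonal: the `should_block` decision is local, while the upstream query can use any encrypted transport.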

For mobile devices, we need end-to-end solutions, solutions that work whether or not the local network is doing appropriate filtering. We also want a solution that is independent of the advertisers, because I certainly agree with Paul Vixie that sending all your browser history to Google does not improve privacy. We cannot rely only on personal DNS servers, because then advertisers would be able to attribute server traffic to specific users. What we want are moderate-scale servers, with enough users that traffic cannot be directly traced to individuals. We know how to do that technically: deploy in the network something like a version of the Pi-hole capable of DNS encryption. What we don’t have is a model in which such servers can recoup their costs without relying on advertisements!

Posted in Uncategorized | Leave a comment

Fixing a potential DOS issue in the QUIC handshake

Back in May, there was a discussion on the QUIC implementers’ chat room. What if a client played games with acknowledgements during the initial exchange? Could it be used in a DOS amplification attack? Or maybe some other form of DOS attack?

In the proposed attack, the attacker sends a first packet, which typically contains the TLS 1.3 “Client Hello” and its extensions. The server will respond with up to three packets, the first of which is an Initial packet containing the TLS 1.3 “Server Hello” and its clear text extensions, followed by Handshake packets containing the remainder of the TLS server flight, or at least part of it. These packets are sent to the “client address” read from the incoming UDP message. The attacker has set that to the IP address of the target, not its own. It will not receive them, and it cannot guess the encryption key required to decrypt the Handshake packets. But the attacker can guess that the packet number of the Initial packet was zero, because that’s the default. The attacker can thus create a second Initial packet, in which it acknowledges packet number 0 from the server.

Showing a modified QUIC handshake, spoofed ACK validates the connection, server repeats many packets.
Spoofed ACK attack for DDOS Amplification

One risk is that the server accepts the incoming acknowledgement and treats the IP address of the client as “validated”. The DOS amplification protections would be lifted, and the server may start sending more packets towards the IP address of the client – or rather, towards the IP address of the target in our case. That would be bad.

It turns out that the picoquic implementation is protected against that attack: it only validates the address if the incoming Initial packet carries the “connection ID” set by the server in its initial message. The only problem happens when the server uses a zero-length connection ID – a rare deployment, but one in which the fake packet would manage to validate the connection. That was a bug. It is now fixed. With that, the “fake acknowledgement” attack cannot be used to amplify DOS attacks.

That does not mean that the spoofed acknowledgement isn’t bad. It becomes just one of the many ways for third parties to spoof packets and interfere with the state of the connection. It is also an attack that is quite easy to fix. The attack depends on predicting the Initial packet number used by the server. It is easy to randomize that number. A recent code update just makes that the default behavior. As an added benefit, this means that acknowledgement of randomized packet numbers can be used to validate the client address, which improves the robustness of the connection establishment.
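
To illustrate the randomization fix, here is a minimal Python sketch; the class and method names are hypothetical and do not reflect picoquic’s actual API.

```python
import secrets

# Sketch of the mitigation: the server picks a random Initial packet
# number, so an attacker that never sees the server's packets cannot
# forge a valid-looking acknowledgement. Names are illustrative.

class ServerHandshakeState:
    def __init__(self):
        # Random 31-bit starting packet number instead of the default 0.
        self.initial_pn = secrets.randbits(31)
        self.address_validated = False

    def on_initial_ack(self, acked_pn: int) -> bool:
        # Only an endpoint that actually received the server's Initial
        # packet can know this number; a spoofing attacker must guess it.
        if acked_pn == self.initial_pn:
            self.address_validated = True
        return self.address_validated
```

With a predictable packet number 0, a spoofed acknowledgement would pass this check; with a random starting number, a correct acknowledgement is itself evidence of a working return path.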


A Low Latency Internet with L4S

Many network applications work better with low latency: not only video conferencing, but also video games and many other transaction-based applications. The usual answer is to implement some form of QoS control in the network, managing “real-time” applications as a class apart from bulk transfers. But QoS is a special service, often billed as an extra, requiring extra management, and relying on the cooperation of all network providers on the path. That has proven very hard to deploy. What if we adopted a completely different approach, in which all applications benefited from lower latency? That is basically the promise of the L4S architecture, which has two components: a simple Active Queue Management scheme implemented in network routers provides feedback to the applications through Explicit Congestion Notification (ECN) marks, and an end-to-end congestion control algorithm named “Prague” uses these marks to regulate network transmissions and avoid building queues. The goal is to obtain Low Latency and Low Losses, the four Ls in L4S – the final S stands for “Scalable throughput”.
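
To make the feedback loop concrete, here is a rough Python sketch of the DCTCP-style reaction that Prague builds on: the endpoint keeps a moving average of the fraction of ECN-marked packets and shrinks its window in proportion to it. The gain and the function names are illustrative, not the values of any real implementation.

```python
# Illustrative sketch of a DCTCP-style congestion reaction: "alpha"
# tracks the fraction of CE-marked packets via an exponential moving
# average, and the window shrinks proportionally to alpha instead of
# halving on every congestion signal. The gain G is an assumption.

G = 1.0 / 16.0  # EWMA gain, illustrative

def update_alpha(alpha: float, acked: int, ce_marked: int) -> float:
    """Blend the latest marking fraction into the running average."""
    frac = ce_marked / acked if acked else 0.0
    return (1 - G) * alpha + G * frac

def react_to_marks(cwnd: float, alpha: float) -> float:
    """Mild, proportional window reduction instead of Reno's halving."""
    return cwnd * (1 - alpha / 2)
```

Because the reduction is proportional to the observed marking fraction, the sender can keep the queue short without the deep saw-tooth oscillations of loss-based control.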

I am developing an implementation of the Prague algorithm for QUIC in picoquic, updating an early implementation by Quentin Deconinck (see this PR). I am testing the implementation using network simulations. The first test uses L4S to control the transfer of 4 successive 1 MB files over a simulated 10 Mbps connection, and the following QLOG visualization shows that after the initial “slow start”, the network traffic is indeed well controlled:

QLOG Visualization, simulation with Prague CC

For comparison, using the classic New Reno algorithm on the same simulated network shows a very different behavior:

QLOG Visualization, simulation with New Reno CC.

The Reno graph shows the characteristic saw-tooth pattern, with the latency climbing as the buffers fill up, until some packets are dropped and the window size is reset. In contrast, the Prague graph shows a much smoother behavior. It is also more efficient:

  • The transfer time is reduced to 3.7 seconds with Prague, instead of 3.8 seconds with Reno.
  • The number of packet losses is reduced to 5, versus 45 with Reno.
  • The post-startup maximum latency is reduced to 29 ms, versus 40 ms with Reno. (The Reno latency is only capped by the size of the buffers in the simulation.)

The implementation effort did find some issues with the present specification: the Prague specification requires smoothing the ECN feedback over several round trips, but the reaction to congestion onset is then too slow; the redundancy between early ECN feedback detecting congestion and packet losses caused by the congestion needs to be managed; something strange is happening when traffic resumes from an idle period; and, more effort should be applied to managing the start-up phase. I expect that these will be fixed as the specification progresses in the IETF. In any case, these numbers are quite encouraging, showing that L4S could indeed deliver on the promise of low loss and low latency, without causing a performance penalty.

Other protocols, like BBR, also deliver low latency without requiring network changes, but there are limits, such as still suffering congestion caused by applications using Cubic, and slow reaction to underlying network changes. BBR is worse than Prague in that respect but performs better on networks that do not yet support L4S, or on high-speed networks. Ideally, we should improve BBR by incorporating the L4S mechanisms. With that, we could evolve the Internet to always deliver low latency, without having to bother with the complexity of managing QoS. That would be great!

Edited, a few hours later:

It turns out that the “strange [thing] happening when traffic resumes from an idle period” was due to an interaction between the packet pacing implementation in Picoquic and the L4S queue management. Pacing is implemented as a leaky bucket, with the rate set to the data rate computed by congestion control and the bucket size set large enough to send small batches of packets. When the source stops sending for a while, the bucket fills up. When it starts sending again, the full bucket allows for quickly sending this batch of packets. After that, pacing kicks in. This builds up a small queue, which is generally not a big issue. But if the pacing rate is larger than the line rate, even by a tiny bit, the queue does not drain, and pretty much all packets get marked as “congestion experienced.”
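
A minimal leaky-bucket pacer sketch in Python shows the interaction: after an idle period the bucket is full, so a burst of packets goes out at once before pacing resumes. All names and numbers are illustrative, not picoquic’s actual code.

```python
# Leaky-bucket pacer sketch: credits refill at the pacing rate, capped
# at the bucket size. After an idle period the bucket is full, so a
# small batch of packets can be sent back-to-back before pacing
# throttles transmission again.

class Pacer:
    def __init__(self, rate_bps: float, bucket_bytes: float):
        self.rate = rate_bps / 8.0      # pacing rate in bytes per second
        self.bucket_max = bucket_bytes
        self.bucket = bucket_bytes      # starts full, as after an idle period
        self.last_time = 0.0

    def can_send(self, now: float, packet_bytes: float) -> bool:
        # Refill credits for the elapsed time, capped at the bucket size.
        self.bucket = min(self.bucket_max,
                          self.bucket + (now - self.last_time) * self.rate)
        self.last_time = now
        if self.bucket >= packet_bytes:
            self.bucket -= packet_bytes
            return True
        return False
```

With a 3000-byte bucket and 1500-byte packets, two packets leave immediately after an idle period; the third must wait for the bucket to refill, which is exactly the post-idle burst described above.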

This was fixed by testing the time needed to send a full window of packets. If the sender learns that many packets were marked on a link that was not congested before, and that the “epoch” lasted longer than the RTT, then it classifies the measurement as “suspect” and only slows down transmission a little. With that, the transmission becomes even smoother, and the transfer time in the test drops to less than 3.5 seconds.


An introduction service for Mastodon

It seems that many people have started using Mastodon recently. But the place feels a bit empty, especially if, like me, you join a small server. How do I find my friends on Mastodon? Today, there are two ways: they tell me their handle using some other service, or I stumble upon one of their posts promoted by a common friend. This might work eventually, but it is much less straightforward than just searching by name or email address, something that is easy on Facebook. Of course, finding people should not be too easy, because that would provide avenues for spam or harassment. Facebook provides options, such as only making your account visible to “your friends” or “the friends of your friends”. These options, and search in general, are straightforward to implement on a centralized server. Can we implement them in a decentralized network?

One obvious and wrong solution would be to broadcast search requests to every server in the whole network. That’s obviously wrong because it creates a huge load, and because it broadcasts private information to a lot of servers who don’t need to know about it. Another obvious and wrong solution is to create a distributed database of all registered names, maybe as a distributed hash table. That would create less load than broadcasting all name queries, but that would still broadcast the name information and create big privacy issues.

My preferred solution would be for users to send the queries to all their friends, as in “Hey, Bob, this is Alice, do you know Carol’s handle?” The load is lower than broadcasting queries. The privacy risks are more contained, although not eliminated. The queries will likely be visible to Bob’s server, and that server can build a database of all of Alice’s friends. Big servers would build bigger databases. This might provide incentives for centralization.

Can we design the queries in ways that minimize the privacy risks? We could use end-to-end encryption, so that the queries are only visible to the actual users, not their servers. We could also encode queries so they carry a one-way hash of the queried name, as in “Hey, Bob, this is Alice, do you know the handle of the user whose name hashes with this nonce to produce 0xdeadbeef?” Bob will have to compute the hashes of all his friends’ names, see whether one matches, and then return it.
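
A small Python sketch of that hashed lookup, with made-up handles, could look like this:

```python
import hashlib

# Sketch of the hashed query described above: Alice sends a nonce and
# the hash of the name she is looking for; Bob hashes each of his
# friends' names with the same nonce and answers only on a match.
# The handles and names are made up for the example.

def name_digest(name: str, nonce: bytes) -> bytes:
    """One-way hash of a (normalized) name, salted with the query nonce."""
    return hashlib.sha256(nonce + name.lower().encode()).digest()

def answer_query(friends: dict, nonce: bytes, digest: bytes):
    """friends maps display names to handles; return the handle on a match."""
    for name, handle in friends.items():
        if name_digest(name, nonce) == digest:
            return handle
    return None
```

The fresh nonce per query prevents Bob from precomputing a dictionary of popular names once and reusing it, although it does not stop him from hashing his own contact list on each query.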

Of course, the security worries do not end there. If we have any kind of automated search, it can be abused. Spammers do that on Facebook today: they manage to befriend someone, download the list of that person’s friends, and then try to befriend every person in that list. Encrypting or hashing queries will not prevent that, but something else might. For example, when receiving the query, Bob might relay it to Carol: “Hey, Carol, this is Bob. I got a message from Alice asking for your handle, shall I respond?” Carol can then decide.

We could also make the messages more complex. Facebook, for example, decorates friend requests with lists of common friends, so the user can make an informed decision. This falls into the general category of “introduction services”. These are hard to design, but we must start somewhere. It would make a ton of sense to add a distributed introduction service to a distributed social network, so maybe we should just start there.


Improving the reliability of the QUIC Handshake

The QUIC specification was finalized by the IETF a year ago, but interop tests still detect occasional failures. The most challenging tests are series of 50 connection attempts in a lossy environment, with about 30% packet loss. The tests only pass if all attempts succeed. You might think that 30% packet loss is not a realistic environment, but it is a very good way to test the implementation of the “state machine” and detect loose ends. One specific loose end is the interaction between packet losses and the “anti-amplification” feature of QUIC, which requires servers to limit how much data they send until the client has completed a three-way handshake.

The typical QUIC handshake proceeds as shown in Figure 1: the client sends an “initial” packet, the server responds with its own Initial packet that provides the “server hello” and the “handshake” key, and then with a couple of “handshake” packets that enable the client to compute the “application” key and finalize the exchange. The important elements here are that the client cannot decrypt the Handshake packets if it has not received the server’s Initial packet, and will not be able to send its own crypto data until it has received the server’s Handshake packets. The other important element is that by the time the server has sent its Initial and Handshake packets, it has exhausted its “anti-amplification” credits, and will not send any more data until it receives a packet from the client that completes the three-way handshake. Now, let’s examine what happens if some packets are lost.

Initial and handshake packets in QUIC handshake
Figure 1 – Typical QUIC Handshake

If the client initial is lost, the client will repeat it after a timeout. Similarly, if the client handshake packet is lost, the client will repeat it. All implementations do that. But there are more interesting failures if some of the server packets are lost. The first case is, what if the server’s initial packet is lost?

QUIC handshake diagram, initial server packet lost, server handshake packets ignored.
Figure 2 – What if Initial server packet is lost?

A client that does not receive the server’s Initial packet has only one option, because it does not know whether its own packet was lost or whether the server’s packet was lost. In theory, the client could notice the arrival of Handshake packets, but those packets cannot be decrypted yet. The client will thus repeat its own Initial packet. When it receives that, the server has multiple options. A simple and not very helpful one would be to just send back an Initial acknowledgement, telling the client to stop repeating. The really bad option is to just repeat the Handshake packets, because the client cannot understand them. The more helpful option is to repeat the server’s Initial packet, including the crypto frames, so the client can start processing handshake messages. Proactively repeating the Handshake messages would also be helpful. Most implementations handle that case correctly.

The more difficult case is what happens if the handshake packets are lost, and specifically if the client’s initial packet is acknowledged.

QUIC Handshake diagram, handshake packets lost, question mark indicates the need for client action.
Figure 3 – What if handshake packets are lost

Suppose that the handshake packets sent by the server are lost. The client has received the initial packet from the server and can now send or receive handshake packets. The natural option for the client is to send an initial acknowledgement, but if that packet is lost the whole exchange is going to stop:

  • According to the QUIC specification, the “acknowledgement only” packet sent by the client does not trigger an acknowledgement from the server, so per the basic rules the client will not have to repeat it.
  • The server has exhausted its “anti-amplification” credits, so it won’t send anything more.

That scenario has occurred many times during interop testing. The client has to be programmed to break that silence by repeating some packets. There are three options:

  • Repeat an Initial packet
  • Repeat a Handshake packet
  • Repeat both Initial and Handshake packets

My preference is to repeat a Handshake packet, because this demonstrates to the server that the client has received all the required information in the Initial packet, and is thus more or less equivalent to sending both an Initial packet with an ACK and the Handshake packet. The Handshake packet cannot carry an acknowledgement, because no Handshake packet has been received, but it can carry a PING frame. If the server is smart, it will immediately repeat the missing packets, but if it is not, that might require another round trip, as shown in the following figure:

QUIC Handshake diagram, server handshake packets lost, client repeats handshake packets containing PING messages, forces server to send ACK, upon reply server repeats missing packets.
Figure 4 – Breaking the deadlock

The first Handshake packet received by the server from the client completes the three-way handshake and gets the server out of anti-amplification mode. A not-too-smart server will merely acknowledge that packet, and a smart client will in turn immediately acknowledge that Handshake packet, even though it only contained an ACK frame. When the client acknowledges the third Handshake packet of the server, most servers will immediately repeat their initial Handshake packets, and the QUIC handshake will proceed.

In truth, this procedure is already specified in Section 8.1 of RFC 9000: “Loss of an Initial or Handshake packet from the server can cause a deadlock if the client does not send additional Initial or Handshake packets. A deadlock could occur when the server reaches its anti-amplification limit and the client has received acknowledgments for all the data it has sent. In this case, when the client has no reason to send additional packets, the server will be unable to send more data because it has not validated the client’s address. To prevent this deadlock, clients MUST send a packet on a Probe Timeout (PTO); see Section 6.2 of [QUIC-RECOVERY]. Specifically, the client MUST send an Initial packet in a UDP datagram that contains at least 1200 bytes if it does not have Handshake keys, and otherwise send a Handshake packet.” But that short paragraph is often glossed over. Let’s hope that belaboring the point a little will help reduce the number of interop failures!
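
The quoted rule boils down to a one-line decision, sketched here in Python with hypothetical names:

```python
# Sketch of the RFC 9000 Section 8.1 rule: on a Probe Timeout during
# the handshake, the client sends an Initial packet (padded to at
# least 1200 bytes) if it has no Handshake keys yet, and otherwise a
# Handshake packet, e.g. one carrying a PING frame. The function and
# field names are illustrative, not any implementation's API.

def client_pto_probe(has_handshake_keys: bool) -> dict:
    if not has_handshake_keys:
        # No Handshake keys: only an Initial packet can be sent, and it
        # must be padded so the datagram reaches 1200 bytes.
        return {"packet_type": "initial", "min_datagram_size": 1200}
    # Handshake keys available: probe with a Handshake packet instead,
    # which also proves the client processed the server's Initial.
    return {"packet_type": "handshake", "frames": ["PING"]}
```
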


To DoH or not to DoH or to encrypt the SNI

Many Internet Service Providers are nice companies who love their customers, but quite a few have developed a sideline of collecting logs of traffic and selling them to advertisers, or in fact to whoever is willing to pay. Even the FTC is not happy about that. New technology like “DNS over HTTPS”, or DoH, enables encrypted DNS resolution through the services of a third party, and thus lets users hide whole chunks of metadata from their ISP. That seems like a win for privacy, until we realize that the main alternative DNS services are provided by big companies like Google or Cloudflare. And so, people have a big dilemma in front of them: do they leak their stream of DNS queries to their ISP, or do they leak it to Google or Cloudflare instead? My concern is that in most cases it is not “instead of” but “in addition to”, at least if we look at the entire picture, not just the DNS stream in isolation. And if you do not look at it in isolation, you realize that the effort should focus on encrypting the SNI in HTTPS connections.

Assume the classic example of loading “”. Today, in practice, that means asking for the IP address of “” through a DNS API, and then establishing an encrypted HTTPS connection to that address. So, who sees what? The DNS provider sees the stream of requests, but these are not the only “muddy footprints”, to reuse the metaphor from Geoff Huston’s presentation. The ISP sees lots of things. It sees the IP traffic, and thus knows that some customer at IP address X is connecting to some service at IP address Y. The ISP also sees the unencrypted part of the HTTPS connection, which today includes the plain text name “”.

When the average user relies on their ISP’s DNS and then loads the page, the ISP sees the DNS request for “”, the IP addresses in the packets, and the SNI “” in the cleartext data. Google or Cloudflare see none of that, unless they happen to have some business relation with the company behind “”. (In many cases they do, but hopefully not in all cases.)

When the privacy-seeking user tries to hide their footprints by using an encrypted connection to a third-party DNS service, the ISP still sees the IP addresses in the packets, and the SNI “” in the cleartext data. Getting that data might be a bit less convenient than just extracting a log of DNS queries, but if the ISP is set on extracting revenue from the metadata it will make the effort. So, for the ISP’s business, not much changes. But now, in addition to the ISP seeing the data, the third party also sees the DNS stream. The true description is that, instead of just the ISP collecting metadata, both the ISP and the third-party DNS provider do. What privacy did the user gain, exactly?

That is why I believe any privacy gain depends on encrypting the SNI. Let’s go back to the IP address resolution. In a lot of cases, the web site “” is not accessed directly, but rather through a content distribution service like, say, “”. When the ISP sees the traffic flowing from address X to address Y, address Y does not directly identify “”. Instead, it points to a server managed by “”. Suppose now that “” and “” cooperate and deploy “Encrypted Client Hello” (ECH), the emerging specification for encrypting metadata like the SNI. If they do it right, we get to a stage where neither the ISP nor the DNS provider gets more metadata than “address X is connecting to one of the services hosted by”.

How does it work? Using ECH properly requires two DNS transactions. In the first transaction, the user’s application learns that “” is accessed through “”, and also learns what key to use to encrypt the metadata. In the second transaction, the application resolves the name “” and obtains the address of the nearest server instance. Now, let’s suppose a little bit of magic that allows the user to do the first transaction in a private way, and then cache the result.
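
Here is a data-flow sketch, in Python, of those two transactions. Since the actual example names are elided above, the sketch uses entirely hypothetical ones (“private.example” for the hidden site, “cdn.example” for the hosting service); it is not a real ECH implementation.

```python
# Data-flow sketch of the two DNS transactions used with Encrypted
# Client Hello. All names, addresses and record contents are made up.

HTTPS_RR = {
    # First transaction: the HTTPS resource record, ideally fetched
    # through a third party and cached, reveals the hosting service's
    # public name and the key used to encrypt the real SNI.
    "private.example": {"public_name": "cdn.example", "ech_config": b"\x01\x02"},
}

A_RECORDS = {
    # Second transaction: resolving the public name yields the address
    # of the nearest server instance of the hosting service.
    "cdn.example": "203.0.113.7",
}

def connect_with_ech(site: str) -> dict:
    rr = HTTPS_RR[site]
    return {
        "server_ip": A_RECORDS[rr["public_name"]],
        "outer_sni": rr["public_name"],               # what the ISP can observe
        "encrypted_inner_sni_key": rr["ech_config"],  # the real name stays hidden
    }
```

Observers on the path see only the public name and the shared server address; the name of the actual site travels encrypted inside the Client Hello.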

When it comes to reading the page, we don’t get the same leaks as before. If we use the ISP resolver, the ISP will see a request for “”, traffic from IP X to IP Y corresponding to “”, and an SNI text that has been transformed to say “”. Success. It is hard to see what privacy could be gained by sending the traffic to Google or Cloudflare instead.

Now, of course, there is still a need to perform the first transaction beforehand. The information is stored in the “HTTPS RR” DNS record provided by the target server, in our case. By default, it is certainly best not to load that record through the ISP, which could otherwise combine two pieces of metadata: looking up the ECH information for “” and then connecting to “”. Instead, it makes sense to look up that information through a third party, and then cache the result.

So, maybe we won’t be leaking information to the ISP, and also not to the big companies deploying third party DNS services. But we do need to encrypt the SNI and learn how to ECH!


One-way delays for multipath QUIC

Multipath transmission is interesting, and I have been working quite a bit on implementing multipath support in QUIC. There are multiple ways to do that, some simple and some more systematic, but all designs have the issue of timers. In single-path QUIC, the situation is simple: data is always sent on the same path, and acknowledgements are always received on the same path. In multipath QUIC, data is sent on multiple paths, and acknowledgements for that data can be received on any available path. Those paths can well have different latencies, and queues can build up at various places. There is no single “round-trip time”, but different values depending on the combination of send path and receive path.

I think that multipath implementations need to estimate one-way delays for each of the paths, and use combinations of one-way delays to predict the “end-to-end” delay expected for a specific combination of send path and acknowledgement path. But then, we have to compute those one-way delays.

There is no unique solution, but it does not matter

When computing one-way delays, for each path there is not one delay estimate but two: the delay from the local node to the peer, and the delay from the peer back to the local node. We can note them Dxu and Dxd, where x is the path index and the suffix is “u” for sending “up” on that path, or “d” for receiving “down” on it. The delay for “sending on path 1 and receiving the ack on path 2” will thus be:

  • R12 = D1u + D2d

If we have N paths, there will be N^2 possible combinations of upload and download, and thus N^2 different types of measurements. That looks like as many equations as there are variables, but these equations are not linearly independent. Consider for example the case of two paths:

  • R11 = D1u + D1d
  • R12 = D1u + D2d
  • R21 = D2u + D1d
  • R22 = D2u + D2d

The equation system only has a solution if

  • 0 = R12 – R11 + R21 – R22

in which case it has an infinity of solutions of the form (D1u+x, D1d-x, D2u+x, D2d-x). We can only assess one-way delays if we fix this variable “x” to an arbitrary value. This is not an issue in practice, because we do not directly use the one-way delays. In transport protocols, we use end-to-end delays to assess whether a packet may be lost, and to assess the bandwidth-delay product. Some congestion control protocols operate by monitoring the variation of transmission delays. In all these cases, the arbitrary variable “x” does not matter. When computing sums such as “D1u + D1d”, it cancels out. When computing variations over time such as “D1u(t) – D1u(t-T)”, it also cancels out. So, in practice, it suffices to set the variable x to an arbitrary constant value.
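
A quick numeric check, in Python, confirms both points: the four equations are linearly dependent, and shifting all the one-way delays by any x leaves every observable round trip unchanged. The delay values are arbitrary examples, in milliseconds.

```python
# Numeric check of the argument above: only sums of an "up" delay and
# a "down" delay are observable, so shifting every Dxu by +x and every
# Dxd by -x changes nothing that a node can measure.

def round_trips(d1u, d1d, d2u, d2d):
    return {"R11": d1u + d1d, "R12": d1u + d2d,
            "R21": d2u + d1d, "R22": d2u + d2d}

base = round_trips(30, 20, 50, 40)          # example one-way delays in ms
shifted = round_trips(30 + 7, 20 - 7, 50 + 7, 40 - 7)  # x = 7

# The linear dependency between the four measurements:
dependency = base["R12"] - base["R11"] + base["R21"] - base["R22"]
```

Both `dependency == 0` and `shifted == base` hold, matching the claim that the one-way split is only determined up to the arbitrary constant x.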

Timestamps are useful

Let’s assume that acknowledgement packets contain a timestamp, as specified for example in the QUIC timestamps draft. With that, the nodes can assess the time at which a packet was sent, the time at which it was received and the acknowledgement sent, and the time at which the acknowledgement was received. Each time a new acknowledgement is received, the sender can measure the one-way delays on the up and down paths. The nodes can compute statistics on the one-way delays, such as the average value and average deviation, much like they compute those statistics today for round-trip times. They can use these statistics to predict round-trip times for arbitrary combinations of “up” and “down” paths.

Of course, the uncertainty that we just discussed still applies. The time at which a packet was sent or the acknowledgement received is measured in “local” time, using the clock of the sender. The time at which the packet was received or the acknowledgement sent is measured in “remote” time, using the clock of the receiver. There may be an arbitrary difference between these two clocks, but this does not matter if we are only using the one-way delays to predict round trip time or to track variations over time.
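
A tiny Python sketch makes the cancellation visible: inject a made-up clock offset into the remote timestamps and check that it disappears when the two one-way delays are recombined. All times and names are illustrative.

```python
# The remote clock runs ahead of the local clock by an offset that
# neither node knows. The measured "up" delay absorbs +OFFSET and the
# measured "down" delay absorbs -OFFSET, so their sum is offset-free.

OFFSET = 1000.0  # remote clock offset in ms, unknown to the nodes

def one_way_delays(t_sent, t_recv_remote, t_ack_sent_remote, t_ack_recv):
    up = t_recv_remote - t_sent           # includes +OFFSET
    down = t_ack_recv - t_ack_sent_remote # includes -OFFSET
    return up, down

# Packet sent at local time 0, received 30 ms later (remote timestamp),
# ack sent 1 ms after that, ack received at local time 51 ms.
up, down = one_way_delays(0.0, 30.0 + OFFSET, 31.0 + OFFSET, 51.0)
predicted_round_trip = up + down  # the offsets cancel out
```

The individual values `up` and `down` are meaningless in absolute terms, but any sum of an up and a down delay, which is all the transport ever needs, is exact.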

Without timestamps, we need approximations

It is still possible to assess one-way delays in the absence of timestamps, but it gets more complicated. Suppose for example that at time “t”, a node receives on path 2 an acknowledgement of a packet sent on path 1. It obtains a measurement of R12, but it would like to assess the “hidden variables” D1u and D2d. We can frame that as a problem of estimating values from noisy measurements. These variables cannot be directly computed from a single measurement, but they can be estimated from a series of measurements.

When we receive a measurement M for a combination like R12, we already have some knowledge of the probable values of D1u and D2d, such as their averages and variances. That means we can compute the most likely split of the measurement, the one that maximizes the combined probability of (D1u=d, D2d=M-d). The measurement can also be used to refine the estimates of these probabilities.

The general solution requires interesting statistical processing, but we can make some simplifying hypotheses. For example, under a Gaussian hypothesis, we could estimate the one-way delays and their covariance using a Kalman filter. But even that might be too complicated for transport protocol implementations, which typically avoid floating point computation, let alone the linear algebra required by Kalman filters. TCP implementations model round-trip times using exponential filters to compute an average RTT and an average deviation. It is a big jump from that to a full-fledged Kalman filter.

In the picoquic implementation, I programmed a very simplified model that just uses TCP-level math. Basically:

  • Upon a measurement m:
    • If timestamps are present, use them to compute u and d such that u + d = m
    • If not, use the averages and deviations to approximate u and d:
      • predicted_m = average(u) + average(d)
      • delta = m – predicted_m
      • u = average(u) + delta*deviation(u)/(deviation(u) + deviation(d))
      • d = m – u
    • Update the averages and deviations of u and d using exponential filters, just like TCP
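As a sketch, the update logic above could be written as follows in Python. The class and parameter names are mine, not picoquic's, and the real code is in C and avoids floating point; this just illustrates the apportioning rule and the TCP-style exponential filters:

```python
# Sketch of the simplified delay-splitting logic described above.
# OneWayEstimator, alpha and beta are illustrative names, not picoquic's.

class OneWayEstimator:
    def __init__(self, alpha=1/8, beta=1/4):
        self.avg_u = 0.0  # smoothed one-way "up" delay
        self.avg_d = 0.0  # smoothed one-way "down" delay
        self.dev_u = 0.0  # smoothed deviation of u
        self.dev_d = 0.0  # smoothed deviation of d
        self.alpha = alpha  # gain for the averages, as in TCP's SRTT
        self.beta = beta    # gain for the deviations, as in TCP's RTTVAR

    def on_measurement(self, m, u=None):
        """m is the measured combined delay; u is the one-way 'up' delay
        if timestamps provided it, or None if it must be approximated."""
        if u is None:
            # No timestamp: split the prediction error between u and d
            # in proportion to their observed deviations.
            predicted_m = self.avg_u + self.avg_d
            delta = m - predicted_m
            total_dev = self.dev_u + self.dev_d
            share = self.dev_u / total_dev if total_dev > 0 else 0.5
            u = self.avg_u + delta * share
        d = m - u
        # Exponential filters, just like TCP's SRTT/RTTVAR computation.
        self.dev_u += self.beta * (abs(u - self.avg_u) - self.dev_u)
        self.dev_d += self.beta * (abs(d - self.avg_d) - self.dev_d)
        self.avg_u += self.alpha * (u - self.avg_u)
        self.avg_d += self.alpha * (d - self.avg_d)
        return u, d
```

Note that when both deviations have converged to zero, the split defaults to half of the prediction error on each side; the real initial conditions are, as noted, left as an exercise.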

I will leave initial conditions as an exercise for the reader. And I will leave to another blog the details on how exactly to use these delays for loss recovery and congestion control.

How many packet number spaces for QUIC Multipath?

I authored two drafts proposing two different solutions for Multipath QUIC: QUIC Multipath Negotiation Option; and, in collaboration with colleagues at Alibaba, Multipath Extension for QUIC. Apart from some details that could easily be aligned, the main difference is that the “negotiation option” maintains the property of QUIC Transport to have a single packet number space for all application packets, while the “multipath extension for QUIC” specifies that there will be a specific packet number space for each path. The implementation in picoquic described in a previous blog supports both options.

Supporting different number spaces per path is significantly harder to implement than keeping the single number space design adopted so far in QUIC. In my code, I use two configuration flags: one for the simple version, one for the version that uses multiple number spaces. The simple flag is referenced 13 times, the multiple number space flag 50 times. If I remove the initial declaration, the test code, the logging code and the negotiation of transport parameters, the simple flag is used in a total of 6 tests, while I find tests of the multiple number space flag at 41 places. All 6 tests that reference the simple flag also reference the multiple number space flag, for the following functions:

  • Select the “multipath” version of the packet scheduler to find which path to use for the next packet – 1 test.
  • Perform the “renew connection ID” logic for all paths, not just the “default” path used in “single path QUIC” – 2 tests.
  • Allow deletion of any failing path, instead of using special protection for the “default” path – 1 test.
  • Make the handling of the retransmit timer a bit more aggressive for multipath – 1 test.
  • Do not close the connection after retransmission failure if other paths are available – 1 test.

The other 35 flag tests are required to manage one number space per path. When packets are received, the acknowledgements are prepared in a “per path” structure instead of the “per application context” structure, the format of the acknowledgement frames changes, and the code must take care to send acknowledgements for all number spaces. Similarly, when packets are sent, the code needs a “per path” structure to keep track of packets in flight, manage retransmissions, and detect spurious retransmissions.

qvis traces of multipath session with single number space, showing expected packet sequence

In single path QUIC, the packet numbers are monotonically increasing and can be used as a pseudo clock. Removing that property has interesting consequences, as shown in the messed up representation of QUIC logs mentioned in a previous blog. (Using a single sequence number results in simpler visualizations, as seen in the picture above.) In many places, tests on sequence numbers must be replaced by tests on timestamps; in others, per-number-space sequence numbers must be used. For packet encryption, since per-path packet numbers are not unique, a 96-bit packet number is computed by concatenating the path ID and the packet number, and is then used to guarantee uniqueness of the AEAD nonce. This extra code required a parallel pull request to the Picotls library. Adding per-path sequence numbers adds a lot of code complexity.
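For illustration, here is a Python sketch of how such a 96-bit value could be combined with the AEAD IV to form the nonce. The exact byte layout in picoquic and picotls may differ; this assumes a 32-bit path ID, a 64-bit per-path packet number, and the 12-byte (96-bit) IV that QUIC's AEAD algorithms use:

```python
def multipath_aead_nonce(iv: bytes, path_id: int, packet_number: int) -> bytes:
    """Build an AEAD nonce from a 96-bit unique value: a 32-bit path ID
    concatenated with a 64-bit per-path packet number, XORed with the IV.
    Layout is illustrative, not necessarily picoquic's exact encoding."""
    assert len(iv) == 12  # QUIC AEAD IVs are 96 bits
    unique = (path_id << 64) | packet_number
    unique_bytes = unique.to_bytes(12, "big")
    return bytes(a ^ b for a, b in zip(iv, unique_bytes))
```

Because the path ID occupies the high-order bits, two packets carrying the same per-path number on different paths still produce distinct nonces.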

The additional complexity has a direct implication on test coverage. Most of the code paths used in the single number space variant are also exercised by the single path tests. In that variant, additional tests are only required for multipath execution. In contrast, the multiple number space variant creates new code paths even for single path scenarios, such as 0-RTT transmission. To maintain test coverage, I will need to repeat most single path tests with the multiple number space option enabled.

The main argument for adding a number space per path is efficiency: efficiency of the packet encoding, and specifically of the encoding of acknowledgements; and efficiency of the loss recovery processes. If packets are numbered “per path”, we can adopt an unmodified per-path version of the RACK algorithm used by QUIC for packet loss detection. That is true, and this is an area in which the simple multipath option requires specific code changes. When detecting packet losses, the RACK algorithm compares the packet’s sequence number and send time to those of the last packet acknowledged. If the packet was sent before the most recently acknowledged packet but has not been acknowledged yet, that packet was either lost or delivered out of order. RACK checks both time differences and sequence number differences against thresholds, and if those thresholds are exceeded the packet can safely be considered lost.

If the paths have different latencies, the simple multipath option will result in out-of-order deliveries, with time differences and sequence number differences that can widely exceed what RACK expects. In my implementation, I fixed that by keeping track of the path over which each packet was sent, and of the last packet acknowledged on each path. When deciding whether a packet should be declared lost, the code compares times and sequence numbers to the last packet acknowledged on the same path, not to the last packet acknowledged overall.
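A per-path version of the RACK test could look like the following sketch. The function name and threshold values are illustrative, not picoquic's; real implementations derive the time and reordering thresholds adaptively:

```python
def is_presumed_lost(pkt_seq, pkt_send_time, last_acked_seq, now, rtt,
                     time_factor=9/8, reorder_threshold=3):
    """RACK-style loss test, comparing a packet against the most recently
    acknowledged packet on the *same path* (illustrative thresholds).
    A packet sent before the newest acked packet on its path is presumed
    lost once it is old enough or far enough behind in sequence."""
    if pkt_seq >= last_acked_seq:
        return False  # not sent before the newest acked packet on this path
    time_gap = now - pkt_send_time
    seq_gap = last_acked_seq - pkt_seq
    return time_gap > time_factor * rtt or seq_gap >= reorder_threshold
```

The key point is that both `last_acked_seq` and `rtt` are tracked per path, so a slow satellite path cannot trigger spurious loss declarations for packets sent on a fast terrestrial path.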

The other potential issue with the simple multipath version is the size of the acknowledgement packets. Out-of-order deliveries translate into many “holes” in the list of packets received. QUIC acknowledgement frames can encode multiple ranges of received packets, but there are limits to how many ranges we want to encode per acknowledgement. Performance functions like UDP GSO mitigate that somewhat, as the code will send trains of consecutive packets from the same path, reducing the overall number of ranges. However, there might still be a problem when there is a large discrepancy between the latencies of the two paths, for example when balancing traffic between a terrestrial and a satellite link.

The extent of that problem is not clear. The current test suite includes the case of a 10 Mbps satellite link in parallel with a 1 Mbps terrestrial link, and the results for both multipath variants are similar. It is possible that different test configurations will produce different results, but at this stage that is speculation. It is also possible that simple tweaks in the transmission of acknowledged ranges or in the scheduling of packets would solve the issue. There will be some added complexity, but I believe this added complexity will be much less than the overall complexity of managing multiple packet number spaces. That same complexity would also be required in the number-per-path version if we want to support zero-length connection IDs, which are an important feature.

In QUIC, the length of the connection ID is chosen by the destination node. Servers generally require connection IDs that are long enough to support proper load balancing in a server farm. Clients that do not establish multiple connections to the same server do not need that, because they can identify the connection context from the combination of source and destination addresses and ports. Such clients can decide to use zero-length connection identifiers. There is then less overhead in packets received from the server, resulting in slightly better performance. Imposing the use of full connection IDs in multipath communication would bring that overhead back, resulting in a performance trade-off.

In conclusion, we have a trade-off. Managing separate number spaces for each path increases the complexity of the implementation but may result in slightly more efficient management of acknowledgements. Keeping a single number space for all paths is much simpler but may require a few tweaks in the management of acknowledgements. The single number space variant supports the use of zero-length connection identifiers, while the separate number space variant does not. The multiple space variant requires adding a number of code paths that are not exercised by the “single path” tests, which has an effect on code safety. Overall, after implementing both options, I prefer the single-space variant, but I am waiting to hear the opinions of other developers.

Logging and visualizing multipath QUIC connections

In the description of the multipath implementation in Picoquic, I mentioned that I struggled with the absence of multipath support in the “QLOG” format and the associated tools. Maybe developers have been spoiled by too many good tools, but we have come to rely on tools like “qvis” to visualize the logs of QUIC connections, showing for example nice graphical representations of the flows of packets. But those tools are not designed to handle packets sent over multiple paths, especially when packet number spaces are specific to each path. When I try to feed a multipath trace into QVIS, the result is a confused mess.

Example of mismatched ladder diagram with lines going every which way

The image above shows what I get when applying the “sequence” option in QVIS to the QLOG of a multipath connection. In theory, the vectors join the point in time when a packet was sent to the point in time when it was received, left to right for messages sent by clients, right to left for messages sent by servers. The algorithm evaluates sending and receiving time solely based on packet sequence numbers, and we see errors such as matching the sending of packet #0 on path 1 with the receiving of packet #0 on path 0. Similar issues happen when using the “congestion” option, with congestion signals for path 0 and path 1 conflated on the same sets of erratic lines.

I discussed this with Robin Marx, who developed the QLOG format and the tools. He suggested a simple modification to the QLOG format: when logging events in the log, document the number of the path as part of the event header. That seems like a very good idea, and I have since implemented it in the Picoquic code. It is also the first step towards better visualizations, such as plotting congestion control graphs separately for each path, or matching sending and receiving events per paths in the ladder diagrams. I trust that Robin Marx will soon fix that, and that we can go on easily using QLOG for debugging issues with QUIC multipath.

Estimating round trip and one way delays in multipath QUIC sessions

Yesterday, I provided a broad-brush description of the implementation of multipath QUIC in picoquic. Today, I will present a bit more detail on a specific point: the measurement of transmission delays. Just like TCP, QUIC implementations use round trip time (RTT) estimates in many ways:

  1. For loss detection, RTT estimates are used to evaluate how long to wait for an acknowledgement before a packet is declared lost.
  2. Several congestion control algorithms (e.g., LEDBAT, VEGAS, HYSTART) use variations of the RTT above the minimum value to detect the beginning of congestion.
  3. Other congestion control algorithms (e.g., BBR) use the minimal RTT to compute the minimal size of the congestion window for a target data rate.
  4. ACK delays are often set as a fraction of the RTT.

In TCP, or in “single path” versions of QUIC, the RTT is estimated by comparing the time at which a packet is acknowledged to the time at which it was sent. With multipath versions of QUIC, this simple definition is insufficient, because the packets and their acknowledgements are not always sent on the same path. Consider for example the test case in which the first path is a low bandwidth, low delay “terrestrial” path, and the second path is a high bandwidth, high delay “satellite” path. Assume that the one-way delay is 10 ms on the terrestrial path, and 300 ms on the satellite path. Depending on the circumstances, we can experience different “round trip” times:

  • If the packet and the acknowledgement are sent on the terrestrial path, the RTT will be 20 ms plus queuing delays,
  • If the packet and the acknowledgement are sent on the satellite paths, the RTT will be 600 ms plus queuing delays,
  • If one is sent on the terrestrial path and the other on the satellite path, the RTT will be 310 ms plus queuing delays.

For QUIC, the transport specification states that the RTT is measured “per path”, so we know that we should maintain separate statistics for the satellite and the terrestrial link. But even that simple specification is ambiguous. For example, the satellite path RTT will be computed by mixing some measurements of about 600 ms with other measurements of about 310 ms, leading to some confusion.

One way to remove that confusion would be to mandate that acknowledgements always be sent on the path on which the packets were received, but this is suboptimal. Receiving acknowledgements through the shorter path results in a shorter control loop, and thus better performance. Also, there is sometimes a need to send acknowledgements for packets received on a path that has since been abandoned. The alternative would be to not acknowledge these packets, but that would cause undue retransmissions.

In the picoquic implementation, I use timestamps as defined in the QUIC timestamp extension draft. Since each acknowledgement carries a timestamp, the code can estimate the two one-way delays: from send time to timestamp one way, from timestamp to acknowledgement time the other way. With that, the code can maintain one-way delay estimates for each path and each direction, and also estimates of the delay deviation for each one-way path. When timestamps are not available, the code uses the estimates of delay deviations to apportion delays to each path, using formulas similar to those of a Kalman filter.
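The timestamp-based split can be sketched in a few lines of Python (illustrative names; all times in the same unit, with the timestamp expressed in the peer's clock):

```python
def split_rtt(send_time, ack_receive_time, peer_timestamp):
    """Split a round trip into two one-way delay samples using the
    timestamp carried in the acknowledgement (names are illustrative).
    Each sample includes the unknown phase offset between the local and
    remote clocks, but variations over time remain meaningful, and the
    offset cancels out in the sum of the two directions."""
    one_way_up = peer_timestamp - send_time           # local send -> remote receive
    one_way_down = ack_receive_time - peer_timestamp  # remote send -> local receive
    return one_way_up, one_way_down
```

This is why, as discussed for the earlier post, the phase difference between clocks does not need to be resolved before using these samples.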

In practical deployments there is a degree of uncertainty in these measurements, because timestamps are not derived from synchronized global clocks. The code must deal with a phase uncertainty and with a possible clock skew, as explained in the timestamp extension draft. In practice, the phase uncertainty is not a problem, because delay-based decisions either consider only variations in one-way delays, or the sum of the one-way delays in both directions. The phase uncertainty does not affect the measurement of variations, and it cancels out when adding measurements from the two directions. Clock skew is a bigger problem, but it mostly affects the few connections that last a long time. The solution is probably to implement a simplified version of the NTP algorithms. Another item for the to-do list.

Computing one-way delays requires getting series of delay measurements covering various combinations of paths. This does not always happen naturally, for example in client-server scenarios in which the server sends data packets and the client mostly sends acknowledgements. By default, the server will not send acknowledgements of ack-only packets, and the client will not get enough measurements to maintain valid delay estimates. In the picoquic implementation, this is solved by frequently adding PING frames to acknowledgement packets.

Delay measurements are used in the code in a variety of ways. The picoquic implementation of HYSTART tracks the evolution of one-way delays during slow start. The bandwidth estimation in BBR measures the amount of data received between the timestamps at the beginning and end of a measurement period. The delay measurements are also used to compute packet loss timers, acknowledgement delays, and reordering delays. In all these areas, the multipath implementation uses conservative tradeoffs. Packet loss timers are based on worst-case scenarios, combining the one-way delay on the sending path and the worst one-way delay among all possible acknowledgement paths. The acknowledgement delay is computed from the RTT of the fastest path. The initial value of reordering delays is set to a fraction of the one-way delay of the sending path. Some tuning will certainly be needed in the future.
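The worst-case loss timer computation described above could be sketched as follows (hypothetical function and names; a real implementation would also add a deviation margin and enforce a minimum timeout):

```python
def worst_case_rtt(send_path_up, ack_paths_down):
    """Worst-case RTT for a packet sent on one path when the
    acknowledgement may return on any of several paths.
    send_path_up: smoothed one-way delay on the sending path.
    ack_paths_down: smoothed return one-way delays of candidate ack paths.
    (Illustrative sketch of the conservative tradeoff, not picoquic's code.)"""
    return send_path_up + max(ack_paths_down)
```

With the terrestrial/satellite example above (10 ms and 300 ms one-way), a packet sent on the terrestrial path gets a 310 ms worst-case budget, since its acknowledgement might come back over the satellite link.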
