Having fun and surprises with IPv6

(Corrected March 4, 2018)

I am participating in the standardization of the QUIC protocol. That’s why I am writing a prototype implementation of the new transport protocol: Picoquic. And the development involves regular testing against other prototypes, the results of which are shown in this test result matrix. This is work in progress on a complex protocol development, so when we test we certainly expect to find bugs and glitches, either because the spec is not yet fully clear, or because somebody did not read it correctly. But this morning, I woke up to find an interesting message from Patrick McManus, who is also developing an implementation of QUIC at Mozilla. Something is weird, he said. The first data message that your server sends, with sequence number N, always arrives before the final handshake message, with sequence number N-1. That inversion appears to happen systematically.

We both wondered for some time what kind of bug this was, until finally we managed to get a packet capture of the data.

And then, we were able to build a theory. The exchange was over IPv6. Upon receiving a connection request from Patrick’s implementation, my implementation was sending back a handshake packet. In QUIC, the handshake packet responds to the connection request with a TLS “server hello” message and associated extensions, which set the encryption keys for the connection. Immediately after that, my implementation was sending its first data packet, which happens to be an MTU probe. This is a test message with the largest plausible length. If it is accepted, the connection can use messages of that length. If it is not, it will have to use shorter packets.

It turns out that the messages that the implementation was sending were indeed a bit long, and they triggered an interesting behavior. My test server runs in a virtual machine in AWS. AWS exposes an Ethernet interface to virtual machines, but applications cannot use the full Ethernet packet length, 1536 bytes. The AWS machinery probably uses some form of tunneling, which reduces the available size. The test message that I was sending was 1518 bytes long, which is larger than the 1500-byte MTU. The IPv6 fragmentation code in the Linux kernel splits it in two: a large initial fragment, 1496 bytes long, and a small second fragment, 78 bytes long. You could think that fragmentation is no big deal, since fragments would just be reassembled at the destination, but you would be wrong.
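
The two fragment sizes follow from the IPv6 fragmentation arithmetic. Here is a minimal sketch in Python, assuming that the 1518 bytes count the whole IPv6 packet including its 40-byte header, and that there are no other extension headers. The function name is made up for the illustration; this is not the kernel's code.

```python
IPV6_HEADER = 40      # fixed IPv6 header, in bytes
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header, in bytes


def fragment_sizes(packet_len, mtu):
    """Return the on-wire size of each IPv6 fragment of one packet."""
    if packet_len <= mtu:
        return [packet_len]          # fits, no fragmentation needed
    payload = packet_len - IPV6_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    chunk = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        part = min(chunk, payload)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + part)
        payload -= part
    return sizes


print(fragment_sizes(1518, 1500))   # [1496, 78]
print(fragment_sizes(590, 1500))    # [590]
```

With a 1500-byte MTU, the first fragment can carry 1448 bytes of payload, which leaves 30 bytes for the second one. Adding the 48 bytes of headers to each gives the 1496 and 78 bytes seen in the capture.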

Some routers on the path try to be helpful. They have learned from past experience that short packets often carry important data, and so they try to route them faster than long data packets. And here is what happens in our case:

  • The server prepares and sends a Handshake packet, 590 bytes long.
  • The server then prepares the MTU probe, 1518 bytes long.
  • The MTU probe is split into fragment 1, 1496 bytes, and fragment 2, 78 bytes.
  • The handshake and the long fragment are routed on the normal path, but the small fragment is routed at a higher priority level.
  • The Linux driver at the destination receives the small fragment first. It queues everything behind that until it receives the long fragment.
  • The Linux driver passes the reassembled packet to the application, which cannot do anything with it because the encryption keys can only be obtained from the handshake packet.
  • The Linux driver then passes the handshake packet to the application.

And it took the two of us the best part of the day to explore possible failure modes in our code before we understood that. Which confirms an old opinion. When routers try to be smart and helpful, they end up being dumb and harmful. Please just send the packets in the order you get them!

More explanations, after 1 day of comments and further inquiries

It seems clear now that the fragmentation happened at the source, in the Linux kernel. This leaves one remaining issue, the out of order delivery.

There are in fact two separate out of order delivery issues. One is having the second fragment arrive before the first one, and the second is having the MTU probe arrive before the previously sent Handshake packet. The inversion between the fragments may be due to code in the Linux kernel that believes that sending the last fragment first speeds up reassembly at the receiver. The inversion between the fragmented MTU probe and the Handshake packet has two plausible causes:

  • Some router on the path may be forwarding the small fragment at a higher priority level.
  • Some routers may be implementing equal cost multipath, and then placing fragmented packets and regular packets into separate hash buckets.
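
The second cause can be illustrated with a toy model. This is only a sketch of what a hypothetical router might do, not the behavior of any specific product: the field names, the addresses, the ports and the choice of CRC32 as a hash are all assumptions made for the example.

```python
import zlib

UDP = 17        # IPv6 next header value for UDP
FRAGMENT = 44   # IPv6 next header value for the Fragment header


def flow_key(pkt):
    """Fields that a hypothetical ECMP router feeds to its hash."""
    if pkt["next_header"] == UDP:
        # Regular UDP packet: the ports are visible, hash the 5-tuple.
        return (pkt["src"], pkt["dst"], UDP, pkt["sport"], pkt["dport"])
    # Fragment: only the first fragment carries the UDP ports, so a router
    # that wants to keep all fragments together hashes on addresses only.
    return (pkt["src"], pkt["dst"], pkt["next_header"])


def pick_path(pkt, n_paths=4):
    """Map a packet to one of n_paths equal cost paths."""
    return zlib.crc32(repr(flow_key(pkt)).encode()) % n_paths


base = {"src": "2001:db8::1", "dst": "2001:db8::2"}
handshake = dict(base, next_header=UDP, sport=4433, dport=50000)
fragment1 = dict(base, next_header=FRAGMENT)
fragment2 = dict(base, next_header=FRAGMENT)

print("handshake path:", pick_path(handshake))
print("fragment paths:", pick_path(fragment1), pick_path(fragment2))
```

The two fragments always share a hash key, but that key differs from the one computed for the Handshake packet. Nothing guarantees that both keys map to the same path, and two paths with different queuing delays are enough to invert the packets.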

The summary for developers, and for QUIC in particular, is that we should really avoid triggering IPv6 fragmentation. It can lead to packet losses when NATs and firewalls cannot find the UDP payload type and the port numbers in the fragments. And it can also lead to out of order delivery as we just saw. And for my own code, the lesson is simple. I really need to set up the IPv6 Don’t Fragment option when sending MTU probes, per section 11.2 of RFC 3542.
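
For illustration, here is roughly what setting that option looks like with the Python socket API. This is a sketch under assumptions, not the Picoquic code: the helper names are made up, and the fallback value 62 is the Linux definition of IPV6_DONTFRAG, used only if the Python build does not export the constant. The behavior of the send call is what I expect on Linux; other systems may differ.

```python
import errno
import socket

# IPV6_DONTFRAG comes from RFC 3542. The fallback 62 is the Linux value and
# is an assumption that only holds on Linux.
IPV6_DONTFRAG = getattr(socket, "IPV6_DONTFRAG", 62)


def open_probe_socket():
    """Open a UDP/IPv6 socket that asks the stack not to fragment."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IPV6, IPV6_DONTFRAG, 1)
    return sock


def send_probe(sock, payload, addr):
    """Send one MTU probe. Return False if the stack says it is too big."""
    try:
        sock.sendto(payload, addr)
        return True
    except OSError as err:
        if err.errno == errno.EMSGSIZE:
            return False   # larger than the local MTU, try a smaller probe
        raise
```

With the option set, a probe that exceeds the local MTU should be refused with EMSGSIZE instead of being silently split in two, and the sender can fall back to a shorter probe.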

 

About Christian Huitema

I have been developing Internet protocols and applications for about 30 years. I love to see how the Internet has grown and the applications it enabled. Let's keep it open!

12 Responses to Having fun and surprises with IPv6

  1. J Weiss says:

    Could this be just the result of equal cost multi path routing ? The fragment can not
    be matched to the udp flow, as it does not contain udp port numbers. So it may take
    another path.

    • Yes, that’s certainly possible. The payload type of the fragments is set to “fragmentation” instead of UDP, so fragments and UDP packets may very well end up being routed on different paths.

  2. Eric Biederman says:

    IPv6 should not fragment by default. You might look for ip options to control the fragmentation behavior.

    I know linux when it fragments packets sends them out in reverse order as there are efficiencies to be gained both in fragmentation and reassembly. So the tail fragment may be ahead of the first fragment for that reason.

    • You may be onto something. We are still investigating whether the fragmentation happens in the Linux sender or somewhere in the network. Any pointer to the documentation of the IPv6 option?

      • Ben Hutchings says:

        IPv6 routers are not allowed to fragment packets. Fragmentation should only be done by the sending host.

  3. Robert Edmonds says:

    Is this really saying that an unfragmented IPv6 packet was sent by the source host, and a fragmented IPv6 datagram was received by the destination host?!

    Also, it’s not totally clear what layers are being included in the byte counts. Are you including Ethernet headers? An untagged 1518 byte Ethernet frame would somehow have 1504 bytes of IP payload? Unless the 1518 bytes includes a VLAN tag and 1500 bytes of IP payload?

  4. In IPv6, fragmentation is end-to-end. Intermediate routers do not fragment or reassemble packets.

    Since QUIC is essentially reinventing TCP, you should probably do what TCP does and ask the socket API to set the DF (Don’t Fragment) bit.

    That way, you will (hopefully) receive an ICMP Packet Too Big response, telling you what the MTU is, and you can resize your subsequent packets accordingly.

    • Yes. On the other hand, QUIC is an encrypted transport, and ICMP too big are clear text packets that can easily be spoofed. So we have to strike a balance between converging quickly with ICMP, and getting trusted information from encrypted acknowledgements.

  5. romanrm says:

    > When routers try to be smart and helpful, they end up being dumb and harmful. Please just send the packets in the order you get them!

    Please, don’t design a dumb and harmful protocol which breaks when others do something they just happen to do normally. Out of order receipt is a thing to be expected in a global network, if your QUIC breaks when it happens, then it’s back to the drawing board for **YOU**, not for someone to magically redesign all routers along the path to babysit your brittle crap.

    • QUIC does of course tolerate out of order delivery in general. But when that happens during the handshake, it becomes problematic because the receiver does not yet have the keys needed to decrypt the out of order packet. Of course, it could queue the packets for some time until the handshake packet arrives, but that opens a denial of service vulnerability.

      But yes, the general design is to work on the network we do have, not the network we wish we had. So if some packet patterns are shown to trigger unwanted network behavior, we will try to avoid them. For example, if measurements show that packets of size less than X get delivered out of order, we will probably just pad packets to size X+P. Or, if we observe intermediate boxes doing strange stuff with information in the packet, then we will encrypt that.

  6. snarly neighbor says:

    Seems ironic to me that this whole issue and discussion is posted on an IPv4-only website. It would be great if you move this blog someplace that actually provides IPv6…wordpress does *not*.
