Network offloading promises higher performance (less latency) and reduced resource usage (the CPU hands work off to the network card). Sadly, that's not the whole story - to achieve these goals, race conditions with userspace are introduced, and the network card firmware assumes responsibility for the whole (complicated!) TCP/IP handling.
Here are some symptoms that we could trace back to a performance "feature" - network offloading. Please note that this list is not exhaustive; it's just what we observed, reproduced, and could diagnose.
With a "normal" TCP/IP stack in the kernel (Linux, Windows, ...), the flow to send data over an existing TCP/IP connection is typically:

1. Userspace prepares the data in one or more buffers
2. Userspace passes these buffers to the kernel via a system call (send(), write(), ...)
3. The kernel copies the data into its own buffers, computes the checksums, and splits the data into MTU-sized segments
4. The system call returns to userspace
5. The kernel hands the finished packets to the network card, which puts them on the wire
The contract here is that (unless explicit zerocopy sending is requested) userspace is free to re-use the buffers as soon as the kernel call (step 3) returns (step 4), because the kernel has made its own copy of the data.
When network offloading is active (eg. tcp-segmentation-offload, tx-checksumming), the flow changes:

1. Userspace prepares the data in one or more buffers
2. Userspace passes these buffers to the kernel via a system call (send(), write(), ...)
3. The kernel only queues descriptors referencing the data for the network card; checksumming and segmentation are deferred to the hardware
4. The system call returns to userspace
5. At some later point the network card fetches the data, computes the checksums, splits the data into segments, and sends them out
The problem is that userspace may modify the buffers after step 4 -- but, and that is the tricky part, depending on the network load the hardware might be delayed in performing step 5! So most of the time (read: when running functional tests) everything works as expected, but as soon as you generate more network load (often in production...) things go haywire. For the userspace application there's no hint whether or not the hardware has finished processing!
The overall result is that a mix of old and new data bytes gets sent out; if you're lucky, a higher-level protocol (TLS in userspace, or some application processing, eg. during backups) provides an independent checksum that can detect the corruption and at least break the connection. If you're unlucky, you process wrong data.
In theory, the kernel could wait until the hardware has fetched the data and only then return to userspace; but this wouldn't improve latency, it would just trade some CPU cycles for waiting on the IO to complete.
When a receiver processes data more slowly than the sender can provide it, a so-called TCP zero window can tell the sender to stop pushing packets.
Sadly, not all network adapters understand that in their TCP offloading firmware; this means that the sending adapter (which sees no acknowledgement frames, as none are sent) just keeps repeating the same packets! In one of our tests this amounted to 50% additional traffic.
Ironically this bug is a kind of misfeature - prematurely sending out frames before the receiver asks for them causes some of them to arrive just at the right time, so this can actually make data transmission faster by avoiding some round-trip delays. Whether this is worth the additional bandwidth depends on the load on your line, whether you pay by data volume or by time span, the net time savings, etc.
Depending on the actual paths that network packets travel, they might encounter different transmission lines with different characteristics. One such characteristic is the MTU, ie. the maximum packet size (including all headers). Now, TCP has the MSS option value to announce the local MTU -- but network paths may change during the lifetime of a TCP connection, and some traversal protocols (IPSec, VPN tunnels, PPP, ...) may require smaller packets and/or additional header bytes, reducing the space for the actual data payload.
To avoid (pessimistic) manual setup on all machines, there's Path Maximum Transmission Unit Discovery to dynamically adjust the MTU down as needed. The ICMP packets it relies on are sometimes dropped by a firewall - or, and here we get back to offloading, the adapter firmware just doesn't implement PMTUd (or the feature is broken because it was never tested). Sadly, for TCP connections that set the Don't Fragment bit in the IP header this is a detail that has to be correctly implemented (unless you're only using a few hosts whose MTU you manage manually), or your TCP connections will hang.
And yes, this is again load-dependent: under low network load it may work fine, but with higher network load (so packets get pushed to hardware after a longer delay) and/or packet retransmissions, an adapter may decide to create bigger segments than provided by userspace or the kernel. If there's a gap between the actual PMTU and the adapter's expectations, the bigger packet will get dropped somewhere along the line - and with buggy PMTUd the only possibility left is to run into TCP timeouts...
You have a few options; the best one is simply disabling network offloading.
Another possibility is to switch the network adapter and/or driver; while that sounds stupid, in VMware environments it might be as simple as using an NE2000 emulation instead of VMXNet3 for the VM (though that will reduce performance).
Also, you can reduce the probability of these problems by spreading the network load out over time, eg. (on Linux) via a netem delay and/or other qdiscs.
If you wrote the software yourself, don't reuse network buffers immediately.
I'd be interested in hearing about your network troubleshooting; any other feedback is welcome too, of course.