Network offloading promises higher performance (less latency) and reduced resource usage (the CPU hands work off to the network card). Sadly, that's not the whole story - to achieve these goals, race conditions with userspace are introduced, and the network card firmware assumes responsibility for the whole (complicated!) TCP/IP handling.
Here are some symptoms that we could trace back to a performance "feature" - network offloading. Please note that this list is not exhaustive; it's just what we observed, reproduced, and could diagnose.
With a "normal" TCP/IP stack in the kernel (Linux, Windows, ...), the flow to send data over an existing TCP/IP connection is typically:

1. Userspace prepares the data in one or more buffers
2. Userspace passes these buffers to the kernel via a system call (send(), write(), ...)
3. The kernel copies the data into its own buffers, computes the checksums, and splits the data into MTU-sized segments
4. The system call returns to userspace
5. The kernel hands the finished packets to the network card, which puts them on the wire
The contract here is that (unless explicit zerocopy sending is requested) userspace is free to re-use the buffers as soon as the kernel call (step 3) returns (step 4), because the kernel has made its own copy of the data.
When network offloading is active (eg. tcp-segmentation-offload, tx-checksumming), the flow changes:

1. Userspace prepares the data in one or more buffers
2. Userspace passes these buffers to the kernel via a system call (send(), write(), ...)
3. The kernel only queues descriptors referencing the data for the network card; checksumming and segmentation are deferred to the hardware
4. The system call returns to userspace
5. At some later point the network card fetches the data, computes the checksums, splits the data into segments, and sends them out
The problem is that userspace may modify the buffers after step 4 -- but, and that is the tricky part, depending on the network load the hardware might be delayed in performing step 5! So most of the time (read: when running functional tests) everything works as expected, but as soon as you generate more network load (often in production...) things go haywire. For the userspace application there's no hint whether or not the hardware has finished processing!
The overall result is that a mix of old and new data bytes gets sent out; if you're lucky, a higher-level protocol (TLS in userspace, or some application processing, eg. during backups) provides an independent checksum that can detect the corruption and at least break the connection. If you're unlucky, you process wrong data.
In theory, the kernel could wait until the hardware has fetched the data and only then return to userspace; but this wouldn't improve latency, it would just trade some CPU cycles for waiting on the IO to complete.
When a receiver processes data more slowly than the sender can provide it, a so-called TCP zero window can tell the sender to stop pushing packets.
Sadly, not all network adapters understand that in their TCP offloading firmware; this means that the sending adapter (which sees no acknowledgement frames, as none are sent) just keeps repeating the same packets! In one of our tests this amounted to 50% additional traffic.
Ironically this bug is a kind of misfeature - prematurely sending out frames before the receiver asks for them causes some of them to arrive just at the right time, so this can actually make data transmission faster by avoiding some round-trip delays. Whether this is worth the additional bandwidth depends on the load on your line, whether you pay by data volume or by time span, the net time savings, etc.
Depending on the actual paths that network packets travel, they might encounter different transmission lines with different characteristics. One such characteristic is the MTU, ie. the maximum packet size (including all headers). Now, TCP has the MSS option value to announce the local MTU -- but network paths may change during the lifetime of a TCP connection, and some traversal protocols (IPSec, VPN tunnels, PPP, ...) may require smaller packets and/or additional header bytes, reducing the space for the actual data payload.
To avoid (pessimistic) manual setup on all machines, there's Path Maximum Transmission Unit Discovery to dynamically adjust the MTU down as needed. The ICMP packets it relies on are sometimes dropped by a firewall - or, and here we get back to offloading, the adapter firmware just doesn't implement PMTUd (or the feature is broken because it was never tested). Sadly, for TCP connections that set the Don't Fragment bit in the IP header this is a detail that has to be correctly implemented (unless you're only using a few hosts whose MTU you manage manually), or your TCP connections will hang.
And yes, this is again load-dependent: under low network load it may work fine, but with higher network load (so packets get pushed to hardware after a longer delay) and/or packet retransmissions, an adapter may decide to create bigger segments than provided by userspace or the kernel. If there's a gap between the actual PMTU and the adapter's expectations, the bigger packet will get dropped somewhere along the line - and with buggy PMTUd the only possibility left is to run into TCP timeouts...
You have a few options; the best one is simply disabling network offloading.
Another possibility is to switch the network adapter and/or driver; while that sounds stupid, in VMware environments it might be as simple as using an NE2000 emulation instead of VMXNet3 for the VM (though that will reduce performance).
Also, you can reduce the probability of these problems by spreading the network load out over time, eg. (on Linux) via a netem delay and/or other qdiscs.
If you wrote the software yourself, don't reuse network buffers immediately.
I'd be interested in hearing about your network troubleshooting; any other feedback is welcome too, of course.