This series will cover some of the notes I’ve been gathering on networking performance over the past few years. Some of them will be specific to Azure, but still useful in other environments.
Long gone are the days of having one 100Mbps or one 1Gbps NIC on your server. These days, even in the cloud, we require and enjoy much higher network speeds.
These performance requirements surely put a strain on other components such as the CPU. Many NICs (especially enterprise / server-grade NICs) support offloading work from the CPU to themselves. This way higher speeds can be achieved without CPU bottlenecks, but…
How does it all work in a virtualised environment like the cloud?
We know that the NICs on the VMs (vNICs for short) are just more software, so any work done by software has to be actually done by the CPU, right? Does this impact the CPU resources on the physical host where the VMs are located? Well, of course it does.
However, the reality is that you can still offload work to the physical NICs on the host by using technologies like SR-IOV. Let’s have a look in more detail.
SR-IOV effectively enables the guest OS of a VM to use resources of the underlying physical NIC, thus bypassing the data-path between the guest OS driver and the vmSwitch running on the host OS.
In the figure above you can see the NetVSC in Child Partition 1 (the vNIC) and the NetVSP in the Parent Partition (the vmSwitch) connected over the VMBus. This is all handled in software, so moving packets across this data-path requires CPU cycles.
On the other hand, you can also see how the NetVSC in Child Partition 2 (vNIC) bypasses the data-path towards the vmSwitch completely by sending packets directly to the physical NIC, thus saving CPU cycles by letting the physical NIC take care of handling the packets. This traffic is sent over a Virtual Function (VF). Think of a VF in this scenario as a comms channel between the physical NIC and the vNIC.
Pros and Cons of offloading
The most obvious reason to offload work to a physical NIC is to save CPU cycles, but how are these cycles saved, and how beneficial is it really? By how much exactly will we lower our network-related CPU usage?
These are questions that can only be answered with benchmarking and comparisons on a case-by-case basis.
Identifying the right metrics
It depends not only on the amount of traffic (let’s say throughput) but also on other factors such as the packets per second (PPS) or the (newly created) connections per second (CPS) rate.
Imagine an online platform for tens of thousands of brokers who need to stay updated on stock prices in near real time. If all the traffic for updating stock values goes through the same pipe, you are exposing your system to a rate of tens of thousands of PPS. These will potentially be small packets: IP and transport headers, some application headers and then a tiny payload of only a few bytes. So for packets of e.g. 100 bytes and a PPS rate of 100,000 we are barely reaching a throughput of 10 MByte/s, and yet we could be putting a strain on our VM to the point of slowing down other processes and/or losing packets, just because handling each packet across the whole data-path consumes a number of CPU cycles.
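As a quick sanity check on the arithmetic above, a few lines of Python (using the example’s figures of 100-byte packets at 100,000 PPS):

```python
def throughput_mbytes_per_s(packet_size_bytes: int, pps: int) -> float:
    """Throughput in MByte/s for a given packet size and packets-per-second rate."""
    return packet_size_bytes * pps / 1_000_000

# 100-byte packets at 100,000 PPS: only ~10 MByte/s of throughput,
# yet each of those 100,000 packets per second costs CPU cycles to handle.
print(throughput_mbytes_per_s(100, 100_000))  # 10.0
```

Note how low the throughput figure is: the CPU cost is driven by the packet count, not by the byte count.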
It is paramount that you identify your traffic profiles and the right metrics to understand how you can improve your infrastructure’s performance.
Dealing with high packets per second (PPS) rates
Assuming these updates are sent over long-lived TCP connections (à la WebSockets), or assuming we use UDP (and, based on the 5-tuple, our system identifies packets as belonging to the same stream), we have a scenario with a very high PPS rate but either a low or irrelevant CPS rate. This is a scenario that can thoroughly benefit from offloading traffic via SR-IOV.
The advantages are as follows:
- Fewer CPU interrupts (on the physical host!) to handle packets.
- No impact on other processes that require CPU cycles.
- Latency is reduced, as packets are not sitting in a software ring buffer waiting to be handled by the CPU.
- Jitter is also reduced, because the physical NIC’s performance is more predictable / stable than the time a packet has to wait to be handled by a busy CPU.
Now imagine a second part of this platform where the brokers can use the platform to send operations. In this case we’re (generally) not dealing with a limited number of streams with a high rate of updates; instead we’re dealing with discrete operations that probably each require at least a new connection. For all we know, in this scenario with tens of thousands of brokers sending requests to our platform (maybe even in an automated fashion against our API), what we are facing is tens or hundreds of thousands of new connections per second. We may not really see the advantages of offloading in this second scenario. Why?
Dealing with high connections per second (CPS) rates
Let me first clarify that when we talk about CPS we actually mean new connections created per second; we just say connections per second, or CPS, for language economy.
There is a reason why the traffic is designed to go through a vmSwitch between the physical network and the VM: traffic filtering. In fact, Microsoft calls their Azure vmSwitch VFP, an acronym for Virtual Filtering Platform. As an example, when a packet has to be evaluated against a firewall rule (NSG in Azure, SG in other providers) it needs to go through the whole software path, as per the diagram we have seen before. This not only consumes the expected CPU cycles to move the packet around, but also requires CPU cycles to evaluate the packet’s metadata (its 5-tuple) against your NSG/SG ruleset and then apply whichever action results from the evaluation (route, NAT, drop, forward, etc.).
If this is how it really (roughly speaking ;-)) works, how come in the previous example we improved our CPU usage, latency and jitter with SR-IOV offloading? Let me explain it with a simple two-step process:
- Either after the initial packet (UDP) or after a complete 3-way handshake (TCP), we add an entry to a flow table so as not to evaluate any further packets of that connection (i.e. that 5-tuple). Note: this is why UDP hole punching is a thing.
- All the subsequent packets that match a flow in the table, as they arrive from either side (VM or external endpoint), are put directly into the fast path that bypasses the software data-path.
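The two-step process above can be sketched in a few lines of Python. The flow table here is just a set keyed by 5-tuple, which is a deliberate simplification of what the platform actually maintains:

```python
# Sketch: first packet of a flow takes the slow (software) path; once an
# entry exists in the flow table, subsequent packets matching the same
# 5-tuple take the fast path that bypasses the software data-path.
flow_table: set[tuple] = set()

def handle_packet(five_tuple: tuple) -> str:
    if five_tuple in flow_table:
        return "fast path"          # bypasses the software data-path
    # Slow path: the full ruleset evaluation would happen here, then...
    flow_table.add(five_tuple)      # ...the flow is cached for next time
    return "slow path"

flow = ("10.0.0.4", 50000, "10.0.0.5", 443, "tcp")
print(handle_packet(flow))  # slow path (first packet)
print(handle_packet(flow))  # fast path (all subsequent packets)
```

With long-lived flows, almost every packet hits the fast path; with a high CPS rate, almost every connection pays the slow-path cost at least once, which is exactly why offloading helps less there.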
Knowing the above, we can infer that if we have a requirement for a high rate of newly created connections per second (CPS), certain offloading features may not help us much, or may not help us at all!
In this second scenario, with a high CPS rate requirement, we may want to scale out our platform. This way we can divide the CPS rate between more than one endpoint (usually one or more VMs behind a load balancer, or via some other load-balancing solution such as DNS load balancing).
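As an illustration of the scale-out idea, here is a hedged sketch of how a load balancer can spread new connections across a pool while keeping every packet of one connection on the same backend, using a stable hash of the 5-tuple. The backend names and the hashing choice are illustrative, not any provider’s actual algorithm:

```python
import hashlib

backends = ["vm-1", "vm-2", "vm-3"]  # hypothetical scaled-out pool

def pick_backend(five_tuple: tuple, pool: list[str]) -> str:
    """Deterministic 5-tuple hash: the same connection always maps to the
    same backend, while new connections spread across the whole pool."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

conn = ("203.0.113.7", 49152, "10.0.0.10", 443, "tcp")
print(pick_backend(conn, backends))
```

Each backend then only has to absorb its share of the total CPS rate.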
Of course the major cloud providers have other sorts of improvements to keep pushing better performance and more stability in different scenarios. Microsoft has a very clear presentation about some of their improvements here (PDF from 2016, so a bit dated by now).
- Cloud networking architects should do the exercise of investigating and understanding the different traffic profiles their platform will have. Any other approach increases the risk of suffering reduced performance and/or outages.
- Cloud Service Providers do apply all the performance improvements we knew from the old physical networking world plus a lot of innovation around custom NICs and newer technologies.
Other topics we will cover in the series include concepts relevant within the guest OS, such as RSS on multi-core machines and DPDK for fast packet processing in user space, as well as architectural reviews of globally available platforms from a more holistic point of view. Prompt and secure delivery at our edge (especially when it is ubiquitous) is as important as processing the data in our backends.