Speeding up long-distance vMotion

Here’s the situation I was facing: two VMware vCenter 7.0.3 clusters on opposite sides of the country, connected via a 10gig point-to-point Ethernet circuit, and that circuit tends to have a bit of packet loss (0.5%-1%) under heavy utilization. Being cross-country, it of course also has latency, typically in the 55ms RTT range.
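
If you want to characterize a link like this yourself, a few hundred pings are enough to surface both the average RTT and the loss rate, and mtr will additionally show which hop introduces the loss. A quick sketch, with <far-side-IP> standing in for any host on the remote end:

# 500 pings, 5 per second; the summary line reports packet loss % and min/avg/max RTT
ping -c 500 -i 0.2 <far-side-IP>

# per-hop loss and latency, in report mode
mtr -rw -c 500 <far-side-IP>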

VMware supports long-distance vMotion and claims you can do it with latency as high as 150ms. I found that to be the case, but whether over a zero-loss site-to-site VPN across the internet or over this high-speed point-to-point link, I was never able to get vMotion to average more than a few hundred megabits. That is of course complete garbage if you’re trying to move terabytes of VMs without downtime. If the VMs are changing frequently, and the rate of change exceeds the speed at which you can move data, you may even find vMotion impossible to use.

I found the above was due to the packet loss. I’ve done the same tests across a proper 10gig wave between locations, with the same latency but without the lossiness. On those I’d see a gigabit or two, but still far from full utilization. I suspect I could get that up to several gigabits with what I’ve learned and will share here; I just didn’t go back to test.

The cause of the problem is that the vMotion (hot migration) and provisioning (cold migration) network stacks are based on decades-out-of-date TCP configurations. They offer a choice of two ancient congestion control algorithms, New Reno or CUBIC, and both treat loss as congestion. When they see loss, they overreact and drastically shrink the TCP congestion window, causing your throughput to nosedive, then take an excruciatingly long time to open it back up; chances are another bit of loss will occur before they do, compounding the issue.
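
You don’t actually need the WAN to see this behavior; netem on one of the test endpoints will add the delay and loss for you, which is handy for experimenting before touching a production link. A rough sketch, assuming eth0 faces the other test node:

# emulate roughly half of the 55ms RTT plus ~1% loss on outbound traffic
tc qdisc add dev eth0 root netem delay 27ms loss 1%

# check it took effect, and remove it when done
tc qdisc show dev eth0
tc qdisc del dev eth0 root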

Across this same lossy link, I did testing between two Linux nodes set to use the BBR congestion control algorithm, along with some other minor tuning I found at https://fasterdata.es.net/host-tuning/linux/test-measurement-host-tuning/:

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
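
To make those stick across reboots, drop them into a file under /etc/sysctl.d/ and reload; the one gotcha is making sure the kernel actually has BBR available, since most distros build it as a module. The filename below is arbitrary:

# load the BBR module now and on every boot
modprobe tcp_bbr
echo tcp_bbr > /etc/modules-load.d/bbr.conf

# confirm bbr shows up in the list of available algorithms
sysctl net.ipv4.tcp_available_congestion_control

# apply the settings above (saved as /etc/sysctl.d/90-wan-tuning.conf)
sysctl --system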

With that config in place, I can frequently push 7+ Gbps with a single TCP session, where with the Linux default congestion control of CUBIC and default settings, I’d be lucky to average more than a gigabit.
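
If you want to reproduce that kind of single-stream test yourself, iperf3 between the two tuned nodes is the easiest way; the -C flag is handy for flipping back to cubic for an apples-to-apples comparison (addresses are placeholders):

# on the far-side node
iperf3 -s

# on the near-side node: one TCP stream, 30 seconds, using the system default (bbr here)
iperf3 -c <far-side-IP> -t 30

# same run, but forcing cubic for this test only
iperf3 -c <far-side-IP> -t 30 -C cubic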

VMware doesn’t allow you to tinker with congestion control or windowing/buffer settings, so my workarounds were focused mostly on multi-stream vMotion.

I tinkered extensively with the number of vMotion vmkernel ports defined, and with custom values for the following settings:

Migrate.NetExpectedLineRateMBps
Migrate.VMotionStreamHelpers
Net.TcpipRxDispatchQueues
Migrate.BindToVmknic

I probably went through about 30 permutations of settings, and the one that I finally arrived at was:

Two target vmkernel ports
Five source vmkernel ports

Migrate.NetExpectedLineRateMBps = 2000
Net.TcpipRxDispatchQueues = 5
Migrate.BindToVmknic = 2
Migrate.VMotionStreamHelpers = 48

I know 48 stream helpers sounds kind of insane, but I went all the way to 64 in +8 jumps and 48 had the highest throughput for my 16 pCore source vMotion server. With the above config, on the 55ms 10gig circuit with ~1% loss, I was able to do cross-country live vMotions of both VM state and storage at upwards of 2.4 Gbps. I did start out with more vmkernel ports on the target, but for whatever reason, while this config allows for 48 TCP sessions spread evenly across the five source vmkernel ports, they’d still only target two unique IP addresses, so the extra target ports served no purpose.
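
Those values live under Advanced System Settings on each host; if you’d rather script them than click through the UI, the esxcli equivalents should look roughly like this. Double-check the option paths against your build, and note that the Net.TcpipRxDispatchQueues change, as far as I know, needs a host reboot to take effect:

esxcli system settings advanced set -o /Migrate/NetExpectedLineRateMBps -i 2000
esxcli system settings advanced set -o /Net/TcpipRxDispatchQueues -i 5
esxcli system settings advanced set -o /Migrate/BindToVmknic -i 2
esxcli system settings advanced set -o /Migrate/VMotionStreamHelpers -i 48

# read one back to confirm it stuck
esxcli system settings advanced list -o /Migrate/VMotionStreamHelpers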

The above config is also particularly useful if you have a Cogent point-to-point / metro Ethernet circuit between locations, because those are often rate-limited to around 2 Gbps per TCP session even if you’re paying for a full 10gig circuit. Since vMotion spreads the load evenly across all the TCP sessions, this lets you better utilize the link without any one session hitting the rate limit and being throttled.

Next up, I found a second way to get around this issue, which could be of interest to those who want more of a band-aid fix to evacuate one cluster to another, i.e. nothing permanent. I happened across a Reddit thread where u/LatinSuD reported solving similar issues by using socat to proxy the TCP session to the remote host. This was intriguing because I knew my lossy link, in the hands of a modern TCP stack, is capable of nearly wire speed. I tested it out and it worked. With only crude tuning, I was able to achieve 4 Gbps of vMotion between the same hosts and across the same link, where I could only get 2.4 Gbps with the native tuning options. There’s probably more headroom too if socat were running on a more modern system, but I have it on an old four-core Xeon E3-1225 dating to 2011, truly ancient hardware. The system has a Mellanox ConnectX-3 10gig NIC.

Here was my config:

  • Source ESXi host vmotion vmkernel interface: 10.88.0.9/24
  • Source ESXi host vmotion network default gateway override: 10.88.0.1
  • Target ESXi host vmotion vmkernel interface: 10.99.0.9/24
  • socat proxy Linux system vmotion interface: 10.88.0.1

One target vmkernel port
One source vmkernel port

Migrate.NetExpectedLineRateMBps = 2000
Net.TcpipRxDispatchQueues = 5
Migrate.BindToVmknic = 2
Migrate.VMotionStreamHelpers = 5

The above config results in the source vMotion server wanting to talk to port 8000 on the target, 10.99.0.9. However, since its default gateway is now the socat server (you could also add a custom static route instead, if needed), it’s going to send its packets to that system regardless.
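
The gateway override lives on the vmkernel adapter itself. From an ESXi shell, the equivalent of what’s described above should be something like the following, with vmk1 as a placeholder for whichever vmkernel port carries vMotion; verify the flags against your ESXi version:

# per-vmkernel default gateway override (available since ESXi 6.5)
esxcli network ip interface ipv4 set -i vmk1 -I 10.88.0.9 -N 255.255.255.0 -t static -g 10.88.0.1

# or leave the gateway alone and add a static route for just the remote vMotion subnet
esxcli network ip route ipv4 add -n 10.99.0.0/24 -g 10.88.0.1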

On the socat server, I alter the packets with a destination NAT rule so it delivers them to itself instead of discarding them (or attempting to route them, if forwarding were enabled):

iptables -t nat -A PREROUTING -p tcp -d 10.99.0.9 --dport 8000 -j DNAT --to-destination 10.88.0.1:8000
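
An easy way to confirm the rule is actually matching once a migration starts is to watch its packet counters climb:

# the pkts/bytes counters on the DNAT rule should increase during a vMotion
iptables -t nat -L PREROUTING -v -n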

I also have the earlier sysctl settings in place, and most importantly, BBR congestion control:

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_mtu_probing = 1
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

Make sure nothing is throttling the CPU clocks, since socat is CPU-bound:

cpupower frequency-set -g performance

Now start socat up to forward the received packets to the vmotion target:

socat TCP4-LISTEN:8000,fork,reuseaddr,bind=10.88.0.1 TCP4:10.99.0.9:8000
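
A plain foreground socat like that is fine for a one-off run under tmux; if the evacuation is going to take days, it’s worth confirming the listener is actually up and letting systemd restart it if it dies. For example:

# confirm socat is listening on the vMotion-facing IP
ss -tlnp | grep :8000

# or run it as a transient systemd service that restarts on failure
systemd-run --unit=vmotion-proxy --property=Restart=always \
  socat TCP4-LISTEN:8000,fork,reuseaddr,bind=10.88.0.1 TCP4:10.99.0.9:8000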

With the above config, the source vMotion server will use five TCP sessions. Its packets are intercepted by the socat server and rewritten to target the local interface IP; socat is listening, accepts them, and spawns five children, each of which opens a new proxied TCP session to the target vMotion interface and forwards the data along. Because of the proxying, the TCP conversation between the source and socat happens on the fast, non-lossy local network, while the TCP conversation between the socat server and the vMotion target uses the BBR algorithm and a tuned TCP stack to achieve massively improved throughput. Like I said: an old server with a 12-year-old CPU and a years-old NIC, and it still saw 4+ Gbps on this lossy link.
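
You can watch both halves of the proxying from the socat box with ss: the inbound sessions from the source host and the outbound sessions to the target show up separately, and the -i detail confirms the outbound ones really are using bbr, along with their current delivery rate:

# congestion algorithm, cwnd, rtt, and delivery rate for the vMotion sessions
ss -tin '( dport = :8000 or sport = :8000 )'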
