In a populated compute cloud, there are several scenarios in which it’s beneficial to be able to rearrange VM guest instances into a different placement across the hypervisor hosts via migration (live or otherwise). These use cases typically fall into three categories:
- Rebalancing – spread the VMs evenly across as many physical VM host machines as possible (conceptually similar to vSphere DRS). Example use cases:
- Optimise workloads for performance, by reducing CPU / I/O hotspots.
- Maximise headroom on each physical machine.
- Reduce thermal hotspots in order to reduce power consumption; for example, back in 2003, HP showed that intelligent workload placement can reduce energy consumption by more than 14%. (Here’s an old slidedeck I made years ago when researching this use case.)
- Consolidation – condense VMs onto fewer physical VM host machines (conceptually similar to vSphere DPM). Typically involves some degree of defragmentation. Example use cases:
- Increase CPU / RAM / I/O utilization. CERN blogged last summer about this Tetris-like challenge, and it can be taken even further by over-committing CPU/RAM and/or sharing memory pages.
- Free up physical servers to reduce power consumption.
- Evacuation – free up physical servers:
- for repurposing or decommissioning
- for maintenance (BIOS upgrades, re-cabling etc.) or repair
- to protect SLAs, e.g. when monitors indicate potential imminent hardware failure, or when HVAC failures are likely to cause servers to shut down due to over-heating. (This is different to the somewhat confusingly named `nova evacuate` functionality, which moves VMs to a new host after the server has already failed.)
Whilst one-shot manual or semi-automatic rearrangement can bring immediate benefits, the biggest wins often come when continual rearrangement is automated. The approaches can also be combined, e.g. first evacuate and/or consolidate, then rebalance on the remaining physical servers.
Other custom rearrangements may be required according to other IT- or business-driven policies, e.g. only rearrange VM instances relating to a specific workload, in order to increase locality of reference, reduce latency, respect availability zones, or facilitate other out-of-band workflows or policies (such as data privacy or other legalities).
In the rest of this post I will expand this topic in the context of OpenStack, talk about the computer science behind it, propose a possible way forward, and offer a working prototype in Python.
If you’re in Vancouver for the OpenStack summit which starts this Monday and you find this post interesting, ping me for a face-to-face chat!
VM placement in OpenStack: present and future
It is clear from the diversity of the use cases listed above that VM placement policies are likely to vary greatly across clouds, and sometimes even within a single cloud. OpenStack Compute (`nova`) has fairly sophisticated scheduling capabilities which can be configured to implement some of the above policies on an incremental basis, i.e. every time a VM instance is started or migrated, the destination VM host can be automatically chosen according to filters and weighted cost functions. However, this approach is somewhat limited for migration, because the placement policies are only considered one migration at a time. As in Tetris, more efficient arrangements can be achieved by thinking further ahead!
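To illustrate the incremental, one-instance-at-a-time nature of this approach, here is a drastically simplified filter-and-weigh sketch. The host/VM dictionaries and the spread-style weigher are hypothetical stand-ins, not nova’s actual classes:

```python
# Illustrative sketch only: a heavily simplified filter-and-weigh
# scheduler, choosing a destination host for one VM at a time.
# Data structures are hypothetical, not nova's real API.

def pick_host(hosts, vm):
    """Return the best host for vm, or None if none fits."""
    # Filter phase: discard hosts lacking free RAM or vCPUs.
    candidates = [h for h in hosts
                  if h["free_ram_mb"] >= vm["ram_mb"]
                  and h["free_vcpus"] >= vm["vcpus"]]
    if not candidates:
        return None
    # Weigh phase: prefer the host with the most free RAM
    # (a spread / rebalancing policy).
    return max(candidates, key=lambda h: h["free_ram_mb"])

hosts = [
    {"name": "host1", "free_ram_mb": 2048, "free_vcpus": 2},
    {"name": "host2", "free_ram_mb": 8192, "free_vcpus": 8},
]
print(pick_host(hosts, {"ram_mb": 1024, "vcpus": 1})["name"])  # host2
```

The key limitation is visible in the signature: the function sees one VM at a time, so it can never plan a multi-migration rearrangement.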
OpenStack clouds can already be segregated in various ways via cells, regions, availability zones, host aggregates, and server groups, and so there are various techniques available for applying different placement policies to different areas of your cloud. For example, you could have groups of hosts dedicated to CPU-intensive or I/O-bound workloads using an anti-affinity placement policy, and other groups dedicated to light workloads using an affinity placement policy. But these policies are static in nature (e.g. `nova` filters and schedulers are configured by the cloud administrator in `nova.conf`), which somewhat limits their flexibility.
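For illustration, a Kilo-era `nova.conf` might pin the scheduling policy cloud-wide along these lines (option names have varied across releases, so treat this as a sketch rather than a reference):

```ini
[DEFAULT]
# Filters every scheduling request must pass, applied cloud-wide
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ServerGroupAffinityFilter,ServerGroupAntiAffinityFilter
# Cost functions used to rank the hosts which survive filtering
scheduler_weight_classes = nova.scheduler.weights.all_weighers
```

Changing any of this means editing the file and restarting the scheduler, which is exactly the static quality noted above.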
OpenStack is still relatively new; however, with Cinder and Neutron rapidly evolving, it has reached the point where a VM’s network and storage dependencies can be live-migrated along with the workload in a near-seamless fashion. So now we should be able to start developing mechanisms for implementing more sophisticated placement policies, where not only is VM rearrangement performed automatically, but the policies themselves can be varied dynamically over time as workload requirements change.
VMware is not the (complete) solution
VMware advocates might instinctively recommend running vSphere clusters underneath nova, to harness vSphere’s pre-existing DRS / DPM functionality. However, there are still unresolved problems with this approach, as this article on `nova-scheduler` and DRS highlights. (Funnily enough, I encountered exactly these same problems a few years ago when I was part of a team adding vSphere integration into the orchestration component of what has since become NetIQ Cloud Manager …) Besides, a vSphere cluster and an OpenStack cloud are very different beasts, and of course not everyone wants to build their cloud exclusively with VMware technology anyway.
The computer science behind cloud rearrangement
Unfortunately developing algorithms to determine optimal placement is distinctly non-trivial. For example, the consolidation scenario above is a complex variant of the bin packing problem, which is NP-hard. The following constraints add significant complexity to the problem:
- A useful real world solution should take into account not only the RAM footprint of the VMs, but also CPU, disk, and network.
- The algorithm needs to ensure that SLAs are maintained whilst any rearrangement takes place.
- If the cloud is too close to full capacity, it may not be possible to rearrange the VMs from their current placement to a more optimal placement without first shutting down some VMs, which could be prohibited by the SLAs.
- Even if the cloud is at low utilization, live migration is not always possible, e.g.
- the hypervisor may not support it (especially across non-identical CPU architectures)
- shared storage may be required but unavailable
- the network may not support it
- Even if the arrangement is achievable purely via a sequence of live migrations, the algorithm must also be sensitive to the performance impact on running workloads when performing multiple live migrations. Live migration requires intensive bursts of network I/O to synchronize the VM’s memory contents between the source and target hosts, followed by a momentary freeze of the VM as it flips from the source to the target. This trade-off between optimal resource utilization and service availability means that a sub-optimal final placement may be preferable to an optimal one.
- In the case where the hypervisor is capable of sharing memory pages between VMs (e.g. KSM on KVM), the algorithm should try to co-locate VMs which are likely to share memory pages (e.g. VMs running the same OS platform, OS version, software libraries, or applications). A research paper published in 2011 demonstrated that VM packing which optimises placement in this fashion can be approximated in polynomial time, achieving a 32% to 50% reduction in the number of servers required and a 25% to 57% reduction in memory footprint compared to sharing-oblivious algorithms.
(There is prior art of course. For example, several years ago, Pacemaker implemented a best effort algorithm for VM host selection.)
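To see why the consolidation scenario maps onto bin packing, and why heuristics are used despite the problem being NP-hard, here is a minimal first-fit-decreasing sketch packing VMs by RAM alone. The data structures are hypothetical, and a real placement engine must also weigh CPU, disk, network and the constraints listed above:

```python
# First-fit decreasing: a classic bin-packing heuristic, here used to
# consolidate VMs onto as few identical hosts as possible, by RAM only.

def consolidate(vms, host_ram_mb):
    """Pack VMs (dict of name -> RAM in MB) onto hosts of host_ram_mb."""
    hosts = []  # each host is [free_ram_mb, [vm names]]
    # Place the largest VMs first; big items are hardest to fit later.
    for name, ram in sorted(vms.items(), key=lambda kv: -kv[1]):
        for host in hosts:
            if host[0] >= ram:          # first host with room wins
                host[0] -= ram
                host[1].append(name)
                break
        else:                           # no existing host fits: open a new one
            hosts.append([host_ram_mb - ram, [name]])
    return [h[1] for h in hosts]

placement = consolidate({"a": 3000, "b": 2000, "c": 3500, "d": 1000}, 8192)
print(len(placement))  # 2 (the 9500 MB total cannot fit one 8192 MB host)
```

Note what this sketch deliberately ignores: it computes a final placement from scratch, saying nothing about how to migrate the running VMs there, which is precisely the gap discussed next.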
Divide and conquer?
As noted by the 2011 research paper referenced above, this area of computer science is still evolving. There is one constant however: any rearrangement solution must not only provide a final VM placement optimised according to the chosen constraints, but also a sequence of migrations to it from the current placement. There will often be multiple migration sequences reaching the optimised placement from the current one, and their efficiency can vary widely. In other words, there are two questions which need answering:
- Given a starting placement A, which is the best (or near-optimal) final placement B to head for?
- What’s the best way to get from A to B?
The above considerations strongly suggest that the first question is much harder to answer than the second, although the two are not entirely orthogonal; for example, there could be two different final placements B and C which are equally near-optimal, but it may be much harder to reach C from A than to reach B from A.
I propose that it is worth examining the effectiveness of a divide and conquer approach: solving the second question may simplify the first, and also provide a mechanism for comparatively evaluating the effectiveness of potential answers to the first. Another bonus of this decoupling is that it should be possible for the path-finding algorithm to also discover opportunities for parallelizing live migrations when walking the path, so that the target placement B can be reached more quickly.
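To make the second question concrete, here is a toy sketch (not the actual prototype) that treats migration path-finding as breadth-first search over placements, where each edge is a single live migration and every intermediate placement must respect host RAM capacity. BFS guarantees a shortest migration sequence in this toy model, though it scales poorly to realistic cloud sizes:

```python
# Toy model: a placement maps vm -> host; one live migration moves one
# VM to a different host, and no intermediate placement may overcommit
# any host's RAM. BFS finds a shortest sequence of such migrations.

from collections import deque

def find_path(start, goal, vm_ram, capacity):
    """Return a list of (vm, dest_host) migrations from start to goal,
    or None if the goal is unreachable without shutting VMs down."""
    def fits(placement):
        used = {}
        for vm, host in placement.items():
            used[host] = used.get(host, 0) + vm_ram[vm]
        return all(used[h] <= capacity[h] for h in used)

    key = lambda p: tuple(sorted(p.items()))
    seen = {key(start): []}          # placement -> migration path to it
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        path = seen[key(cur)]
        if cur == goal:
            return path
        for vm, host in cur.items():
            for dest in capacity:
                if dest == host:
                    continue
                nxt = dict(cur, **{vm: dest})   # migrate vm to dest
                if key(nxt) not in seen and fits(nxt):
                    seen[key(nxt)] = path + [(vm, dest)]
                    queue.append(nxt)
    return None

path = find_path({"vm1": "h1", "vm2": "h1"},   # both VMs on h1 ...
                 {"vm1": "h2", "vm2": "h2"},   # ... should end up on h2
                 {"vm1": 2, "vm2": 2}, {"h1": 4, "h2": 4})
print(path)  # [('vm1', 'h2'), ('vm2', 'h2')]
```

A sketch like this also shows where parallelism could be discovered: consecutive migrations in the returned path which touch disjoint source/target host pairs could run concurrently.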
A proof of concept algorithm for VM migration path-finding
I’ve designed an algorithm which solves the second problem, and hacked together a proof of concept implementation in Python. This video shows it solving three randomly generated scenarios:
It’s far from perfect, but as a proof of concept it seems to hold up fairly well under soak testing. I’m pleased to announce that the code is now on GitHub: https://github.com/aspiers/migrationpaths and I would love to hear feedback on it.