Multi-site VMware migration to Proxmox VE
Starting point
Three industrial production sites, ~250 VMs, including around twenty workloads with strict internal SLAs. The Broadcom renewal cost five times as much as the previous contract. The decision to migrate was made. What was missing: a method.
Commercial pressure pushed for starting fast. Field experience pushed for starting right.
:::Critical point The disaster recovery plan (DRP) existed on paper, but it had never been executed in a migration context. Starting without rehearsing it meant exposing production to an unmeasured risk. :::
Upfront audit: what we found
Before touching the first VM, two weeks of audit:
- Mapping of inter-site and inter-VM network traffic
- Inventory of storage dependencies (NFS, iSCSI, production snapshots)
- Analysis of operational scripts and failover procedures
- Partial DRP execution to measure real RTO
The RTO (recovery time objective) declared in internal SLA contracts was 15 minutes. The recovery time measured during the partial test was 42 minutes, nearly three times the commitment. This 27-minute gap changed the entire migration sequence.
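The gap between the declared and the measured figures can be made explicit with a few lines of arithmetic. This is a minimal sketch: the two durations come from the audit above, while the function name `rto_gap` is only illustrative.

```python
from datetime import timedelta

# Figures from the audit: contractual RTO vs. recovery time
# measured during the partial DRP test.
declared_rto = timedelta(minutes=15)
measured_rto = timedelta(minutes=42)


def rto_gap(declared: timedelta, measured: timedelta) -> tuple[timedelta, float]:
    """Return the absolute gap and the measured/declared ratio."""
    return measured - declared, measured / declared


gap, ratio = rto_gap(declared_rto, measured_rto)
print(f"gap: {gap.total_seconds() / 60:.0f} min, ratio: {ratio:.1f}x")
# prints "gap: 27 min, ratio: 2.8x"
```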
:::Field observation The "available" maintenance windows according to the annual schedule did not match actual production constraints. Three of the five planned windows were pushed back when we consulted the business teams. :::
Target Proxmox architecture
Chosen architecture: multi-node Proxmox VE clusters per site, with cross-site replication for critical workloads. Ceph distributed storage for workloads that require it, local ZFS for workloads with lower resilience requirements.
Non-negotiable design principles:
- Automatic failover without human intervention for priority workloads
- Synchronous or asynchronous replication based on criticality (not a single architecture for everything)
- Independent management access per site in the event of a WAN outage
- No dependency on a single central component
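One way to keep "not a single architecture for everything" enforceable is to write the per-criticality choices down as data. The sketch below is an assumption, not the project's actual tooling: the tier names, the `Placement` fields, and the exact pairing of storage and replication mode with each tier are hypothetical and would need to match the real workload classification.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical criticality tiers (names are illustrative)."""
    CRITICAL = "critical"   # SLA-bound workloads
    STANDARD = "standard"   # lower resilience requirements


@dataclass(frozen=True)
class Placement:
    storage: str        # storage backend for the workload
    replication: str    # "synchronous" or "asynchronous"
    auto_failover: bool  # failover without human intervention


# Assumed mapping: one explicit policy per tier instead of a single
# architecture applied to every workload.
POLICY = {
    Tier.CRITICAL: Placement("ceph", "synchronous", True),
    Tier.STANDARD: Placement("zfs-local", "asynchronous", False),
}


def placement_for(tier: Tier) -> Placement:
    """Look up the placement policy for a workload tier."""
    return POLICY[tier]


print(placement_for(Tier.CRITICAL))
```

Keeping the policy in one table makes exceptions visible: a workload that needs different treatment has to be reclassified rather than quietly configured by hand.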
Migration sequence
The migration was split into 7 waves over 14 weeks:
- Waves 1–2: development and test workloads — to validate V2V procedures, target network configuration, and operational runbooks
- Waves 3–4: support infrastructure (DNS, monitoring, backups) — the tools needed to operate the remainder
- Waves 5–6: secondary production workloads — first contact with real production load
- Wave 7: SLA-bound critical workloads — only after real RTO validation on previous waves
:::Decision retained Each wave ended with an RTO validation under real conditions before the next wave started. The initial schedule allowed 10 weeks. It took 14, because of two postponements requested by the business teams. It was the right decision. :::
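The gating rule between waves can be sketched as a simple check: a wave clears only if its RTO validation has run and the result is within target. The wave grouping follows the sequence above, and the 15-minute target is the declared SLA figure. The measured values and the names `Wave`, `wave_cleared`, and `cleared_waves` are illustrative placeholders, not project data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Wave:
    name: str
    measured_rto_min: Optional[float] = None  # None until validation has run


def wave_cleared(wave: Wave, rto_target_min: float) -> bool:
    """A wave clears only if it was validated and met the target."""
    return (
        wave.measured_rto_min is not None
        and wave.measured_rto_min <= rto_target_min
    )


def cleared_waves(waves: list[Wave], rto_target_min: float) -> int:
    """Count consecutive cleared waves; stop at the first one that fails."""
    count = 0
    for wave in waves:
        if not wave_cleared(wave, rto_target_min):
            break
        count += 1
    return count


# Placeholder measurements, for illustration only.
waves = [
    Wave("waves 1-2", measured_rto_min=12.0),
    Wave("waves 3-4", measured_rto_min=18.0),  # misses the target
    Wave("waves 5-6"),                         # not yet validated
]
print(cleared_waves(waves, rto_target_min=15.0))  # prints 1
```

Stopping at the first failure mirrors the rule in the callout: a later wave never starts on the strength of an earlier wave's result alone.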
VMware / Proxmox coexistence
The VMware environment remained operational until the end of wave 7. This was genuine coexistence, not a symbolic one: critical workloads could be switched back to VMware if needed.
This decision had a cost (VMware licences paid for longer). It also had value: the production team accepted the migration because they never felt they had passed a point of no return.
:::Production reality The rollback was never triggered for critical workloads. But it was triggered twice for secondary workloads. Those two activations proved the mechanism worked — and built confidence for what followed. :::
Result
Proxmox VE infrastructure operational across all 3 sites. Active cross-site replication. DRP documented with measured RTO below 8 minutes on critical workloads. 70% reduction in licensing costs from the first year post-migration. Operations team self-sufficient on routine procedures after a hands-on training session.