Multi-site VMware migration to Proxmox VE

Starting point

Three industrial production sites, ~250 VMs, including around twenty workloads with strict internal SLAs. The Broadcom renewal quote came in at five times the cost of the previous contract. The decision to migrate had been made. What was missing was a method.

Commercial pressure pushed for starting fast. Field experience pushed for starting right.

:::Critical point The disaster recovery plan (DRP) existed on paper, but it had never been executed in a migration context. Starting without rehearsing it meant exposing production to an unmeasured risk. :::

Upfront audit: what we found

Before touching the first VM, we spent two weeks on an audit:

Mapping of inter-site and inter-VM network traffic
Inventory of storage dependencies (NFS, iSCSI, production snapshots)
Analysis of operational scripts and failover procedures
Partial DRP execution to measure real RTO

The recovery time objective (RTO) declared in the internal SLAs was 15 minutes. The recovery time measured during the partial test was 42 minutes. This gap reshaped the entire migration sequence.
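To make the size of that gap concrete, here is a minimal sketch of the arithmetic, using only the two figures above. The function name `rto_gap` is illustrative and not part of any tooling used on the project.

```python
def rto_gap(declared_min: float, measured_min: float) -> tuple[float, float]:
    """Return (overrun in minutes, measured/declared ratio)."""
    if declared_min <= 0:
        raise ValueError("declared RTO must be positive")
    return measured_min - declared_min, measured_min / declared_min


# Figures from the audit: 15 minutes declared, 42 minutes measured.
overrun, ratio = rto_gap(declared_min=15, measured_min=42)
print(f"Overrun: {overrun:.0f} min, ratio: {ratio:.1f}x")
# prints "Overrun: 27 min, ratio: 2.8x"
```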

:::Field observation The maintenance windows listed as "available" in the annual schedule did not match actual production constraints. Three of the five planned windows were pushed back once we consulted the business teams. :::

Target Proxmox architecture

Chosen architecture: one multi-node Proxmox VE cluster per site, with cross-site replication for critical workloads. Ceph distributed storage for the workloads that require it, local ZFS for workloads with lower resilience requirements.

Non-negotiable design principles:

Automatic failover without human intervention for priority workloads
Synchronous or asynchronous replication based on criticality (not a single architecture for everything)
Independent management access per site in the event of a WAN outage
No dependency on a single central component
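One way to keep such principles enforceable is to write the per-tier policy down as data rather than prose. The sketch below is a hypothetical illustration: the tier names, the storage assigned to each tier, and the sync/async split are assumptions made for the example, not the mapping actually used on the project.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"    # SLA-bound workloads
    SECONDARY = "secondary"  # production without a strict SLA
    NON_PROD = "non_prod"    # development and test


@dataclass(frozen=True)
class Policy:
    storage: str         # "ceph" or "zfs-local"
    replication: str     # "sync", "async" or "none"
    auto_failover: bool  # failover without human intervention


# Assumed mapping, for illustration only.
POLICIES = {
    Tier.CRITICAL: Policy("ceph", "sync", True),
    Tier.SECONDARY: Policy("zfs-local", "async", False),
    Tier.NON_PROD: Policy("zfs-local", "none", False),
}


def policy_for(tier: Tier) -> Policy:
    """Look up the policy for a tier; every tier must be covered."""
    return POLICIES[tier]


# Sanity checks derived from the principles above.
assert all(tier in POLICIES for tier in Tier)
assert policy_for(Tier.CRITICAL).auto_failover
assert len(set(POLICIES.values())) > 1  # not one architecture for everything
```

Keeping the mapping in one table makes it easy to review with the business teams and to check automatically that no tier is left without a policy.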

Migration sequence

The migration was split into 7 waves over 14 weeks:

Waves 1–2: development and test workloads — to validate V2V procedures, target network configuration, and operational runbooks
Waves 3–4: support infrastructure (DNS, monitoring, backups) — the tools needed to operate the remainder
Waves 5–6: secondary production workloads — first contact with real production load
Wave 7: SLA-bound critical workloads — only after real RTO validation on previous waves

:::Decision taken Each wave ended with an RTO validation under real conditions before the next wave started. The initial schedule allowed for 10 weeks. The migration took 14, because of two postponements requested by the business teams. It was the right decision. :::
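The gating rule can be sketched as a simple check: a stage may start only if the previous stage has a measured RTO within target. This is a minimal sketch under stated assumptions: using the 15-minute declared figure as the gate threshold is an assumption, and the measured values in the sample plan are hypothetical placeholders, not project data.

```python
from dataclasses import dataclass
from typing import Optional

# Declared SLA figure; using it as the gate threshold is an assumption.
TARGET_RTO_MIN = 15.0


@dataclass
class Wave:
    name: str
    measured_rto_min: Optional[float] = None  # None until the end-of-wave test has run


def gate_passed(wave: Wave, target: float = TARGET_RTO_MIN) -> bool:
    """A wave passes its gate only with a measured RTO at or below target."""
    return wave.measured_rto_min is not None and wave.measured_rto_min <= target


def runnable_waves(waves: list[Wave]) -> list[str]:
    """Names of waves allowed to start: the first one, then each wave
    whose predecessor passed its gate. Stops at the first failed gate."""
    allowed = []
    for i, wave in enumerate(waves):
        if i > 0 and not gate_passed(waves[i - 1]):
            break
        allowed.append(wave.name)
    return allowed


plan = [
    Wave("waves 1-2: dev/test", measured_rto_min=12.0),               # hypothetical value
    Wave("waves 3-4: support infrastructure", measured_rto_min=9.5),  # hypothetical value
    Wave("waves 5-6: secondary production"),                          # not yet validated
    Wave("wave 7: SLA-bound critical"),
]
print(runnable_waves(plan))
```

With this sample plan, the critical wave stays blocked until the secondary production stage has a validated measurement.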

VMware / Proxmox coexistence

The VMware environment remained operational until the end of wave 7. This was real coexistence, not a symbolic one: critical workloads could be switched back to VMware if needed.

This decision had a cost (VMware licences paid for longer). It also had value: the production team accepted the migration because they never felt they had crossed a "point of no return".

:::Production reality The rollback was never triggered for critical workloads. But it was triggered twice for secondary workloads. Those two activations proved the mechanism worked — and built confidence for what followed. :::

Result

Proxmox VE infrastructure operational across all 3 sites. Active cross-site replication. DRP documented with measured RTO below 8 minutes on critical workloads. 70% reduction in licensing costs from the first year post-migration. Operations team self-sufficient on routine procedures after a hands-on training session.