VMware audit before migration: what breaks clean plans

The committee version is almost always the same: "we have 240 VMs, a DRP, backups, we can plan the exit." That's reassuring. And often partially false.

On this engagement: three sites, industrial production, tight change windows. Initially, IT leadership wanted a cost estimate and a timeline. After ten days of audit, the right question was no longer "when do we start" but "where is our point of no return."

What looked solid

On paper:

clean vCenter inventory,
documented DRP,
active monitoring,
application ownership "known."

In the field workshops, the picture was more nuanced.

Services classified as "non-critical" were in fact part of production chains.
Middleware dependencies lived in operational scripts, not in the documentation.
The DRP assumed a recovery sequence that no longer matched the actual application behavior.

Nothing exotic. This is exactly what you find in environments that evolved quickly over several years.

The real problem wasn't VMware

The issue wasn't "VMware is bad" or "Proxmox is better." The issue was simpler: the migration decision was built on a partial technical truth.

When the dependency map is incomplete, the schedule becomes a hypothesis dressed up as certainty.

We saw it immediately:

two migration batches assumed to be independent shared a common network dependency,
some "green" backups had no recent proof of successful restore,
recovery runbooks referenced prerequisites that no longer existed in production.
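
The first finding, two batches that share a dependency, is the kind of thing a small script can surface once the dependency data has been collected. Here is a minimal sketch, assuming the audit has produced a mapping from each VM to what it depends on. The batch names, VM names and the `shared_dependencies` helper are all hypothetical, not taken from the engagement.

```python
from collections import defaultdict


def shared_dependencies(batches, depends_on):
    """Return {dependency: sorted batch names} for every dependency
    that is used by VMs in more than one migration batch."""
    users = defaultdict(set)
    for batch, vms in batches.items():
        for vm in vms:
            for dep in depends_on.get(vm, ()):
                users[dep].add(batch)
    return {dep: sorted(names) for dep, names in users.items() if len(names) > 1}


# Hypothetical inventory: two waves that were assumed to be independent.
batches = {
    "wave-1": ["erp-app", "erp-db"],
    "wave-2": ["mes-app", "label-print"],
}
depends_on = {
    "erp-app": {"erp-db", "dns-internal", "fw-zone-prod"},
    "erp-db": {"san-lun-12"},
    "mes-app": {"fw-zone-prod", "mes-db"},
    "label-print": {"mes-app"},
}

conflicts = shared_dependencies(batches, depends_on)
print(conflicts)  # {'fw-zone-prod': ['wave-1', 'wave-2']}
```

The check is only as good as the dependency data behind it. Dependencies that live in operational scripts rather than in the documentation have to be added to the mapping by hand before the result means anything.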

What shifted the decision

The turning point wasn't a benchmark. It was an incident simulation workshop.

When we ran a failover scenario with ops and application teams together, three gaps emerged:

The rollback decision deadline was not defined.
Responsibility for application validation was unclear.
The announced RTO did not account for cross-cutting dependencies.
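
The third gap can be made concrete with a little arithmetic. The sketch below assumes a simplified model: a service only starts recovering once everything it depends on is back, dependencies recover in parallel, and the dependency graph has no cycles. The service names and durations are invented for illustration.

```python
from functools import lru_cache


def effective_rto(service, recovery_minutes, depends_on):
    """Minutes until `service` is usable again, counting the time spent
    waiting for its dependencies (assumes an acyclic dependency graph)."""

    @lru_cache(maxsize=None)
    def ready_at(name):
        deps = depends_on.get(name, ())
        # Dependencies recover in parallel, so we wait for the slowest one.
        wait = max((ready_at(dep) for dep in deps), default=0)
        return wait + recovery_minutes[name]

    return ready_at(service)


# Hypothetical figures: the application alone restarts in 25 minutes.
recovery_minutes = {"storage": 30, "database": 45, "middleware": 20, "mes-app": 25}
depends_on = {
    "database": ["storage"],
    "middleware": ["database"],
    "mes-app": ["middleware", "database"],
}

print(effective_rto("mes-app", recovery_minutes, depends_on))  # 120
```

In this invented example the announced figure would be 25 minutes, while the chain storage, database, middleware, application adds up to 120. The model is deliberately crude, but the gap it shows is the one the workshop exposed.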

At that point, the "migration speed" debate lost its priority. IT leadership asked for something different: smaller waves, longer coexistence, and a validated, versioned and traceable rollback matrix.

Decisions made

We settled on a less spectacular but more solid trajectory:

migration by business dependency, not cluster convenience,
some VMware components kept in parallel, temporarily,
targeted DRP validation before moving any critical workloads,
go/no-go criteria defined before each wave, not the day before.
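
Defining go/no-go criteria ahead of time is easier to enforce when they are written down as data rather than held in people's heads. The following is a sketch under stated assumptions: the `WaveCriteria` and `WaveStatus` structures, their field names and the 30-day restore threshold are hypothetical. The 15-minute rollback decision window is the only figure that comes from this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class WaveCriteria:
    max_restore_test_age_days: int      # hypothetical threshold
    max_rollback_decision_minutes: int  # agreed before the wave, not the day before


@dataclass
class WaveStatus:
    name: str
    last_restore_test: Optional[date]         # None means no proof of restore
    rollback_decision_minutes: Optional[int]  # None means the deadline is undefined
    business_validator: Optional[str]         # who signs off functional recovery


def blockers(criteria: WaveCriteria, status: WaveStatus, today: date) -> List[str]:
    """Return the reasons a wave cannot go ahead; an empty list means GO."""
    found = []
    if status.last_restore_test is None:
        found.append("no restore test on record")
    elif (today - status.last_restore_test).days > criteria.max_restore_test_age_days:
        found.append("restore test older than allowed")
    if status.rollback_decision_minutes is None:
        found.append("rollback decision deadline not defined")
    elif status.rollback_decision_minutes > criteria.max_rollback_decision_minutes:
        found.append("rollback decision window too long")
    if not status.business_validator:
        found.append("no business-side validator named")
    return found


criteria = WaveCriteria(max_restore_test_age_days=30, max_rollback_decision_minutes=15)
status = WaveStatus(
    name="wave-2",
    last_restore_test=date(2024, 1, 10),
    rollback_decision_minutes=None,
    business_validator="plant-ops lead",
)

reasons = blockers(criteria, status, today=date(2024, 3, 1))
print("GO" if not reasons else "NO-GO", reasons)
```

Keeping the criteria in a versioned file also covers the "validated, versioned and traceable" requirement: the decision for each wave can be replayed against the criteria that were in force at the time.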

Yes, this stretches the program. Yes, it costs more in the short term.

But it avoids the classic scenario: a "successful" go-live followed by three weeks of instability and decisions made under stress.

A field observation often forgotten

A migration doesn't always fail on day one. It sometimes fails a month later, when the operations team absorbs the side effects with incomplete runbooks and excessive cognitive load.

The risk isn't only technical. It's also organizational, and it comes down to three questions:

who decides within 15 minutes whether to roll back or continue,
who validates functional recovery on the business side,
who maintains continuity if two incidents chain together.

If those answers aren't clear before the migration, they won't be clear during the incident.

Position

A useful VMware audit doesn't serve to justify a decision already made. It serves to make the decision robust.

Sometimes the audit confirms you can accelerate. Sometimes it forces you to slow down. In both cases, that's good news: you've stopped marketing a trajectory and gone back to engineering continuity.