DRP and HA: what real tests reveal, not slides

In many organizations, DRP and HA are treated as synonyms. In production, they're not. A solid local HA can coexist with a fragile DRP.

On a B2B SaaS engagement, everything seemed "under control": stable cluster, clean monitoring, validated documentation. The first full failover exercise told a different story.

What broke down

Not a spectacular failure. A series of small gaps, compounded:

DNS dependency underestimated in the recovery sequence,
application validation arriving too late,
escalation roles known in theory, blurry under real-time pressure,
detailed runbook that was hard to execute under stress.

This is the point I emphasize in workshops: a long runbook is not a robust runbook.

Declared RTO vs measured RTO

The target RTO was ambitious and credible on paper. In the actual test, full recovery took almost three times longer.

Why?

the platform restarted quickly,
dependent services, however, restarted in a different order than anticipated,
business validation extended the decision window.

In other words: the infrastructure wasn't the main bottleneck. The dependency chain was.

The organizational issue behind the technical issue

When a DRP is rarely exercised, coordination quality degrades before technical quality does.

You see it in incidents:

decisions delayed due to unclear authority,
conflicting arbitrations between infrastructure and application teams,
operational fatigue pushing people to "hold on" instead of rolling back cleanly.

The problem extends beyond tooling into operational governance.

What actually improved resilience

Not a complete overhaul. Targeted adjustments.

Non-nominal test scenarios, quarterly, with measured reports.
Shortened, action-oriented runbooks.
Explicit rollback criteria agreed upon in advance.
Application validation integrated early in the sequence, not at the end.

We also added a simple rule: if the situation is ambiguous beyond a fixed timebox, execute the return plan. No extended debate in the crisis room.

A point that sometimes makes people uncomfortable

Some teams prefer to avoid full tests to avoid "taking risks." In reality, they're just shifting the risk to the first real incident.

Testing doesn't create fragility. It reveals it.

Position

A mature DRP isn't a well-maintained PDF. It's a repeatable capability, under constraint, with an imperfect team and an imperfect context.

If this level isn't reached, it's better to slow down certain migration workstreams. It's less elegant in committee, but far more responsible in production.