Failure modes and rollback · Agentic Workflows & PMS Integration

Every agent fails. Not "might fail" — fails, plurally, in production, in ways you did not anticipate when you wrote the system prompt. The discipline that separates operators who run agents successfully from operators who quietly shut them down after six months is not avoiding failure. It is recognizing failure modes early, classifying them, and having rollback procedures ready before they are needed.

The five failure modes I have seen in production

Rollback design

Rollback is not a single procedure; it is a tiered set. Every operator project should have all three tiers.

Tier 1: reversible at the action level. Undo this specific PMS change in 1-2 clicks via the PMS UI, with a one-page runbook the front-desk team can follow without engineering involvement.

Tier 2: reversible at the run level. Replay the entire run with the human-decided correction, automatically.

Tier 3: reversible at the workflow level. Disable the agent for this workflow entirely and route incoming work to humans until engineering can investigate.

Tier 1 is the workhorse; you use it 30-50 times per month even on a healthy agent. Tier 2 is for genuinely incorrect runs; expect to use it 2-5 times per month. Tier 3 is the emergency brake; you may go six months without using it, and then use it twice in one week when something breaks.
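To make the three tiers concrete, here is a minimal Python sketch. It assumes the agent keeps a log of every PMS field it changes, including the previous value; that log, and every name here (Change, Run, tier1_undo_step, tier2_replay_inputs, tier3_disable), is hypothetical and not part of any particular PMS. Treat it as an illustration of the shape of each tier, not a finished design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    """One PMS field the agent changed (hypothetical record shape)."""
    record_id: str
    field_name: str
    before: str
    after: str


@dataclass
class Run:
    """One agent run: its inputs and the changes it made (assumed log format)."""
    run_id: str
    workflow: str
    inputs: dict[str, str]
    changes: list[Change] = field(default_factory=list)


def tier1_undo_step(change: Change) -> str:
    """Tier 1: a runbook line telling front-desk staff how to undo one change."""
    return (f"Open {change.record_id}, set {change.field_name} "
            f"back to '{change.before}' (agent set it to '{change.after}').")


def tier2_replay_inputs(run: Run, correction: dict[str, str]) -> dict[str, str]:
    """Tier 2: inputs for replaying the whole run; the human correction wins."""
    return {**run.inputs, **correction}


def tier3_disable(flags: dict[str, bool], workflow: str) -> None:
    """Tier 3: turn the agent off for one workflow so work routes to humans."""
    flags[workflow] = False
```

Recording the "before" value at write time is what makes Tier 1 cheap: the undo instruction can be generated from the log instead of reconstructed from memory.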

The "kill switch"

Every agent in production must have a kill switch that any front-desk supervisor can pull in under 30 seconds, with no engineering escalation. Practically, this is a feature flag (a row in a database, a config in a control panel) that the agent reads at the start of every run. If the flag is off, the agent declines to run and routes the inbound work to humans. The switch itself is pulled rarely; its existence is invoked often, every time an owner asks what happens if the agent goes wrong.
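A minimal sketch of that flag check, in Python. The flag store is passed in as a plain function so the example stays self-contained; in practice it would read the database row or control-panel setting. The names (run_with_kill_switch, flag_enabled, run_agent, route_to_humans) are hypothetical. One assumption of mine that the text above does not state: if the flag cannot be read at all, the sketch treats it as off and routes to humans.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RunResult:
    handled_by: str  # "agent" or "human"
    detail: Any


def run_with_kill_switch(
    workflow: str,
    payload: dict,
    flag_enabled: Callable[[str], bool],
    run_agent: Callable[[str, dict], Any],
    route_to_humans: Callable[[str, dict], Any],
) -> RunResult:
    """Read the flag at the start of every run; decline and hand off if it is off."""
    try:
        enabled = bool(flag_enabled(workflow))
    except Exception:
        # Assumption: an unreadable flag store counts as "off" (fail closed).
        enabled = False

    if not enabled:
        return RunResult("human", route_to_humans(workflow, payload))
    return RunResult("agent", run_agent(workflow, payload))


if __name__ == "__main__":
    flags = {"late_checkout": True}  # stand-in for the database row
    result = run_with_kill_switch(
        "late_checkout",
        {"reservation": "RES-1"},
        flag_enabled=lambda w: flags.get(w, False),
        run_agent=lambda w, p: "agent processed request",
        route_to_humans=lambda w, p: "queued for front desk",
    )
    print(result.handled_by)
```

Because the flag is read per run rather than cached at startup, a supervisor flipping it takes effect on the very next inbound request.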

Drift detection

Set up an automated check that runs every Monday: take the last 50 agent runs, score them against a small benchmark of "what the right answer would have been," and alert if accuracy drops below a threshold (typically 92-95% depending on workflow). The benchmark is just a spreadsheet of representative inputs and the expected outputs, maintained by the operations team. Drift detection catches the slow-degradation failure mode that no individual review will spot.
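The Monday check can be sketched in a few lines of Python. This assumes each recent run has already been paired with the expected output from the benchmark spreadsheet, and that "correct" means an exact match after trimming whitespace and ignoring case; both are simplifications of mine, and a real workflow may need a looser or stricter comparison. ScoredRun, accuracy, and check_drift are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScoredRun:
    run_id: str
    expected: str  # from the benchmark spreadsheet
    actual: str    # what the agent produced


def accuracy(runs: Sequence[ScoredRun]) -> float:
    """Fraction of runs whose output matches the expected output."""
    if not runs:
        raise ValueError("no runs to score")
    correct = sum(
        1 for r in runs
        if r.actual.strip().lower() == r.expected.strip().lower()
    )
    return correct / len(runs)


def check_drift(
    runs: Sequence[ScoredRun],
    threshold: float = 0.92,  # the text suggests 92-95% depending on workflow
    window: int = 50,         # "the last 50 agent runs"
) -> dict:
    """Score the most recent `window` runs and flag an alert below threshold."""
    recent = list(runs)[-window:]
    acc = accuracy(recent)
    return {"accuracy": acc, "sample_size": len(recent), "alert": acc < threshold}
```

Scheduling (cron, a workflow tool, or a calendar reminder plus a script) and the alert channel are left out on purpose; the useful part is that the threshold and window are explicit and versioned alongside the benchmark.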