A standard incident response procedure: detect, mitigate, recover, retrospect.
Incident response balances speed against traceability. Every mitigation, and every decision made by an autonomous agent, is recorded in a decision trail so that it can be reconstructed during the retrospective.
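A minimal sketch of such a decision trail follows. It assumes an in-memory, append-only list serialized as JSON Lines. The `Decision` and `DecisionTrail` names and the `actor`, `action`, `reason`, and `timestamp` fields are hypothetical, chosen for illustration rather than as a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Decision:
    """One entry in the decision trail (hypothetical field names)."""
    actor: str   # human responder or autonomous agent id
    action: str  # what was done, e.g. "rollback"
    reason: str  # why it was done
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class DecisionTrail:
    """Append-only log: entries are added but never edited or removed."""

    def __init__(self) -> None:
        self._entries: list[Decision] = []

    def record(self, actor: str, action: str, reason: str) -> Decision:
        entry = Decision(actor, action, reason)
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[Decision, ...]:
        # Return an immutable snapshot so callers cannot rewrite history.
        return tuple(self._entries)

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the decisions were made.
        return "\n".join(json.dumps(asdict(e)) for e in self._entries)


if __name__ == "__main__":
    trail = DecisionTrail()
    trail.record("oncall-alice", "shed 20% of traffic", "latency SLO burning")
    trail.record("agent-7", "rollback", "errors began at last deploy")
    print(trail.to_jsonl())
```

Freezing each `Decision` and exposing only a tuple keeps the trail tamper-resistant within the process; a real system would also need durable storage.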
1. Detect: as soon as an SLO alert fires, classify the incident's scope and severity.
2. Mitigate: apply reversible actions first, such as traffic shedding or rollback.
3. Recover: confirm the root cause and deploy the permanent fix.
4. Retrospect: write the root cause analysis (RCA) from the incident timeline and the decision logs.
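The Detect and Mitigate steps can be sketched as follows. This assumes severity is derived from the error-budget burn rate (observed error rate divided by the error budget, 1 minus the SLO target) and that scope is the set of alerting services. The SEV1/SEV2/SEV3 labels, the 14.4 and 6.0 thresholds, and the more-than-three-services rule are hypothetical values each team would tune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloAlert:
    service: str
    slo_target: float  # e.g. 0.999 for a 99.9% SLO
    error_rate: float  # observed fraction of failed requests


@dataclass(frozen=True)
class Mitigation:
    name: str
    reversible: bool


def burn_rate(alert: SloAlert) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return alert.error_rate / (1.0 - alert.slo_target)


def classify(alerts: list[SloAlert]) -> tuple[str, set[str]]:
    """Detect: return (severity, scope). Expects at least one alert.

    Thresholds are hypothetical.
    """
    scope = {a.service for a in alerts}
    worst = max(burn_rate(a) for a in alerts)
    if worst >= 14.4 or len(scope) > 3:
        return "SEV1", scope
    if worst >= 6.0:
        return "SEV2", scope
    return "SEV3", scope


def mitigation_order(candidates: list[Mitigation]) -> list[Mitigation]:
    """Mitigate: put reversible actions first.

    sorted() is stable, so the responders' original preference order is
    kept within the reversible group and within the irreversible group.
    """
    return sorted(candidates, key=lambda m: not m.reversible)


if __name__ == "__main__":
    alerts = [SloAlert("checkout", 0.999, 0.02), SloAlert("search", 0.999, 0.002)]
    print(classify(alerts))
    plan = [Mitigation("database schema migration", False),
            Mitigation("rollback", True),
            Mitigation("traffic shedding", True)]
    print([m.name for m in mitigation_order(plan)])
```

Sorting on `not m.reversible` works because `False` sorts before `True`, so reversible actions come first without discarding the irreversible ones.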
During the retrospective, every temporary action that only masked symptoms must be converted into a root-cause fix. The principle is to scan for the full pattern, meaning every place the same problem occurs, and fix all occurrences in one batch.
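A minimal sketch of "scan the full pattern, then fix in one batch" over text files, assuming the workaround can be recognized by a regular expression. The file names, the `TEMP-MITIGATION` marker, and the `retries` values are hypothetical.

```python
import re


def scan(pattern: str, files: dict[str, str]) -> dict[str, list[int]]:
    """Find every occurrence first: map file path -> 1-based line numbers."""
    rx = re.compile(pattern)
    hits: dict[str, list[int]] = {}
    for path, text in files.items():
        lines = [n for n, line in enumerate(text.splitlines(), start=1)
                 if rx.search(line)]
        if lines:
            hits[path] = lines
    return hits


def fix_in_one_batch(pattern: str, replacement: str,
                     files: dict[str, str]) -> dict[str, str]:
    """Rewrite every match; return only the changed files as one change set."""
    rx = re.compile(pattern)
    return {path: rx.sub(replacement, text)
            for path, text in files.items() if rx.search(text)}


if __name__ == "__main__":
    files = {
        "svc_a.yaml": "retries: 10  # TEMP-MITIGATION\ntimeout_ms: 2000\n",
        "svc_b.yaml": "timeout_ms: 2000\n",
        "svc_c.yaml": "timeout_ms: 2000\nretries: 10  # TEMP-MITIGATION\n",
    }
    pattern = r"retries: 10  # TEMP-MITIGATION"
    print(scan(pattern, files))
    print(fix_in_one_batch(pattern, "retries: 3", files))
```

Running `scan` before fixing makes the full extent of the pattern visible for review, and rescanning the fixed set should return no hits.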