Incident Response Runbook Template
About This Template
Use this template to create runbooks for automated or semi-automated incident response. Each runbook should cover a single, well-defined incident type. Runbooks are executed by the operations automation layer; they are also the documentation that engineers use when automation is unavailable or the incident falls outside automation scope.
Runbook: [Incident Type Name]
Runbook ID: RB-[CATEGORY]-[NUMBER] (e.g., RB-BGP-01)
Category: [Routing / Connectivity / Management Plane / Hardware / Security / Performance]
Automation tier: [Tier 1 — Auto-remediate / Tier 2 — Propose and approve / Tier 3 — Alert only]
Last reviewed: [YYYY-MM]
Owner: [Team or role]
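
The ID convention above can be checked mechanically, for example in a CI lint step. This is a minimal sketch that assumes the category code is uppercase letters and the number is two digits, as in RB-BGP-01; the pattern and the function name `is_valid_runbook_id` are illustrative and not part of the template.

```python
import re

# Assumed shape: "RB-" + uppercase category code + "-" + two-digit number.
RUNBOOK_ID = re.compile(r"RB-[A-Z]+-\d{2}")


def is_valid_runbook_id(runbook_id: str) -> bool:
    """Return True if the ID matches the assumed RB-[CATEGORY]-[NUMBER] shape."""
    return RUNBOOK_ID.fullmatch(runbook_id) is not None
```

Adjust the pattern if your estate uses longer numbers or mixed-case category codes.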
1. Incident Description
What this covers: [One paragraph describing the incident type. Be specific: “BGP session down”, not “routing problem”.]
Examples:
- [Specific example scenario 1]
- [Specific example scenario 2]
Explicitly out of scope: [Scenarios that look similar but are handled by a different runbook.]
2. Detection
Primary alert: [Alert name / monitoring rule that triggers this runbook]
Alert conditions:
[Alert definition: threshold, duration, source]
Example: BGP session state != Established for > 2 minutes on any peer
Secondary signals: [Additional telemetry that supports the diagnosis, such as interface counters, system logs, SNMP traps]
False positive rate: [Low / Medium / High — and the conditions that cause false positives]
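
The example condition ("state != Established for > 2 minutes") combines a state check with a duration. The sketch below shows one way to evaluate it, assuming the monitoring system provides chronological `(timestamp, state)` samples for a single peer; the function name and sample format are hypothetical.

```python
from datetime import datetime, timedelta


def session_down_too_long(samples, now, threshold=timedelta(minutes=2)):
    """samples: chronological list of (datetime, state) tuples for one peer.

    Returns True when the current run of non-Established samples began
    more than `threshold` before `now`.
    """
    down_since = None
    for ts, state in samples:
        if state == "Established":
            down_since = None      # session recovered, reset the timer
        elif down_since is None:
            down_since = ts        # first sample of the current outage
    return down_since is not None and now - down_since > threshold
```

Measuring from the first non-Established sample, rather than from the latest one, avoids resetting the timer when a session flaps between down states such as Idle and Active.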
3. Automated Diagnostics
Steps the automation layer runs before escalating or remediating:
| Step | Action | Expected Output | Failure Condition |
|---|---|---|---|
| 1 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |
| 2 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |
| 3 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |
Diagnostic script / tool: [script name or command]
Data collected:
- [Piece of telemetry or state collected]
- [Piece of telemetry or state collected]
- [Piece of telemetry or state collected]
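
The diagnostics table maps naturally onto an ordered list of checks, each with an action, a success test, and a failure that triggers escalation. This is a sketch under that assumption; `DiagnosticStep`, `run_diagnostics`, and the stubbed checks are hypothetical names, and real actions would query devices rather than return fixed strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiagnosticStep:
    name: str
    action: Callable[[], str]        # runs the check and returns its raw output
    passed: Callable[[str], bool]    # True if the output matches expectations


def run_diagnostics(steps):
    """Run steps in order and stop at the first failure.

    Returns (collected, failed_step): collected maps step name to output,
    failed_step is the name of the step that triggered escalation, or None.
    """
    collected = {}
    for step in steps:
        output = step.action()
        collected[step.name] = output
        if not step.passed(output):
            return collected, step.name
    return collected, None


# Stubbed example: the second check fails, so diagnostics stop there.
steps = [
    DiagnosticStep("ping_peer", lambda: "reachable", lambda out: out == "reachable"),
    DiagnosticStep("check_interface", lambda: "down", lambda out: out == "up"),
]
collected, failed = run_diagnostics(steps)
```

Everything gathered before the failing step stays in `collected`, so it can be attached to the escalation alert.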
4. Classification Logic
How the automation layer decides what to do:
IF [condition A] AND [condition B] THEN
    → Auto-remediation (Tier 1): [Action]
ELIF [condition C] THEN
    → Propose remediation (Tier 2): [Action], requires approval
ELIF [condition D] THEN
    → Alert and escalate (Tier 3): page on-call
ELSE
    → Alert only: unknown condition, human investigation required
Tier 1 criteria: [Conditions that must all be true before auto-remediation proceeds]
Hard stops — do not auto-remediate if:
- [Condition that excludes auto-remediation]
- [Condition that excludes auto-remediation]
- Multiple devices affected simultaneously (may indicate broader issue)
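
The decision chain and hard stops can be expressed as a small pure function, which makes the classification easy to unit-test. In this sketch the parameters `a` to `d` stand in for the placeholder conditions A to D. One assumption is not stated in the template: when a hard stop blocks auto-remediation, the incident is handed to a human as a Tier 3 escalation.

```python
from enum import Enum


class Tier(Enum):
    AUTO_REMEDIATE = 1
    PROPOSE = 2
    ESCALATE = 3
    ALERT_ONLY = 4   # unknown condition, human investigation required


def classify(a: bool, b: bool, c: bool, d: bool, hard_stops=()) -> Tier:
    """Mirror the IF / ELIF / ELSE chain; hard_stops is an iterable of booleans."""
    if a and b:
        if any(hard_stops):
            # Assumption: a hard stop escalates instead of auto-remediating.
            return Tier.ESCALATE
        return Tier.AUTO_REMEDIATE
    if c:
        return Tier.PROPOSE
    if d:
        return Tier.ESCALATE
    return Tier.ALERT_ONLY
```

For example, `classify(True, True, False, False, hard_stops=[multiple_devices_affected])` auto-remediates only when `multiple_devices_affected` is false.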
5. Remediation Actions
Tier 1: Automated Remediation
Action: [Specific remediation action]
Pre-conditions (all must be true):
- [Pre-condition 1]
- [Pre-condition 2]
Steps:
- [Step 1 — what the automation does]
- [Step 2]
- [Step 3]
- Verify: [What success looks like]
- If verification fails: [Escalation action]
Rollback: [How to undo this action if it makes things worse]
Script / tool: [script name]
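
The Tier 1 flow (pre-conditions, steps, verification, rollback, escalation) can be sketched as a single driver function. All callables here are hypothetical and supplied by the individual runbook. Two behaviours are assumptions rather than template requirements: unmet pre-conditions escalate without touching the device, and a failed verification triggers the rollback before escalating.

```python
def run_tier1(preconditions, steps, verify, rollback, escalate):
    """Drive one automated remediation attempt.

    preconditions: list of callables returning bool
    steps: list of callables performing the remediation, in order
    verify, rollback, escalate: callables defined by the runbook
    Returns "not_attempted", "remediated", or "escalated".
    """
    if not all(check() for check in preconditions):
        escalate()                 # assumption: hand over without changing anything
        return "not_attempted"
    for step in steps:
        step()
    if verify():
        return "remediated"
    rollback()                     # assumption: undo the change before escalating
    escalate()
    return "escalated"
```

Returning a status string rather than raising keeps the outcome easy to record in the automation log.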
Tier 2: Proposed Remediation (Requires Approval)
Proposed action: [What the automation suggests]
Approval required from: [Role or team]
Presented to approver:
- Diagnosis summary
- Proposed action and expected effect
- Rollback plan
- Estimated impact if not actioned
Approval channel: [Slack channel / ITSM ticket / ServiceNow approval workflow]
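
The items presented to the approver can be carried as a structured payload so that every approval channel receives the same fields. This is a sketch only: the `ApprovalRequest` class, its field names, and the JSON rendering are assumptions, and posting to Slack or an ITSM tool is left out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ApprovalRequest:
    runbook_id: str
    diagnosis_summary: str
    proposed_action: str
    expected_effect: str
    rollback_plan: str
    impact_if_not_actioned: str


def render(request: ApprovalRequest) -> str:
    """Serialise the request as JSON for posting to the approval channel."""
    return json.dumps(asdict(request), indent=2)
```

Making every field mandatory means a proposal cannot be sent without a rollback plan or an impact estimate.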
Tier 3: Escalation (Alert Only)
Page: [On-call role]
Alert contents:
- Incident type and affected device(s)
- Diagnostic data collected
- Suggested first investigation steps
- Link to this runbook
Escalation path: [Primary → Secondary → Management]
6. Manual Remediation Procedure
For use when automation is unavailable or the incident does not match automated classification:
- Verify the incident
  [Verification command or check]
- Gather state
  [Commands to run on affected device]
- Check recent changes
  - Review pipeline artefacts for changes deployed in the last 24 hours
  - Check ITSM change log for scheduled changes
- Remediation steps
  [Manual remediation steps]
- Verify remediation
  [Verification commands]
- Document
  - Update the ITSM incident ticket with root cause and resolution
  - If a pipeline artefact caused the incident, raise a bug against the relevant SoT entry
7. Post-Incident
Incident closed when: [Specific condition, not “when it feels resolved”]
Actions:
- Update ITSM ticket with root cause
- If auto-remediation triggered: review automation log and verify expected behaviour
- If auto-remediation failed or missed: update runbook with new classification condition
- If a configuration change caused the incident: create ADR or SoT schema update to prevent recurrence
- If this is a recurring incident: schedule post-mortem and blameless review
Metrics to record:
- Time to detect (TTD)
- Time to remediate (TTR)
- Was the incident detected by monitoring before a user reported it?
- Did automation handle it without human involvement?
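
TTD and TTR are simple timestamp differences, but the reference points should be agreed up front. This sketch assumes TTD runs from incident start to detection and TTR from detection to remediation; some teams measure TTR from incident start instead. The function and key names are illustrative.

```python
from datetime import datetime


def incident_metrics(started: datetime, detected: datetime, remediated: datetime) -> dict:
    """Return time to detect and time to remediate, both in seconds."""
    return {
        "ttd_seconds": (detected - started).total_seconds(),
        # Assumption: TTR is measured from detection, not from incident start.
        "ttr_seconds": (remediated - detected).total_seconds(),
    }
```

Use timestamps from a single clock source (or all in UTC) so that the differences are meaningful.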
8. Runbook Maintenance
| Field | Detail |
|---|---|
| Review cadence | Quarterly, or after any incident handled by this runbook |
| Review trigger | Runbook auto-remediation fails; new device type added to estate; false positive rate increases |
| Reviewer | [Operations lead / automation engineer] |
| Validation method | [How to test this runbook — lab simulation / dry-run mode] |
Adaptation Guide
Incident categories to create runbooks for:
| Category | Example incidents |
|---|---|
| Routing | BGP session down, OSPF adjacency loss, routing black hole, prefix leak |
| Connectivity | Interface down, port-channel member failure, MTU mismatch |
| Management plane | SNMP unreachable, syslog gap, NTP drift, SSH connectivity loss |
| Hardware | Interface error rate threshold, transceiver alarm, power supply failure |
| Security | Unexpected ACL hit rate spike, management plane access from unexpected source |
| Performance | Interface utilisation threshold, CPU/memory threshold, queue drop rate |
Tier assignment guidance:
| Assign Tier 1 if… | Assign Tier 2 if… | Assign Tier 3 if… |
|---|---|---|
| Root cause is always the same | Root cause is known but human judgement is needed | Root cause varies or is unknown |
| Remediation is fully reversible | Remediation has moderate impact risk | Remediation risk is high |
| Single device affected | Multiple devices or services affected | Impact scope uncertain |
| Automation has proven reliable for this case | New automation — not yet validated | No proven automated fix exists |
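
The guidance table does not say how to combine its rows, so the sketch below takes a conservative reading, which is an assumption: Tier 1 requires every Tier 1 criterion to hold, an unknown root cause forces Tier 3, and everything else lands in Tier 2. The function and parameter names are hypothetical.

```python
def suggest_tier(root_cause_known: bool, root_cause_always_same: bool,
                 fully_reversible: bool, single_device: bool,
                 automation_proven: bool) -> int:
    """Suggest an automation tier (1, 2, or 3) from the guidance criteria."""
    if not root_cause_known:
        return 3   # root cause varies or is unknown: alert only
    if (root_cause_always_same and fully_reversible
            and single_device and automation_proven):
        return 1   # every Tier 1 criterion holds
    return 2       # known cause, but human judgement is still needed
```

Treat the result as a starting point for review rather than a final assignment.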
This work is licensed under a Creative Commons Attribution-NonCommercial license.
You are free to use and adapt this material within your organisation for internal purposes. Republishing, selling, or distributing this content (in whole or in part) as a book, course, or other commercial product is not permitted without explicit permission.