Incident Response Runbook Template

Use this template to create runbooks for automated or semi-automated incident response. Each runbook should cover a single, well-defined incident type. Runbooks are executed by the operations automation layer; they are also the documentation that engineers use when automation is unavailable or the incident falls outside automation scope.


Runbook ID: RB-[CATEGORY]-[NUMBER] (e.g., RB-BGP-01)
Category: [Routing / Connectivity / Management Plane / Hardware / Security / Performance]
Automation tier: [Tier 1: Auto-remediate / Tier 2: Propose and approve / Tier 3: Alert only]
Last reviewed: [YYYY-MM]
Owner: [Team or role]


What this covers: [One paragraph describing the incident type. Be specific — “BGP session down” not “routing problem”.]

Examples:

  • [Specific example scenario 1]
  • [Specific example scenario 2]

Explicitly out of scope: [Scenarios that look similar but are handled by a different runbook.]


Primary alert: [Alert name / monitoring rule that triggers this runbook]

Alert conditions:

[Alert definition — threshold, duration, source]
Example: BGP session state != Established for > 2 minutes on any peer
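The example condition above combines a state test with a duration, so a single failed poll should not fire the alert. A minimal sketch of that evaluation follows. The `PeerSample` record and the `peers_in_alert` function are hypothetical names for illustration, and the sketch assumes your monitoring system can supply timestamped state samples per peer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PeerSample:
    """One polled observation of a BGP peer (hypothetical record shape)."""
    peer: str
    state: str
    observed_at: datetime


def peers_in_alert(samples, now, hold=timedelta(minutes=2)):
    """Return peers that have been continuously non-Established for longer than `hold`."""
    down_since = {}
    for sample in sorted(samples, key=lambda s: s.observed_at):
        if sample.state == "Established":
            # Recovery resets the timer for this peer.
            down_since.pop(sample.peer, None)
        else:
            # Remember only the start of the current down streak.
            down_since.setdefault(sample.peer, sample.observed_at)
    return sorted(peer for peer, since in down_since.items() if now - since > hold)
```

Most monitoring platforms express this natively as a "for" or "duration" clause on the alert rule; the sketch only makes the intended semantics explicit.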

Secondary signals: [Additional telemetry that supports the diagnosis — interface counters, system logs, SNMP traps]

False positive rate: [Low / Medium / High — and the conditions that cause false positives]


Steps the automation layer runs before escalating or remediating:

| Step | Action | Expected Output | Failure Condition |
|------|--------|-----------------|-------------------|
| 1 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |
| 2 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |
| 3 | [Diagnostic action] | [What success looks like] | [What triggers escalation] |

Diagnostic script / tool: [script name or command]

Data collected:

  • [Piece of telemetry or state collected]
  • [Piece of telemetry or state collected]
  • [Piece of telemetry or state collected]
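One way to model the diagnostic steps is as an ordered list in which each step carries its own success check, and the first failure stops the run and triggers escalation. The sketch below assumes that structure; `DiagnosticStep` and `run_diagnostics` are hypothetical names, and the callables stand in for whatever script or tool the runbook names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiagnosticStep:
    number: int
    action: str                         # the "Action" column
    run: Callable[[], str]              # performs the diagnostic and returns raw output
    is_expected: Callable[[str], bool]  # True when output matches "Expected Output"


def run_diagnostics(steps):
    """Run steps in order and stop at the first failure condition.

    Returns (collected, failed_step), where `collected` maps each action to
    its raw output and `failed_step` is None when every step passed.
    """
    collected = {}
    for step in steps:
        output = step.run()
        collected[step.action] = output
        if not step.is_expected(output):
            return collected, step.number
    return collected, None
```

Returning the collected data even on failure means the escalation alert can include everything gathered up to the failing step.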

How the automation layer decides what to do:

IF [condition A] AND [condition B]
THEN → Auto-remediation (Tier 1): [Action]
ELIF [condition C]
THEN → Propose remediation (Tier 2): [Action] — requires approval
ELIF [condition D]
THEN → Alert and escalate (Tier 3): page on-call
ELSE → Alert only: unknown condition, human investigation required
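The decision logic above translates directly into an ordered chain of checks. The following is a minimal sketch in which the four conditions are passed in as booleans; `Decision` and `classify_incident` are hypothetical names, and a real runbook would derive each condition from the diagnostic data.

```python
from enum import Enum


class Decision(Enum):
    AUTO_REMEDIATE = "Tier 1: auto-remediate"
    PROPOSE = "Tier 2: propose remediation, requires approval"
    ESCALATE = "Tier 3: alert and page on-call"
    ALERT_ONLY = "Alert only: unknown condition, human investigation required"


def classify_incident(cond_a: bool, cond_b: bool, cond_c: bool, cond_d: bool) -> Decision:
    """Evaluate the conditions in the order given above; the first match wins."""
    if cond_a and cond_b:
        return Decision.AUTO_REMEDIATE
    if cond_c:
        return Decision.PROPOSE
    if cond_d:
        return Decision.ESCALATE
    # Nothing matched: never guess, hand over to a human.
    return Decision.ALERT_ONLY
```

Because evaluation order matters, conditions A and B take precedence even when C or D is also true. Keep the most specific conditions first.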

Tier 1 criteria: [Conditions that must all be true before auto-remediation proceeds]

Hard stops — do not auto-remediate if:

  • [Condition that excludes auto-remediation]
  • [Condition that excludes auto-remediation]
  • Multiple devices affected simultaneously (may indicate broader issue)
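The Tier 1 criteria and hard stops can be enforced as a single gate that runs before any auto-remediation. In the sketch below, `tier1_gate` is a hypothetical helper; it treats the criteria and hard stops as named booleans and returns the reasons for any refusal so they can be logged or included in the escalation.

```python
def tier1_gate(criteria, hard_stops, affected_devices):
    """Decide whether auto-remediation may proceed.

    criteria:         name -> True when the Tier 1 criterion is met
    hard_stops:       name -> True when the hard stop is triggered
    affected_devices: device names involved in the incident
    Returns (allowed, reasons); `reasons` is empty when allowed is True.
    """
    reasons = [f"criterion not met: {name}" for name, met in criteria.items() if not met]
    reasons += [f"hard stop: {name}" for name, hit in hard_stops.items() if hit]
    # More than one distinct device may indicate a broader issue.
    if len(set(affected_devices)) > 1:
        reasons.append("hard stop: multiple devices affected")
    return (not reasons, reasons)
```

For example, `tier1_gate({"known root cause": True}, {"maintenance window": False}, ["edge-1"])` allows remediation, while adding a second device to the list blocks it.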

Action: [Specific remediation action]

Pre-conditions (all must be true):

  1. [Pre-condition 1]
  2. [Pre-condition 2]

Steps:

  1. [Step 1 — what the automation does]
  2. [Step 2]
  3. [Step 3]
  4. Verify: [What success looks like]
  5. If verification fails: [Escalation action]

Rollback: [How to undo this action if it makes things worse]

Script / tool: [script name]
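The structure above (pre-conditions, steps, verification, rollback) can be sketched as a small runner. `run_tier1_remediation` is a hypothetical name, and every argument is a callable supplied by the runbook author. Note one assumption: the sketch rolls back before escalating when verification fails, whereas the template only requires an escalation action, so adjust the order to your own policy.

```python
def run_tier1_remediation(preconditions, steps, verify, rollback, escalate):
    """Run an auto-remediation with a verification gate.

    preconditions: callables returning bool; all must be True to proceed
    steps:         callables executed in order
    verify:        callable returning True when remediation succeeded
    rollback:      callable that undoes the steps
    escalate:      callable taking a reason string
    """
    if not all(check() for check in preconditions):
        escalate("pre-condition not met")
        return "skipped"
    for step in steps:
        step()
    if verify():
        return "remediated"
    # Assumption: undo the change first, then hand over to a human.
    rollback()
    escalate("verification failed")
    return "rolled_back"
```

The sketch does not handle a step that raises an exception part-way through; production automation would also need to roll back and escalate in that case.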


Tier 2: Proposed Remediation (Requires Approval)

Proposed action: [What the automation suggests]

Approval required from: [Role or team]

Presented to approver:

  • Diagnosis summary
  • Proposed action and expected effect
  • Rollback plan
  • Estimated impact if not actioned

Approval channel: [Slack channel / ITSM ticket / ServiceNow approval workflow]
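The four items presented to the approver can be carried in one structured record so that no request is sent with a missing field. This is a minimal sketch; `ApprovalRequest` and its field names are hypothetical, and how the rendered text is delivered depends on the approval channel you choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    runbook_id: str
    diagnosis_summary: str
    proposed_action: str
    expected_effect: str
    rollback_plan: str
    impact_if_not_actioned: str

    def render(self) -> str:
        """Format the request as plain text for a chat message or ticket body."""
        return "\n".join([
            f"Runbook: {self.runbook_id}",
            f"Diagnosis: {self.diagnosis_summary}",
            f"Proposed action: {self.proposed_action}",
            f"Expected effect: {self.expected_effect}",
            f"Rollback plan: {self.rollback_plan}",
            f"Impact if not actioned: {self.impact_if_not_actioned}",
        ])
```

Because every field is required by the dataclass constructor, a request without a rollback plan fails at creation time rather than reaching the approver incomplete.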


Page: [On-call role]

Alert contents:

  • Incident type and affected device(s)
  • Diagnostic data collected
  • Suggested first investigation steps
  • Link to this runbook

Escalation path: [Primary → Secondary → Management]


For use when automation is unavailable or the incident does not match automated classification:

  1. Verify the incident

    [Verification command or check]
  2. Gather state

    [Commands to run on affected device]
  3. Check recent changes

    • Review pipeline artefacts for changes deployed in the last 24 hours
    • Check ITSM change log for scheduled changes
  4. Remediation steps

    [Manual remediation steps]
  5. Verify remediation

    [Verification commands]
  6. Document

    • Update the ITSM incident ticket with root cause and resolution
    • If a pipeline artefact caused the incident, raise a bug against the relevant SoT entry

Incident closed when: [Specific condition — not “when it feels resolved”]

Actions:

  • Update ITSM ticket with root cause
  • If auto-remediation triggered: review automation log and verify expected behaviour
  • If auto-remediation failed or missed: update runbook with new classification condition
  • If a configuration change caused the incident: create ADR or SoT schema update to prevent recurrence
  • If this is a recurring incident: schedule a blameless post-mortem review

Metrics to record:

  • Time to detect (TTD)
  • Time to remediate (TTR)
  • Was the incident detected by monitoring before a user reported it?
  • Did automation handle it without human involvement?
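These four metrics can be derived from a handful of timestamps recorded on the incident ticket. The sketch below assumes TTD is measured from incident start to detection and TTR from detection to remediation; the template does not fix these definitions, so align them with your own reporting. `incident_metrics` and its parameter names are hypothetical.

```python
from datetime import datetime


def incident_metrics(started_at, detected_at, remediated_at,
                     user_reported_at=None, human_actions=0):
    """Compute post-incident metrics from recorded timestamps.

    user_reported_at: when a user first reported the issue, or None if nobody did
    human_actions:    number of manual interventions during the incident
    """
    return {
        # Assumption: TTD runs from incident start to detection.
        "ttd_seconds": (detected_at - started_at).total_seconds(),
        # Assumption: TTR runs from detection to remediation.
        "ttr_seconds": (remediated_at - detected_at).total_seconds(),
        "detected_before_user_report": (
            user_reported_at is None or detected_at <= user_reported_at
        ),
        "fully_automated": human_actions == 0,
    }
```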

| Field | Detail |
|-------|--------|
| Review cadence | Quarterly, or after any incident handled by this runbook |
| Review trigger | Runbook auto-remediation fails; new device type added to estate; false positive rate increases |
| Reviewer | [Operations lead / automation engineer] |
| Validation method | [How to test this runbook: lab simulation / dry-run mode] |

Incident categories to create runbooks for:

| Category | Example incidents |
|----------|-------------------|
| Routing | BGP session down, OSPF adjacency loss, routing black hole, prefix leak |
| Connectivity | Interface down, port-channel member failure, MTU mismatch |
| Management plane | SNMP unreachable, syslog gap, NTP drift, SSH connectivity loss |
| Hardware | Interface error rate threshold, transceiver alarm, power supply failure |
| Security | Unexpected ACL hit rate spike, management plane access from unexpected source |
| Performance | Interface utilisation threshold, CPU/memory threshold, queue drop rate |

Tier assignment guidance:

| Assign Tier 1 if… | Assign Tier 2 if… | Assign Tier 3 if… |
|-------------------|-------------------|-------------------|
| Root cause is always the same | Root cause is known but human judgement is needed | Root cause varies or is unknown |
| Remediation is fully reversible | Remediation has moderate impact risk | Remediation risk is high |
| Single device affected | Multiple devices or services affected | Impact scope uncertain |
| Automation has proven reliable for this case | New automation, not yet validated | No proven automated fix exists |
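The guidance does not say how to combine the four rows when they point to different tiers. One cautious interpretation, sketched below as an assumption rather than a rule from this template, is to take the highest tier that any single answer points to, so Tier 1 is suggested only when every answer falls in the first column. `OPTIONS` and `suggest_tier` are hypothetical names.

```python
# Allowed answers per question, ordered Tier 1, Tier 2, Tier 3 as in the guidance above.
OPTIONS = {
    "root_cause": ("always_same", "known_needs_judgement", "varies_or_unknown"),
    "remediation_risk": ("fully_reversible", "moderate", "high"),
    "scope": ("single_device", "multiple", "uncertain"),
    "automation": ("proven", "not_yet_validated", "no_proven_fix"),
}


def suggest_tier(**answers):
    """Suggest an automation tier (1, 2, or 3) from one answer per question.

    Assumption: the most cautious tier indicated by any answer wins.
    """
    if set(answers) != set(OPTIONS):
        raise ValueError(f"expected exactly these questions: {sorted(OPTIONS)}")
    tiers = []
    for question, answer in answers.items():
        if answer not in OPTIONS[question]:
            raise ValueError(f"unknown answer for {question}: {answer!r}")
        tiers.append(OPTIONS[question].index(answer) + 1)
    return max(tiers)
```

Treat the result as a starting point for discussion; the final tier assignment remains a decision for the runbook owner.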
Network Automation Handbook · Patrick Lau
This work is licensed under a Creative Commons Attribution-NonCommercial license.
You are free to use and adapt this material within your organisation for internal purposes. Republishing, selling, or distributing this content (in whole or in part) as a book, course, or other commercial product is not permitted without explicit permission.