Auto-Remediation Risk Register

The auto-remediation risk register is the governance document for your automation’s decision-making authority. It records every category of drift or fault that the automation layer is authorised to act on, the conditions under which it may act, and the tier of action permitted.

This register should be reviewed quarterly and updated whenever the automation scope changes. Changes to tier assignments require approval as set out in the approval authority table at the end of this register.

The register does two things:

  1. Explicitly authorises specific automated actions (reducing ambiguity about what automation is allowed to do)
  2. Explicitly excludes automation from specific scenarios (creating a documented boundary that operations teams and auditors can rely on)

| Field | Value |
|---|---|
| Organisation | [Organisation name] |
| Environment scope | [Production / All environments / Specific segments] |
| Register version | [v1.0] |
| Last reviewed | [YYYY-MM] |
| Approved by | [Role / name] |
| Next review | [YYYY-MM] |

Tier definitions:

| Tier | Action | Criteria |
|---|---|---|
| 1 | Auto-remediate without human approval | Low risk, fully reversible, root cause deterministic, automation validated |
| 2 | Propose remediation, require approval | Moderate risk, human judgement needed for context or timing |
| 3 | Alert only; do not act | High risk, unknown root cause, or automation not yet validated |
| 4 | Never automate | Structural risk, compliance requirement, or irreversible action |
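The tier definitions can be encoded so that the automation layer checks them before acting. The following is a minimal sketch, not a prescribed implementation: the names `Tier` and `may_execute` are hypothetical, and it assumes that a Tier 2 approval is passed in as a simple boolean.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four tiers from the register (names are illustrative)."""
    AUTO_REMEDIATE = 1   # act without human approval
    PROPOSE = 2          # propose, then wait for approval
    ALERT_ONLY = 3       # alert, never act
    NEVER_AUTOMATE = 4   # permanently excluded


def may_execute(tier: Tier, human_approved: bool = False) -> bool:
    """Return True only if the automation layer may carry out the action."""
    if tier is Tier.AUTO_REMEDIATE:
        return True
    if tier is Tier.PROPOSE:
        return human_approved
    # Tier 3 and Tier 4 never execute, even if someone approves.
    return False
```

Hard stops are deliberately left out of this sketch; they would be evaluated separately, before `may_execute` is consulted.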

| ID | Scenario | Tier | Permitted Action | Hard Stops | Runbook | Notes |
|---|---|---|---|---|---|---|
| REM-RTG-01 | BGP session down (single peer) | 1 | Clear BGP session; verify re-establishment | Multiple sessions down simultaneously; session down > 30 min | RB-BGP-01 | |
| REM-RTG-02 | BGP session down (multiple peers) | 3 | Alert only; page on-call | | RB-BGP-02 | Multiple peers may indicate upstream or hardware issue |
| REM-RTG-03 | OSPF adjacency loss (single neighbour) | 2 | Propose: restart OSPF process | Multiple adjacencies affected; topology change in last 4h | RB-OSPF-01 | |
| REM-RTG-04 | Prefix leak detected (route appearing in wrong VRF) | 3 | Alert only | | RB-RTG-03 | Requires human investigation of policy |
| REM-RTG-05 | Routing black hole detected | 4 | Never automate | | RB-RTG-04 | Root cause investigation required; automated action risks compounding the fault |
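If the register is kept in machine-readable form, each row maps naturally onto a small record. The sketch below is one possible shape, assuming a hypothetical `RegisterEntry` type and an in-memory `REGISTER` dictionary; the field names mirror the table columns, and treating unknown IDs as "no automated action" is an assumption of this example rather than something the register states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterEntry:
    """One row of the register (field names mirror the table columns)."""
    entry_id: str
    scenario: str
    tier: int
    permitted_action: str
    hard_stops: tuple = ()
    runbook: str = ""
    notes: str = ""


# Two rows copied from the routing table above, keyed by ID.
REGISTER = {
    e.entry_id: e
    for e in (
        RegisterEntry(
            "REM-RTG-01", "BGP session down (single peer)", 1,
            "Clear BGP session; verify re-establishment",
            ("Multiple sessions down simultaneously", "Session down > 30 min"),
            "RB-BGP-01",
        ),
        RegisterEntry(
            "REM-RTG-05", "Routing black hole detected", 4,
            "Never automate", (), "RB-RTG-04",
            "Root cause investigation required",
        ),
    )
}


def auto_action_allowed(entry_id: str) -> bool:
    """Only Tier 1 entries present in the register may act unattended."""
    entry = REGISTER.get(entry_id)
    return entry is not None and entry.tier == 1
```

Defaulting to "not allowed" for anything missing from the register keeps the automation scope limited to what has been explicitly authorised.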

| ID | Scenario | Tier | Permitted Action | Hard Stops | Runbook | Notes |
|---|---|---|---|---|---|---|
| REM-CON-01 | Interface flapping (>3 state changes in 5 min) | 2 | Propose: err-disable interface | Core/uplink interfaces; port-channel members | RB-INT-01 | |
| REM-CON-02 | Port-channel member down | 3 | Alert only | | RB-LAG-01 | Port-channel resilience masks single member failures; investigation required |
| REM-CON-03 | Interface utilisation > 90% sustained 15 min | 2 | Propose: traffic engineering adjustment | | RB-INT-02 | Requires human decision on rerouting |
| REM-CON-04 | MTU mismatch detected | 3 | Alert only | | RB-INT-03 | Configuration change required; use pipeline |
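The REM-CON-01 trigger (more than 3 state changes in 5 minutes) is a sliding-window count. A minimal sketch of that check follows; the `FlapDetector` class is hypothetical, and it assumes state-change timestamps arrive in chronological order from whatever telemetry source is in use.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Counts interface state changes inside a sliding time window."""

    def __init__(self, max_changes: int = 3,
                 window: timedelta = timedelta(minutes=5)):
        self.max_changes = max_changes
        self.window = window
        self._events = deque()

    def record(self, ts: datetime) -> bool:
        """Record a state change; return True if the flap trigger is met."""
        self._events.append(ts)
        cutoff = ts - self.window
        # Drop state changes that have fallen out of the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) > self.max_changes
```

Because REM-CON-01 is Tier 2, a `True` result here would only raise a proposal; the hard stops for core/uplink interfaces and port-channel members still apply before anything is proposed.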

| ID | Scenario | Tier | Permitted Action | Hard Stops | Runbook | Notes |
|---|---|---|---|---|---|---|
| REM-MGMT-01 | Configuration drift from SoT | 1 | Re-apply SoT configuration (merge mode) | Drift on >3 devices simultaneously; drift on security ACLs; production change window not open | RB-DRIFT-01 | Pipeline-driven remediation only |
| REM-MGMT-02 | SNMP polling failure (device unreachable) | 2 | Propose: verify management plane connectivity | | RB-MGMT-01 | |
| REM-MGMT-03 | NTP drift > 5 seconds | 1 | Re-sync NTP; verify stratum | | RB-MGMT-02 | |
| REM-MGMT-04 | Syslog forwarding gap > 15 min | 2 | Propose: restart syslog process | | RB-MGMT-03 | |
| REM-MGMT-05 | SSH connectivity loss to device | 3 | Alert only | | RB-MGMT-04 | May indicate broader management plane failure |
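Hard stops are conditions that block a Tier 1 action even when the trigger matches. As an illustration, the three hard stops listed for REM-MGMT-01 could be evaluated as below. The function name and its parameters are hypothetical; how drift counts, ACL involvement, and change-window state are obtained is left to the surrounding tooling.

```python
def drift_hard_stops(devices_with_drift: int,
                     drift_touches_security_acl: bool,
                     change_window_open: bool) -> list:
    """Return the REM-MGMT-01 hard stops that apply; empty means proceed."""
    reasons = []
    if devices_with_drift > 3:
        reasons.append("drift on >3 devices simultaneously")
    if drift_touches_security_acl:
        reasons.append("drift on security ACLs")
    if not change_window_open:
        reasons.append("production change window not open")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, lets the alert that replaces the automated action say why automation stood down.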

| ID | Scenario | Tier | Permitted Action | Hard Stops | Runbook | Notes |
|---|---|---|---|---|---|---|
| REM-SEC-01 | ACL hit rate spike (>10x baseline) | 3 | Alert only; notify security team | | RB-SEC-01 | Security events require human triage |
| REM-SEC-02 | Management plane access from unexpected source IP | 3 | Alert only | | RB-SEC-02 | |
| REM-SEC-03 | Security ACL change detected outside pipeline | 4 | Never automate | | RB-SEC-03 | Out-of-band ACL changes require immediate human investigation; automated response risks masking a security incident |
| REM-SEC-04 | New BGP peer announcement from untrusted AS | 4 | Never automate | | RB-SEC-04 | Requires security and routing team joint review |

| ID | Scenario | Tier | Permitted Action | Hard Stops | Runbook | Notes |
|---|---|---|---|---|---|---|
| REM-HW-01 | Interface CRC error rate > threshold | 3 | Alert only; flag for hardware review | | RB-HW-01 | |
| REM-HW-02 | Transceiver receive power below threshold | 3 | Alert only | | RB-HW-02 | Hardware replacement required |
| REM-HW-03 | CPU utilisation > 80% sustained 10 min | 3 | Alert only | | RB-HW-03 | Root cause investigation required |
| REM-HW-04 | Memory utilisation > 85% | 3 | Alert only | | RB-HW-04 | |
| REM-HW-05 | Power supply failure | 3 | Alert only; page on-call | | RB-HW-05 | Physical intervention required |
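Several triggers in this register are "sustained" thresholds, such as REM-HW-03 (CPU utilisation > 80% sustained 10 min). One way to test such a condition is sketched below. The `sustained_above` helper is hypothetical and assumes samples are `(timestamp, value)` pairs in ascending time order; it reports a breach only when every sample in the most recent unbroken run exceeds the threshold and that run spans at least the required duration.

```python
from datetime import datetime, timedelta


def sustained_above(samples, threshold: float, duration: timedelta) -> bool:
    """True if the latest unbroken run above threshold lasts >= duration."""
    if not samples:
        return False
    latest = samples[-1][0]
    breach_start = None
    # Walk backwards until the first sample at or below the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and latest - breach_start >= duration
```

The hardware entries above are all Tier 3, so a positive result would feed an alert rather than an action.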

Tier 4 — Permanently Excluded from Automation

The following actions are permanently excluded from automated execution, regardless of how well the root cause is understood. Record the reason explicitly.

| Scenario | Reason for permanent exclusion |
|---|---|
| Security ACL removal or permissive rule addition | Regulatory requirement (MiFID II / FCA SYSC): all firewall/ACL changes require human authorisation |
| BGP policy modification | Routing policy changes have wide blast radius; human architectural review required |
| VRF creation or deletion | Cross-zone routing implication; requires security and architecture approval |
| Any configuration change on border/edge devices | Highest risk; requires change control and senior approval |
| Firmware or OS upgrade | Non-reversible; requires maintenance window and human oversight |
| [Add additional exclusions] | [Reason] |
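A permanent exclusion is easiest to rely on when it is enforced in code as well as documented. The sketch below shows one possible guard, assuming the automation layer tags each action with a category string; the category keys, the `PermanentlyExcluded` exception, and `assert_automatable` are all hypothetical names, and the reasons are abbreviated from the table above.

```python
# Hypothetical category keys mapped to the recorded exclusion reasons.
PERMANENT_EXCLUSIONS = {
    "security_acl_permissive_change":
        "Regulatory requirement: ACL changes require human authorisation",
    "bgp_policy_modification":
        "Wide blast radius; human architectural review required",
    "vrf_create_or_delete":
        "Cross-zone routing implication; security and architecture approval",
    "border_edge_config_change":
        "Highest risk; requires change control and senior approval",
    "firmware_or_os_upgrade":
        "Non-reversible; requires maintenance window and human oversight",
}


class PermanentlyExcluded(Exception):
    """Raised when automation attempts a Tier 4 action."""


def assert_automatable(action_category: str) -> None:
    """Raise if the action category is permanently excluded."""
    reason = PERMANENT_EXCLUSIONS.get(action_category)
    if reason is not None:
        raise PermanentlyExcluded(f"{action_category}: {reason}")
```

Calling the guard at the start of every remediation path means a misconfigured tier assignment elsewhere cannot silently enable an excluded action.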

Before adding a new Tier 1 entry, all of the following must be true:

  • Root cause is deterministic — the same trigger always has the same root cause
  • The remediation action is fully reversible
  • The action has been validated in a lab or staging environment
  • The automation has run in dry-run mode in production for at least 30 days without false positives
  • The hard stops are explicitly defined
  • A runbook exists for the manual equivalent
  • The entry has been reviewed and approved by the automation programme owner
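The checklist above can be captured as a record so that a proposed Tier 1 entry is rejected with a list of what is missing. This is a sketch under assumptions: `Tier1Checklist` and its field names are hypothetical, and the 30-day dry-run requirement is modelled as a count of consecutive production dry-run days without false positives.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Tier1Checklist:
    root_cause_deterministic: bool
    fully_reversible: bool
    validated_in_lab_or_staging: bool
    dry_run_days_without_false_positives: int
    hard_stops_defined: bool
    manual_runbook_exists: bool
    approved_by_programme_owner: bool

    def unmet(self) -> list:
        """Return the names of criteria that are not yet satisfied."""
        # Boolean criteria fail when they are exactly False.
        missing = [f.name for f in fields(self)
                   if getattr(self, f.name) is False]
        # The dry-run criterion needs at least 30 days.
        if self.dry_run_days_without_false_positives < 30:
            missing.append("dry_run_days_without_false_positives")
        return missing
```

An entry qualifies for Tier 1 only when `unmet()` returns an empty list.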

Promotion path: Tier 3 (observe) → Tier 2 (propose, validate judgement) → Tier 1 (auto-act)

New categories should start at Tier 3. Promotion to Tier 2 requires 30 days of alert-only observation with documented confirmation that alert conditions correctly identify the target scenario. Promotion to Tier 1 requires 60 days at Tier 2 with approval rate > 95% and zero cases where an approval was later judged incorrect.
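The promotion thresholds can be expressed as a single eligibility check. The sketch below assumes a hypothetical `eligible_for_promotion` function whose inputs are gathered elsewhere; it reports eligibility only, and the approvals needed for the tier change itself still apply.

```python
def eligible_for_promotion(current_tier: int, days_at_tier: int, *,
                           alerts_confirmed_accurate: bool = False,
                           proposals: int = 0,
                           approvals: int = 0,
                           incorrect_approvals: int = 0) -> bool:
    """Check the observation criteria for moving one tier towards Tier 1."""
    if current_tier == 3:
        # Tier 3 -> 2: 30 days of alert-only observation, with documented
        # confirmation that alerts identify the target scenario.
        return days_at_tier >= 30 and alerts_confirmed_accurate
    if current_tier == 2:
        # Tier 2 -> 1: 60 days, approval rate > 95%, no incorrect approvals.
        if days_at_tier < 60 or proposals == 0:
            return False
        return approvals / proposals > 0.95 and incorrect_approvals == 0
    # Tier 1 has nowhere to go; Tier 4 is handled by a separate approval.
    return False
```

Note that the approval-rate comparison is strict, so 95 approvals out of 100 proposals does not qualify.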


| Review trigger | Action |
|---|---|
| Quarterly review | Validate all Tier 1 entries against recent incident data; check hard stops are still sufficient |
| Auto-remediation acted incorrectly | Immediate review: demote to Tier 2 or 3 pending investigation |
| New device type added to estate | Review which entries apply; may require validation period before Tier 1 status |
| Regulatory change | Review all security-related entries |
| Post-incident review | Check whether automation should have acted differently |

Approval authority for tier changes:

| Change | Required approval |
|---|---|
| Tier 3 → Tier 2 | Automation lead |
| Tier 2 → Tier 1 | Automation programme owner + operations lead |
| Any → Tier 4 | Automation programme owner |
| Tier 4 → any lower | Automation programme owner + CISO/security lead |
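The approval table can be turned into a lookup that a change workflow consults before applying a tier change. In this sketch, `required_approvers` is a hypothetical helper; transitions the table does not list (for example, a demotion from Tier 1 to Tier 2) raise an error rather than guessing an approver.

```python
def required_approvers(from_tier: int, to_tier: int) -> set:
    """Return the roles that must approve a tier change, per the table."""
    if from_tier == 4 and to_tier < 4:
        return {"automation programme owner", "CISO/security lead"}
    if to_tier == 4 and from_tier != 4:
        return {"automation programme owner"}
    if (from_tier, to_tier) == (3, 2):
        return {"automation lead"}
    if (from_tier, to_tier) == (2, 1):
        return {"automation programme owner", "operations lead"}
    # Not covered by the approval table; needs an explicit decision.
    raise ValueError(f"No approval rule for Tier {from_tier} -> Tier {to_tier}")
```

Failing loudly on unlisted transitions surfaces gaps in the table so they can be resolved at the next review.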
Network Automation Handbook, Patrick Lau
This work is licensed under a Creative Commons Attribution-NonCommercial license.
You are free to use and adapt this material within your organisation for internal purposes. Republishing, selling, or distributing this content (in whole or in part) as a book, course, or other commercial product is not permitted without explicit permission.