
Operations Automation

“The pipeline handles change. Operations automation handles everything that happens between changes.”

The CI/CD pipeline is a well-defined, bounded workflow. It runs when a change is proposed, validates it, deploys it, and confirms the outcome. But the network operates continuously — faults occur outside change windows, configurations drift, incidents require diagnosis and response, and operational data accumulates that could inform better decisions if anyone had time to look at it.

Operations automation addresses all of this. It is the layer that monitors the network continuously, detects deviations from intended state, initiates diagnostic and remediation workflows, and ensures that the operational workload scales with the automation platform rather than with the number of engineers on shift.

This chapter covers the five components of operations automation — observability, incident response, change execution, auto-remediation, and the closed-loop principle — and applies the product thinking discipline that determines whether operational automation is sustained or abandoned.


Before describing the components, it is worth being precise about how operations changes at each maturity level. The transition is not just from manual to automated — it is from reactive to proactive, and eventually to predictive.

| Maturity Level | Operational Model | Engineer's Primary Activity |
| --- | --- | --- |
| Level 1 | Ticket-driven, manual triage | Responding to incidents and change requests |
| Level 2 | Some scripts reduce repetition | Responding to incidents; scripting for known patterns |
| Level 3 | Standardised workflows, consistent execution | Triaging incidents; operating automation platform |
| Level 4 | Automated detection, runbook automation, drift correction | Managing the automation platform; handling exceptions |
| Level 5 | Closed-loop, self-healing for known classes | Governing intent; reviewing and refining automation |

The operational goal of this chapter’s patterns is to move teams from Level 3 toward Level 4: shifting the engineer’s primary activity from reactive execution to exception handling and platform improvement. Level 5 capabilities — the closed-loop, self-healing system — are addressed in Chapter 11.


Operations automation begins with reliable observation. A system that cannot accurately detect what is happening cannot respond correctly. Alert noise — false positives that trigger unnecessary responses — is as damaging as detection gaps.

A complete operational observability stack has three layers:

```mermaid
graph TD
    DEV["Network Devices<br>(EOS, IOS, etc.)"]

    subgraph "Collection"
        TEL["Streaming Telemetry<br>gNMI<br>High-frequency, structured"]
        SNMP["SNMP Polling<br>Legacy fallback<br>Lower frequency"]
        LOG["Syslog<br>Event-driven<br>Text-based"]
        CFG["Config Backup<br>Oxidized<br>Change detection"]
    end

    subgraph "Processing"
        AGG["Aggregation & Normalisation"]
        COR["Correlation & Enrichment"]
    end

    subgraph "Consumption"
        DASH["Operations Dashboard"]
        ALERT["Alerting & Incident Trigger"]
        DRIFT["Drift Detection"]
        FEED["Closed-Loop Feedback"]
    end

    DEV --> TEL & SNMP & LOG & CFG
    TEL & SNMP & LOG --> AGG
    CFG --> DRIFT
    AGG --> COR
    COR --> DASH & ALERT & FEED
    DRIFT --> ALERT
```

Streaming telemetry (gNMI) provides high-frequency, structured data — interface counters, BGP session state, hardware health, routing table changes — streamed continuously from the device. This is the data foundation for real-time operational visibility and, ultimately, for closed-loop automation. Modern devices supporting OpenConfig or vendor telemetry models should be configured for streaming telemetry as the primary collection mechanism.

SNMP polling remains necessary for older devices or for data not available via streaming telemetry. It is the legacy baseline, not the long-term strategy. Where both are available, prefer streaming telemetry for its lower overhead and higher granularity.

Syslog provides event-driven visibility — link state changes, BGP session events, authentication failures, hardware alerts. Centralised syslog with structured parsing turns device events into searchable, correlatable records.

Configuration backup and change detection (Oxidized or equivalent) periodically backs up running device configurations and detects when they change between backup cycles. A configuration change on a device that was not initiated through the automation pipeline is a drift event. The backup system is the observation layer for drift detection.
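At its core, this comparison is a line diff between the last configuration the pipeline deployed and the most recent backup. A minimal sketch using Python's standard library (the function name and inputs are illustrative, not Oxidized's API):

```python
import difflib

def detect_drift(deployed_config: str, backup_config: str) -> list:
    """Return the changed lines between the last pipeline-deployed
    configuration and the most recent backup. An empty list means no
    drift; any '+'/'-' lines are candidate drift events."""
    diff = difflib.unified_diff(
        deployed_config.splitlines(),
        backup_config.splitlines(),
        fromfile="deployed",
        tofile="backup",
        lineterm="",
    )
    # Keep only real content changes, not the diff header lines
    return [l for l in diff if l.startswith(("+", "-"))
            and not l.startswith(("+++", "---"))]

deployed = "hostname leaf1\nsnmp-server host 10.0.0.1\n"
backup   = "hostname leaf1\nsnmp-server host 10.0.0.9\n"
drift = detect_drift(deployed, backup)
# drift -> ['-snmp-server host 10.0.0.1', '+snmp-server host 10.0.0.9']
```

A non-empty result is a drift event; classifying that drift (expected, benign, or significant) is covered later in this chapter.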

The most common observability failure is not insufficient data — it is too much noise. An operations team that receives hundreds of alerts per day develops alert fatigue, begins ignoring alerts, and misses the signals that matter.

Alert design principles:

Alert on symptoms, not causes. A BGP session going down is a cause. A trading platform becoming unreachable is a symptom. Alert on the symptom — what the business experiences — and use automated diagnostics to determine the cause. This reduces the number of alerts while increasing their actionability.

Suppress correlated alerts. If a spine switch fails and 20 downstream BGP sessions drop simultaneously, that is one incident with 20 correlated alerts, not 20 separate incidents. Alert correlation — grouping related alerts into a single incident — is the single most effective noise reduction mechanism.
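A simple correlator can be sketched as grouping alerts that share a topology parent and arrive within a short window. This is a deliberately simplified model; production correlators also weigh alert type, topology paths, and suppression rules:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    timestamp: float          # seconds since epoch
    device: str
    upstream: Optional[str]   # topology parent, e.g. the spine above a leaf

def correlate(alerts: list, window_s: float = 60.0) -> list:
    """Group alerts sharing a topology parent within the same time
    window into one candidate incident."""
    incidents = []
    for a in sorted(alerts, key=lambda x: x.timestamp):
        key = a.upstream or a.device
        for inc in incidents:
            anchor = inc[0]
            if (anchor.upstream or anchor.device) == key \
                    and a.timestamp - anchor.timestamp <= window_s:
                inc.append(a)
                break
        else:
            incidents.append([a])
    return incidents

# A failed spine takes 20 leaf sessions down: one incident, not 20 pages.
storm = [Alert(100.0 + i, f"leaf{i}", "spine1") for i in range(20)]
```

The grouping key is the design choice that matters: correlating on the shared upstream element is what collapses a fan-out failure into a single incident.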

Define alert thresholds based on operational experience. An interface at 80% utilisation may be normal for some links and alarming for others. Start with conservative thresholds and refine based on operational experience. Alerts that fire frequently without requiring action should be reconfigured or suppressed.

Separate operational alerts from informational notifications. An alert should require a response. An informational notification (a successful deployment, a drift correction, a capacity threshold reached) should be visible but not urgent. The distinction matters: if everything is an alert, nothing is.

SuzieQ: structured network state observability


Telemetry and syslog answer the question what is happening to the network right now. SuzieQ answers a related but distinct question: what is the state of the network, and how does it compare to what it was at a previous point in time.

SuzieQ is an open-source network observability tool that collects structured network state — routing tables, BGP sessions, interfaces, MAC tables, VLAN assignments, OSPF adjacencies, and more — across multi-vendor environments and stores it in a queryable time-series database. Where streaming telemetry delivers metrics and counters, SuzieQ captures the logical and operational state of the network as a structured dataset.

What SuzieQ captures

SuzieQ operates by connecting to devices via SSH or REST APIs and collecting state information across a defined set of network primitives:

| State Category | Examples |
| --- | --- |
| Routing | Route tables, BGP session state, next-hops, route origins |
| Interfaces | Interface state, MTU, speed, error counters |
| Layer 2 | MAC tables, VLAN assignments, spanning tree state |
| Protocols | OSPF adjacencies, EVPN overlays, LLDP neighbours |
| Device | CPU, memory, software version, uptime |

Each collection run produces a snapshot. SuzieQ accumulates these snapshots over time, enabling queries that span historical state. This is the property that makes it operationally valuable: the ability to compare now against then.

Time-travel troubleshooting

The most immediate operational benefit of SuzieQ is reduced diagnosis time: historical state becomes queryable, without relying on engineer memory or manual data collection after the fact.

A typical troubleshooting sequence without SuzieQ:

  1. Alert fires
  2. Engineer connects to affected devices and collects current state
  3. Engineer attempts to reconstruct what state was — based on logs, telemetry graphs, and notes — at the time the problem began
  4. Time is lost reconstructing context that was never captured systematically

With SuzieQ:

  1. Alert fires
  2. Engineer queries SuzieQ for BGP session state, route table, and interface status at the time of the alert, and compares against state 30 minutes prior
  3. The diff shows exactly what changed: which route was withdrawn, which neighbour was lost, which interface transitioned
  4. Diagnosis is data-driven from structured historical state, not reconstructed from incomplete evidence
```mermaid
graph LR
    SZQ["SuzieQ<br>Structured state DB<br>Historical snapshots"]
    NOW["Current State Query<br>What does the network look like right now?"]
    HIST["Historical Query<br>What did it look like before the incident?"]
    DIFF["State Diff<br>What changed between T-30min and T-now?"]

    SZQ --> NOW & HIST
    NOW & HIST --> DIFF
    DIFF --> DIAG["Faster Diagnosis / Reduced MTTD"]
```
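The diff at the centre of this workflow is conceptually a comparison of two keyed snapshots. A sketch of that comparison (this is not SuzieQ's API; SuzieQ exposes equivalent queries through its CLI and REST interface):

```python
def state_diff(before: dict, after: dict) -> dict:
    """Compare two state snapshots (e.g. BGP peer -> session state)
    and report what was added, removed, or changed between them."""
    return {
        "added":   {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {k: (before[k], after[k])
                    for k in before.keys() & after.keys()
                    if before[k] != after[k]},
    }

t_minus_30 = {"10.0.0.1": "Established", "10.0.0.2": "Established"}
t_now      = {"10.0.0.1": "Established", "10.0.0.2": "Idle"}
diff = state_diff(t_minus_30, t_now)
# diff["changed"] -> {"10.0.0.2": ("Established", "Idle")}
```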

Integration into the observability stack

SuzieQ complements rather than replaces streaming telemetry and syslog. Telemetry provides high-frequency metrics for alerting and trending. SuzieQ provides structured state snapshots for diagnosis and comparative analysis. The two serve different queries and should both be present in a mature observability stack.

A practical deployment pattern:

  • SuzieQ collects state snapshots on a regular cycle — every 60 seconds for critical network elements, every 5 minutes for more stable segments
  • Snapshot frequency increases automatically when an alert fires, ensuring high-resolution state history around incident windows
  • The automated diagnostic bundle assembled at alert time includes a SuzieQ state diff: the snapshot immediately before the alert compared against current state
  • Engineers can query SuzieQ directly via CLI or its REST API from within incident response workflows

Proactive correctness checks

Beyond troubleshooting, SuzieQ enables scheduled correctness verification — queries that confirm the network is in the expected state without waiting for an alert to surface a problem:

  • Are all BGP sessions that should be established, established? — a scheduled SuzieQ query verifies this every 5 minutes across all devices
  • Is every leaf carrying the expected VLAN set? — a query compares the observed VLAN table against the SoT
  • Has any prefix changed its next-hop in the last hour? — a query surfaces routing changes that may not have triggered a telemetry threshold

These checks provide an additional layer of intent verification: confirming that what the network is doing matches what it should be doing, expressed as structured queries rather than static alert thresholds. Any deviation feeds into the incident response workflow with its context already attached.
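The first check above has a simple shape: take the expected session set from the SoT, take the observed state from the latest snapshot, and report every deviation with its context attached. A sketch (the data structures are illustrative):

```python
def bgp_completeness(expected: set, observed: dict) -> list:
    """Verify every (device, peer) session the SoT expects is Established.
    Returns one structured finding per deviation, ready to enter the
    incident response workflow with its context already attached."""
    findings = []
    for device, peer in sorted(expected):
        state = observed.get((device, peer), "missing")
        if state != "Established":
            findings.append({"device": device, "peer": peer,
                             "expected": "Established", "actual": state})
    return findings

# Expected sessions come from the SoT; observed state from the latest snapshot.
expected = {("leaf1", "spine1"), ("leaf1", "spine2")}
observed = {("leaf1", "spine1"): "Established", ("leaf1", "spine2"): "Idle"}
# bgp_completeness(expected, observed) reports leaf1 -> spine2 as Idle
```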


ACME’s lon-dc1 fabric streams interface counters and BGP session state via gNMI every 30 seconds. BGP session state changes trigger immediate alerts regardless of the polling cycle. Interface utilisation alerts fire at 70% sustained for 5 minutes (warning) and 90% sustained for 2 minutes (critical).

The branch offices use SNMP polling at 5-minute intervals for capacity metrics, with syslog for event-driven alerts. Oxidized backs up all device configurations every 6 hours and triggers a drift alert on any change not present in the pipeline’s deployment log within the last 8 hours.

ACME deploys SuzieQ across the lon-dc1 fabric and all branch office CE devices. Snapshot collection runs every 60 seconds for the DC fabric and every 5 minutes for branch offices. Two SuzieQ-driven checks run on a scheduled basis:

BGP completeness check — every 5 minutes, verifies that all expected eBGP and iBGP sessions are established across the fabric. Any session not in the expected state raises a classification event in the incident response workflow rather than a raw alert, ensuring full diagnostic context is assembled before an engineer is paged.

Route table consistency check — every 15 minutes, verifies that all leaf switches have a consistent view of key prefixes (management ranges, trading platform subnets, inter-DC links). Any prefix missing from more than one leaf in the same pod triggers an immediate alert.

The diagnostic bundle assembled on alert fire now includes a SuzieQ state diff covering the 30 minutes prior to the alert. In ACME's environment, this reduced the average diagnosis time for routing-related incidents from 12 minutes to under 4 minutes, eliminating the manual effort of reconstructing pre-incident network state.


When an alert fires, two things happen in sequence: diagnosis and response. In manual operations, both are human activities. In an automated operations environment, diagnosis is largely automated, and response is automated for known, low-risk incident patterns.

The first automated step after an alert fires is diagnostic data collection. Before a human engineer is paged, the system should have already gathered the information they would collect manually:

  • Device state at the time of the alert (BGP session state, interface status, routing table)
  • Recent syslog events from affected devices
  • Telemetry data showing the state trajectory leading up to the alert
  • Configuration diff between current state and the last known-good SoT backup
  • SuzieQ state diff showing exactly what changed in network state in the period leading up to the alert
  • Correlated events from other devices in the affected path

Packaging this information automatically — attached to the incident ticket when it is created — significantly reduces diagnosis time. An engineer receiving a page finds a complete diagnostic context, not a bare alert.

```mermaid
graph LR
    ALERT["Alert Fires<br>(BGP session down)"] --> DIAG["Automated Diagnostics<br>─────────────────<br>BGP session state<br>Interface status<br>Recent syslog events<br>Telemetry timeline<br>Config vs SoT diff"]
    DIAG --> TICKET["ITSM Ticket Created (with diagnostic bundle)"]
    TICKET --> CLASSIFY{"Classify: Known pattern?"}
    CLASSIFY -->|"Yes"| AUTO["Automated Runbook Executes"]
    CLASSIFY -->|"No"| PAGE["Page On-Call Engineer (with diagnostics)"]
    AUTO --> VERIFY["Verify Resolution"]
    VERIFY -->|"Resolved"| CLOSE["Close Ticket"]
    VERIFY -->|"Not Resolved"| PAGE
```
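The collection step that precedes paging can be sketched as a set of pluggable collectors run against the alert, with a failing collector degrading the bundle rather than aborting it (collector names and outputs here are illustrative):

```python
import time
from typing import Callable

def assemble_bundle(alert: dict, collectors: dict) -> dict:
    """Run every diagnostic collector against the alert and attach the
    results. A failing collector is recorded rather than fatal: a
    partial bundle is still far more useful than a bare alert."""
    bundle = {"alert": alert, "collected_at": time.time(), "diagnostics": {}}
    for name, collect in collectors.items():
        try:
            bundle["diagnostics"][name] = collect(alert)
        except Exception as exc:
            bundle["diagnostics"][name] = f"collection failed: {exc}"
    return bundle

# Stand-in collectors; real ones would query telemetry, syslog, SuzieQ, etc.
collectors: dict = {
    "bgp_state":   lambda a: {"10.0.0.2": "Idle"},
    "syslog_tail": lambda a: ["%BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down"],
    "config_diff": lambda a: [],   # no drift against the SoT
}
bundle = assemble_bundle({"type": "bgp_down", "device": "leaf1"}, collectors)
```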

A runbook is a structured troubleshooting and resolution procedure for a known incident type. When automated, it becomes an executable workflow that the system can run in response to a classified alert.

The value of runbook automation is precision and consistency: the same steps, in the same order, every time, with every output logged. The human reviewer can see exactly what the automation did and verify that it was correct.

ACME’s operational runbook library covers the following incident patterns:

| Incident Type | Classification Signal | Automated Response | Escalation Trigger |
| --- | --- | --- | --- |
| BGP session down | BGP session state alert + session has previously been stable | Check interface state; if interface up, attempt BGP clear | Interface also down, or session does not re-establish within 3 minutes |
| Interface flapping | Interface state changes >3 times in 5 minutes | Collect interface statistics, error counters; check port config | Physical error counters above threshold |
| Configuration drift detected | Oxidized diff alert | Compare drift to SoT; if drift is a known-safe pattern, log and suppress; otherwise alert | Drift affects security-relevant configuration |
| VLAN missing on leaf | Traffic blackhole alert on known VLAN | Verify VLAN in SoT; if present, re-apply SoT configuration to device | VLAN missing from SoT (design change required) |
| High CPU on device | CPU threshold alert | Collect process list; correlate with recent changes | CPU above threshold for >10 minutes |

The classification logic determines which runbook runs. For known, well-understood patterns, the runbook may include an auto-remediation step. For others, it stops after diagnostics and hands off to a human with a complete diagnostic bundle.

Escalation should be explicit, time-bound, and tracked. An incident that is not acknowledged within 5 minutes should escalate to a second on-call engineer. An incident not resolved within 30 minutes should escalate to the on-call manager. These thresholds are configured in the workflow orchestrator, not left to human initiative.
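Expressing those thresholds as data, evaluated by the orchestrator rather than remembered by humans, might look like this (a sketch, with timestamps in minutes for brevity):

```python
ESCALATION_POLICY = [
    # (condition, minutes_since_open, action)
    ("unacknowledged", 5,  "page_secondary_oncall"),
    ("unresolved",     30, "page_oncall_manager"),
]

def due_escalations(incident: dict, now_min: float) -> list:
    """Return the escalation actions now due for an incident. The
    orchestrator evaluates this on a timer; no human initiative needed."""
    age = now_min - incident["opened_at"]
    state = {
        "unacknowledged": incident.get("acknowledged_at") is None,
        "unresolved": incident.get("resolved_at") is None,
    }
    return [action for cond, minutes, action in ESCALATION_POLICY
            if state[cond] and age >= minutes]
```

Because the policy is data, changing a threshold is a one-line change that applies uniformly, and the audit trail records exactly which rule fired.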

The integration between the incident response workflow and the ITSM system (ServiceNow for ACME) provides the audit trail: every alert, every diagnostic action, every escalation, and every resolution step is recorded with timestamps and actor identity.


The CI/CD pipeline handles planned changes. Operations automation handles the change execution patterns that planned changes alone cannot address: scheduled maintenance windows, emergency changes, and the operational lifecycle of existing automation.

For change types that are well-understood and low-risk — scheduled firmware updates, certificate renewals, log file rotation on management systems — automation can execute changes in pre-defined maintenance windows without requiring an engineer to be present.

The pattern:

  1. The change is scheduled via the workflow orchestrator (day, time, scope)
  2. At the scheduled time, the orchestrator triggers the pipeline with the pre-validated change
  3. The pipeline executes: validation → deployment → post-deploy verification
  4. If verification passes, the orchestrator closes the change ticket and notifies stakeholders
  5. If verification fails, the pipeline rolls back and pages on-call for investigation

The key governance requirement: automated change windows must have a human-approved change ticket and a defined verification criterion. The automation executes a human decision — it does not make the decision to change.
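That governance requirement can be enforced as a hard gate the orchestrator evaluates before triggering the pipeline. A sketch (the field names are assumptions, not any specific orchestrator's schema):

```python
def may_execute(change: dict, now_hour: float):
    """Hard gate: automation executes a human decision, it does not make
    the decision to change. Returns (allowed, reason)."""
    if not change.get("ticket_approved"):
        return False, "no human-approved change ticket"
    if not change.get("verification"):
        return False, "no verification criterion defined"
    start, end = change["window"]        # approved window, e.g. (2.0, 4.0)
    if not (start <= now_hour < end):
        return False, "outside approved maintenance window"
    return True, "ok"

change = {"ticket_approved": True,
          "verification": "all BGP sessions established",
          "window": (2.0, 4.0)}         # 02:00-04:00 maintenance window
# may_execute(change, 3.0) -> (True, "ok")
```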

Emergency changes — network changes required outside the normal pipeline process to restore service during an incident — need a defined break-glass procedure. The procedure must be fast enough to be useful during an incident while maintaining the governance properties that make the automation trustworthy.

ACME’s break-glass procedure:

  1. The engineer obtains break-glass credentials (time-limited, tracked, separate from pipeline credentials)
  2. The engineer makes the minimum necessary change directly to the device via CLI
  3. The change is logged immediately in the ITSM incident ticket (what was changed, why, at what time)
  4. Within 24 hours, the engineer raises a pipeline change to update the SoT to reflect the emergency change
  5. Oxidized detects the drift and raises a drift alert; the drift is acknowledged against the open incident ticket
  6. The pipeline change closes the loop: the emergency change is normalised into the automation model

Emergency changes that are never normalised into the SoT are the primary source of configuration drift. The break-glass procedure must include the normalisation step as a requirement, not a recommendation.
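The normalisation requirement is checkable: a scheduled job can flag any break-glass drift event that has no corresponding SoT change after the deadline. A sketch with illustrative field names:

```python
def overdue_normalisations(drift_events: list, now_h: float,
                           deadline_h: float = 24.0) -> list:
    """Flag break-glass drift events with no corresponding SoT change
    raised within the deadline -- the primary source of lingering drift."""
    return [e for e in drift_events
            if e.get("incident_ticket")            # acknowledged break-glass drift
            and not e.get("sot_change_raised")     # ...but never normalised
            and now_h - e["detected_at"] > deadline_h]

events = [
    {"incident_ticket": "INC-101", "sot_change_raised": False, "detected_at": 0.0},
    {"incident_ticket": "INC-102", "sot_change_raised": True,  "detected_at": 0.0},
]
# At now_h=30.0, INC-101 is overdue for normalisation; INC-102 is not.
```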


Auto-remediation is the capability that distinguishes a mature operations platform from a well-monitored manual operation. It is also the capability most often implemented recklessly — automating responses without sufficient classification logic, resulting in automation that makes problems worse.

The key design principle: risk-tier first, automate second.

Configuration drift detection and correction


Drift — the divergence between a device’s running configuration and the source of truth — is detected by comparing Oxidized backups against the last-deployed configuration from the pipeline.

Three categories of drift, each with a different response:

Expected drift: Drift that is known, documented, and acceptable. Dynamic routing state (BGP route tables, ARP/MAC caches) changes continuously and should not trigger alerts. The drift detection system must be configured to distinguish between static configuration drift and expected dynamic state changes.

Benign drift: Configuration changes that do not affect security, compliance, or operational correctness — a description field update, a log message format change applied by an automated platform upgrade. These should be logged and reported but not trigger an immediate response. The next pipeline run will correct them.

Significant drift: Configuration changes that affect security policies, routing behaviour, or management plane access. These require immediate attention: alert the operations team, identify the source of the drift, and remediate via the pipeline.

For benign and significant drift, the remediation path is the same pipeline used for planned changes — a merge request that updates the SoT to either accept the drift (if it was an intentional out-of-band change) or revert it (if it was unintended). This maintains the governance discipline even for remediation actions.
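The three-way classification can be expressed as ordered pattern lists over the drift diff, with unmatched changes failing safe to "significant". The patterns below are placeholders; real ones come from policy review:

```python
import re

# Placeholder pattern lists -- real classifications come from policy review
EXPECTED    = [r"ntp clock-period"]                    # device-managed churn
BENIGN      = [r"\s*description ", r"logging format"]
SIGNIFICANT = [r"ip access-list", r"router bgp", r"line vty"]

def classify_drift(changed_lines: list) -> str:
    """Classify a drift diff into the three categories. 'significant'
    dominates 'benign' dominates 'expected'; anything unmatched fails
    safe to 'significant'."""
    verdict = "expected"
    for line in changed_lines:
        body = line.lstrip("+-")
        if any(re.match(p, body) for p in SIGNIFICANT):
            return "significant"
        if any(re.match(p, body) for p in BENIGN):
            verdict = "benign"
        elif not any(re.match(p, body) for p in EXPECTED):
            return "significant"    # unknown drift: assume the worst
    return verdict
```

The fail-safe default matters: drift the classifier has never seen should alert a human, not be silently suppressed.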

Beyond drift, some operational conditions warrant automated response triggered by telemetry events rather than configuration comparison.

The pattern: observe → classify → decide → act.

```mermaid
graph LR
    OBS["Observe<br>Telemetry event or alert fires"] --> CLASS["Classify<br>Is this a known remediable pattern?"]
    CLASS -->|"Known, low-risk"| DIAG["Diagnose<br>Collect full context before acting"]
    CLASS -->|"Unknown or high-risk"| HUMAN["Human Review<br>Page on-call with diagnostics"]
    DIAG --> RISK{"Risk Assessment"}
    RISK -->|"Auto-remediate"| ACT["Act<br>Execute runbook automatically"]
    RISK -->|"Propose only"| PROP["Propose<br>Generate remediation for human approval"]
    ACT --> VERIFY["Verify<br>Did remediation resolve the condition?"]
    VERIFY -->|"Yes"| LOG["Log & Close"]
    VERIFY -->|"No"| HUMAN
```

Not all remediable conditions are equal. The risk tier determines the level of automation applied:

| Tier | Condition Examples | Automated Response | Human Involvement |
| --- | --- | --- | --- |
| Tier 1 — Auto-fix | BGP session clear after transient drop; VLAN re-apply after verified drift; interface error counter reset | Full automatic execution | Post-hoc notification only |
| Tier 2 — Propose and approve | Routing policy change; ACL modification; new static route | Generate proposed change as MR; alert engineer for approval | Approval required before execution |
| Tier 3 — Alert only | Security policy deviation; unexpected new BGP neighbour; management plane access anomaly | Immediate alert with full diagnostic bundle | Full human investigation required |
| Tier 4 — Never auto-remediate | Changes that affect trading system connectivity during market hours; changes to firewall zone boundaries; device replacement | Human-initiated only | Full change management process |

The tier assignments are not permanent — they evolve as the team gains confidence in specific automation patterns and as the verification logic matures. A remediation that starts at Tier 2 may move to Tier 1 after six months of successful execution. A remediation that produces unexpected side-effects should move up a tier until the root cause is understood.

ACME’s Tier 1 automation handles two primary patterns:

BGP session recovery: If a BGP session drops and the interface remains up, the system waits 90 seconds (allowing for normal BGP hold-timer expiry) and attempts a soft BGP clear. If the session re-establishes, the incident is logged and closed automatically. If it does not re-establish within 3 minutes, the system escalates to on-call with the full diagnostic bundle.
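The recovery logic reads directly as a runbook. A sketch, with `device_api` standing in for whatever actually drives the device (gNMI, eAPI, and so on); the stub class shows how the logic can be exercised without touching hardware:

```python
import time

def bgp_recovery_runbook(session: dict, device_api, wait_s: int = 90,
                         verify_timeout_s: int = 180, poll_s: int = 5) -> str:
    """Tier 1 BGP recovery as an executable runbook: every branch
    returns a logged outcome rather than acting silently."""
    if not device_api.interface_up(session["interface"]):
        return "escalate: interface down"        # different fault class
    time.sleep(wait_s)                           # allow hold-timer expiry
    device_api.soft_clear_bgp(session["peer"])
    deadline = time.monotonic() + verify_timeout_s
    while time.monotonic() < deadline:
        if device_api.bgp_state(session["peer"]) == "Established":
            return "resolved: session re-established"
        time.sleep(poll_s)
    return "escalate: session did not re-establish"

class StubAPI:
    """Stand-in device driver, useful for reviewing the runbook logic."""
    def __init__(self, up=True, state="Established"):
        self.up, self.state = up, state
    def interface_up(self, intf): return self.up
    def soft_clear_bgp(self, peer): pass
    def bgp_state(self, peer): return self.state
```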

Management plane drift correction: If Oxidized detects that a device’s management configuration (syslog servers, SNMP configuration, management VRF) has drifted from the SoT, and the drift is not associated with any open change ticket, the system raises a pipeline merge request to re-apply the SoT configuration. The MR is auto-approved if the drift is exclusively in management configuration (Tier 1 scope) and the pipeline passes all validation stages.

These two patterns alone eliminate a class of manual operational tasks that previously consumed several hours of engineering time per week.


The four components above — observability, incident response, change execution, and auto-remediation — each address a specific operational challenge. The fifth, the closed-loop principle, connects them into a continuous feedback cycle.

The loop has four steps:

```mermaid
graph LR
    OBS["Observe<br>Telemetry, logs,<br>config backups"] -->|"deviation detected"| CMP["Compare<br>Actual state vs<br>intended state (SoT)"]
    CMP --> DEC["Decide<br>Risk tier determines<br>response type"]
    DEC --> REM["Remediate<br>Pipeline or runbook<br>applies correction"]
    REM --> OBS
```

Observe: Telemetry, syslog, and configuration backup systems continuously monitor actual network state.

Compare: The observed state is compared against the intended state — the source of truth and the design intents. Any deviation is a candidate for automated response.

Decide: The risk-tiered framework determines whether the response is automatic, proposed for approval, or escalated to a human.

Remediate: The correction is applied via the automation pipeline, maintaining the governance discipline that applies to all changes.

The loop then restarts: the post-remediation state is observed, compared against intent, and if the remediation was successful, the deviation no longer exists. If it is not successful, the loop escalates.
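One pass of the loop can be sketched as an orchestration function with pluggable observe, decide, and remediate callables (a sketch, not any specific platform's API):

```python
def closed_loop_step(observe, intended_state: dict, decide, remediate) -> str:
    """One pass of observe -> compare -> decide -> remediate, with each
    stage a pluggable callable supplied by the platform."""
    actual = observe()                                        # OBSERVE
    deviations = {k: (v, actual.get(k))                       # COMPARE
                  for k, v in intended_state.items()
                  if actual.get(k) != v}
    if not deviations:
        return "in compliance"
    action = decide(deviations)                               # DECIDE
    if action == "auto":
        remediate(deviations)                                 # REMEDIATE
        return "remediated"
    return f"handed off: {action}"                            # propose/escalate
```

The orchestrator runs this continuously; the post-remediation state is simply the input to the next pass.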

This is the operational expression of the handbook’s core principle: configuration is an output of intent, not an input. The closed loop continuously re-asserts intent against actual state. Deviations are detected immediately and addressed systematically — not discovered in the next audit or during the next incident.

The closed-loop principle is the foundation for the self-healing capabilities described in Chapter 11. The difference between operations automation (this chapter) and self-healing (Chapter 11) is the scope of what is remediable: operations automation handles well-understood, bounded patterns; self-healing extends this to broader and more complex failure modes using intent-aware reasoning.


Product Thinking for Operations Automation


Operations automation has all the properties of an internal product: it has users (the operations team), it has consumers (the teams that depend on the network), it has features (the runbooks, the alerts, the auto-remediations), and it accumulates technical debt if not actively maintained.

Treating it as a product rather than a configuration exercise determines whether it remains useful or slowly degrades.

Every operational pain point that automation could address should be in a backlog, prioritised by frequency and impact. The backlog is maintained by whoever owns the automation platform — the function introduced in the A-Team framework in Chapter 2.

Useful backlog items:

  • A runbook for a recurring incident type that is currently handled manually
  • An alert threshold that is generating false positives and needs refinement
  • A drift pattern that is being manually corrected and should be automated
  • A diagnostic data collection step that engineers always perform manually after an alert

The backlog should be reviewed quarterly and prioritised based on operational data: which incident types are consuming the most engineer time? Which alert patterns have the highest false-positive rate? Which manual tasks recur most frequently?

The metrics that demonstrate the value of operations automation and track its health:

| Metric | What It Measures | Target Direction |
| --- | --- | --- |
| MTTD (Mean Time to Detect) | Time from incident occurrence to alert fire | Decreasing |
| MTTR (Mean Time to Resolve) | Time from alert to incident resolution | Decreasing |
| Auto-resolution rate | % of incidents resolved without human intervention | Increasing |
| False-positive rate | % of alerts that did not require action | Decreasing |
| Drift frequency | Rate of configuration drift events per week | Decreasing |
| Runbook coverage | % of recurring incident types with an automated runbook | Increasing |
| Engineer time on reactive work | % of ops team time spent on incident response | Decreasing |

These metrics should be visible to the operations team and to engineering leadership. The auto-resolution rate and MTTR are the most meaningful for business communication — they translate directly to service reliability and operational cost.

Operations automation has a specific trust problem: it acts on production infrastructure, often without a human in the loop. Engineers who do not trust the automation will disable it, bypass it, or override it. Engineers who over-trust it will not notice when it produces incorrect results.

Trust is built through transparency and graduated autonomy:

Dry-run mode first. When deploying a new runbook or auto-remediation pattern, run it in dry-run mode for two weeks: execute all the diagnostic steps, generate the proposed remediation, and log what the automation would have done — but do not execute the remediation. Review the dry-run logs. If the proposed actions are consistently correct, promote to Tier 1 or Tier 2 execution.
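Dry-run mode is typically a flag on the runbook executor: every step is logged with what it would have done, and nothing touches the device. A sketch:

```python
import time

def run_runbook(steps: list, execute_fn, dry_run: bool = True) -> list:
    """Execute a runbook's remediation steps. In dry-run mode every step
    is logged as 'would execute' and nothing touches the device; the log
    is what gets reviewed before promotion to live execution."""
    log = []
    for step in steps:
        entry = {"ts": time.time(), "step": step["name"], "dry_run": dry_run}
        if dry_run:
            entry["result"] = f"would execute: {step['command']}"
        else:
            entry["result"] = execute_fn(step["command"])
        log.append(entry)
    return log
```

Defaulting `dry_run` to `True` is deliberate: live execution is the explicit opt-in, never the accident.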

Every automated action is logged. Engineers should be able to see, at any time, exactly what the automation executed and why. Opaque automation — where the system acts but no one can see what it did — erodes trust even when the outcomes are correct.

Make it easy to intervene. The automation should never prevent an engineer from taking manual control. The break-glass procedure, the ability to disable specific runbooks, and clear escalation paths all demonstrate that the automation is a tool, not a replacement for human judgement.


| Template | Purpose | Format |
| --- | --- | --- |
| Incident Response Runbook Template | Structure for documenting and automating incident runbooks | Markdown |
| Auto-Remediation Risk Register | Risk tier assignments for automated responses | Markdown |

Operations automation is where the investment in the automation platform pays its most continuous dividend. The CI/CD pipeline handles change; operations automation handles everything between changes — detecting deviations, diagnosing incidents, correcting drift, and maintaining the network in conformance with stated intent.

The five components work together as a system. Observability provides the signal. Incident response automation converts signals into structured, contextualised responses. Change execution applies the automation discipline to the full change lifecycle. Auto-remediation closes the loop for known, bounded patterns. The closed-loop principle connects all five into a continuous cycle of observe, compare, decide, and remediate.

Treat operations automation as a product: maintain a backlog, measure adoption and impact, build trust through transparency, and evolve the automation based on operational experience. The automation that is trusted, maintained, and actively improved is the automation that scales the team’s capacity without scaling the team’s headcount.


Next: Chapter 9 — Greenfield Design — applying automation-native principles from the start.

Network Automation Handbook, by Patrick Lau
This work is licensed under a Creative Commons Attribution-NonCommercial license.
You are free to use and adapt this material within your organisation for internal purposes. Republishing, selling, or distributing this content (in whole or in part) as a book, course, or other commercial product is not permitted without explicit permission.