To turn OT incidents into security improvements, run a structured lessons learned process in two passes: a hotwash within 24–72 hours (capture what happened, what worked, what broke) and a root cause + corrective action review within 2–4 weeks (identify systemic causes, prioritize remediations, and assign owners and deadlines). The highest-impact improvements usually fall into five buckets: remote access hardening, conduit/segmentation tightening, identity and privilege hygiene, backup and recovery readiness, and OT-aware detection engineering. Track outcomes with metrics like time to validate operational impact, recurrence rate within 90 days, and percent of incidents with complete asset/zone context.
Why OT “lessons learned” often fail
Most organizations genuinely want to improve after an OT incident—then normal operations return, priorities shift, and the improvement work dies quietly. The failure patterns are predictable.
Failure pattern 1: Treating “lessons learned” as a meeting, not a system
A single meeting produces opinions. A system produces outcomes:
- documented facts,
- prioritized corrective actions,
- funded changes,
- verified closure,
- and measurable reduction in recurrence.
Failure pattern 2: Mixing safety/operations and security without clear boundaries
OT incidents live at the intersection of:
- safety (hazards and protections),
- operations (uptime, quality, throughput),
- engineering (controllers, HMIs, networks),
- security (identity, detection, containment).
If you don’t define who owns which decisions, the review becomes a debate instead of a plan.
Failure pattern 3: “Root cause” stops at the first convenient answer
Common “root causes” that are not roots:
- “Someone clicked a phishing email.”
- “A firewall rule was misconfigured.”
- “A vendor account was compromised.”
Those are triggers. The root cause is usually systemic: weak pathway controls, lack of MFA, flat networks, missing monitoring, untested restores, unclear approvals, or exceptions that became permanent.
Failure pattern 4: Fixes that are technically correct but operationally impossible
OT improvements fail when they:
- require outages no one approved,
- break vendor support agreements,
- require skills or tools the team doesn’t have,
- or conflict with production schedules.
Good corrective actions are not only secure—they are schedulable, testable, and supportable.
Failure pattern 5: No one tracks the actions to completion
If actions aren’t tracked like production work, they don’t ship.
OT lessons learned must end with:
- owners,
- deadlines,
- acceptance criteria,
- and verification.
What success looks like: the OT improvement loop
A mature organization treats every incident (even a near-miss) as input to a continuous improvement engine.
The OT incident-to-improvement loop
- Capture facts fast (hotwash)
- Diagnose root causes (technical + process)
- Prioritize corrective actions by consequence
- Implement in safe windows (engineering + operations alignment)
- Verify effectiveness (testing + monitoring)
- Update runbooks, detections, and architecture baselines
- Measure outcomes (recurrence, time-to-contain, time-to-restore safe operations)
A useful mindset shift
Instead of asking:
- “Who made the mistake?”
Ask:
- “What conditions made the mistake likely, repeatable, and high impact?”
This shifts reviews away from blame and toward resilience.
Two reviews, two purposes: hotwash vs root cause review
Trying to do everything in one meeting leads to shallow conclusions and missed evidence. Use two structured passes.
Review #1: Hotwash (within 24–72 hours)
Purpose: capture reality while it’s fresh.
Outputs:
- agreed incident timeline (high level),
- what worked / what didn’t,
- immediate “stop-the-bleeding” fixes,
- evidence locations and who owns deeper analysis.
Rules:
- no blame,
- no speculative attribution without evidence,
- focus on decisions and their outcomes.
Review #2: Root cause + corrective action review (within 2–4 weeks)
Purpose: produce a funded, owned improvement plan.
Outputs:
- root cause statement(s),
- contributing factors,
- corrective actions with owners and deadlines,
- updated risk register/exceptions list,
- detection and runbook updates,
- architecture changes and backlog.
Rules:
- decisions require acceptance criteria,
- actions must be operationally schedulable,
- assign a single accountable owner per action.
The OT lessons learned framework (step-by-step)
This section is the "how": a repeatable process you can operationalize across sites.
Step 1: Define the incident boundary (what was in-scope?)
Before you review, agree on scope:
- which sites, zones, and cells were affected,
- which systems were directly impacted (encrypted, unavailable, manipulated),
- which were indirectly impacted (lost visibility, auth failures, delayed operations),
- what time period is included (e.g., from initial access to full recovery).
Why it matters: without boundaries, teams argue past each other.
Step 2: Build a single timeline (the non-negotiable artifact)
Create one timeline with:
- timezone,
- known clock skew,
- source references (firewall logs, jump host sessions, OT monitoring, EDR, operator reports).
A reliable timeline enables:
- defensible root cause,
- precise corrective actions,
- and better detection rules.
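To make the artifact concrete, here is a minimal sketch of a normalized timeline entry, assuming clock skew per source has already been measured; source names, fields, and values are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Measured clock skew per evidence source (positive = source clock runs fast).
# These values are illustrative assumptions, not defaults.
CLOCK_SKEW = {
    "fw-dmz-01": timedelta(seconds=42),
    "jump-host-a": timedelta(seconds=-7),
    "ot-monitoring": timedelta(0),
}

@dataclass
class TimelineEntry:
    source: str              # e.g., firewall logs, jump host session, operator report
    raw_timestamp: datetime  # timestamp as recorded by the source (tz-aware)
    description: str
    reference: str           # where the evidence lives (case ID, export path, ticket)

    def normalized_utc(self) -> datetime:
        """Return the event time in UTC, corrected for the source's known skew."""
        skew = CLOCK_SKEW.get(self.source, timedelta(0))
        return (self.raw_timestamp - skew).astimezone(timezone.utc)

entries = [
    TimelineEntry("fw-dmz-01", datetime(2024, 3, 1, 2, 14, tzinfo=timezone.utc),
                  "First denied SMB attempt from IT range", "SIEM case export"),
]
timeline = sorted(entries, key=lambda e: e.normalized_utc())
```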
Step 3: Identify the “decision points” (where outcomes changed)
Most incidents pivot on a few decisions:
- when the incident was declared,
- whether remote access was shut down,
- whether a segment was isolated,
- whether operations continued or stopped,
- whether systems were rebuilt or "cleaned",
- the order in which systems were restored.
Document decision points like this:
- Decision: what was decided
- Owner: who approved
- Inputs: what evidence was available at the time
- Action: what changed in the environment
- Outcome: what improved/worsened
- Alternative: what you would do next time and why
Step 4: Separate symptoms, triggers, and causes
Use a simple classification:
- Symptom: what you saw (e.g., HMI unavailable, ransomware note)
- Trigger: what immediately caused it (e.g., encryption on OT DMZ file server)
- Cause: what allowed it (e.g., flat DMZ, shared local admin, no MFA, weak allowlisting)
- Control gap: what would have prevented or reduced impact (e.g., session recording, conduit allowlists, immutable backups)
Step 5: Convert gaps into corrective actions (with acceptance criteria)
Every gap should become an action that is:
- specific,
- testable,
- time-bound,
- and assigned.
Bad action: “Improve segmentation.”
Good action: “Implement conduit allowlist rules between OT DMZ and Level 3: permit only historian replication and patch distribution; block SMB and RDP by default; add rule expiry tracking. Validate with a 7-day monitoring period and zero unapproved flows.”
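To show how an action like this becomes testable, here is a minimal sketch that checks observed DMZ-to-Level-3 flows against an allowlist with rule expiry; the rule entries, ports, and field names are illustrative assumptions, not any vendor's configuration format.

```python
from datetime import date

# Approved OT DMZ -> Level 3 conduit rules, each with an owner and expiry (illustrative).
ALLOWED_FLOWS = [
    {"src_zone": "OT-DMZ", "dst_zone": "L3", "service": "historian-replication",
     "port": 5450, "owner": "OT Net Eng", "expires": date(2025, 6, 30)},
    {"src_zone": "OT-DMZ", "dst_zone": "L3", "service": "patch-distribution",
     "port": 8530, "owner": "OT Net Eng", "expires": date(2025, 6, 30)},
]
BLOCKED_BY_DEFAULT = {445, 3389}  # SMB and RDP are never implicit

def check_flow(src_zone, dst_zone, port, today=None):
    """Return (allowed, reason) for an observed flow during the validation window."""
    today = today or date.today()
    if port in BLOCKED_BY_DEFAULT:
        return False, "SMB/RDP blocked by default across this conduit"
    for rule in ALLOWED_FLOWS:
        if (rule["src_zone"], rule["dst_zone"], rule["port"]) == (src_zone, dst_zone, port):
            if rule["expires"] < today:
                return False, f"rule for {rule['service']} expired {rule['expires']}"
            return True, rule["service"]
    return False, "no matching conduit rule (unapproved flow)"

print(check_flow("OT-DMZ", "L3", 3389, today=date(2025, 3, 1)))  # blocked by default
print(check_flow("OT-DMZ", "L3", 5450, today=date(2025, 3, 1)))  # allowed: historian replication
```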
Step 6: Prioritize fixes using consequence and feasibility
OT prioritization must respect consequences. A useful scoring model:
- Consequence reduction (safety, downtime, integrity)
- Likelihood reduction (how often the path is used/abused)
- Feasibility (outage needed? vendor support? lead time?)
- Time-to-value (days/weeks vs quarters)
- Blast radius (site-specific vs fleet-wide)
A simple weighted score can help avoid politics. For a fast rubric, score each factor 1–5 and compute Score = 3C + 2L + F + T + B.
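A minimal sketch of that rubric, assuming each factor has already been scored 1–5:

```python
def priority_score(consequence, likelihood, feasibility, time_to_value, blast_radius):
    """Weighted rubric from the text: Score = 3C + 2L + F + T + B (each factor 1-5)."""
    for factor in (consequence, likelihood, feasibility, time_to_value, blast_radius):
        if not 1 <= factor <= 5:
            raise ValueError("each factor must be scored 1-5")
    return 3 * consequence + 2 * likelihood + feasibility + time_to_value + blast_radius

# Example: a high-consequence quick win vs. a long-running platform project.
print(priority_score(5, 4, 4, 5, 3))  # 35
print(priority_score(4, 3, 2, 1, 5))  # 26
```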
Step 7: Track actions like production work
Create an action register (not a slide) with:
- owner,
- due date,
- dependencies,
- change window requirement,
- test plan,
- verification date,
- and closure evidence.
If you have multiple plants, this becomes a fleet playbook: what’s “global standard” vs “site-specific exception.”
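As one way to enforce "tracked like production work," here is a minimal sketch of a register entry that refuses to close without verification evidence; the field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class CorrectiveAction:
    action_id: str
    statement: str
    owner: str                       # one accountable person
    due: date
    scope: str                       # site/zone, or "fleet" for a global standard
    change_window_required: bool
    acceptance_criteria: str
    dependencies: list = field(default_factory=list)
    verification_date: Optional[date] = None
    closure_evidence: Optional[str] = None  # link to test results, logs, screenshots

    def close(self, verification_date: date, evidence: str) -> None:
        """Closure requires verification evidence, not just 'implemented'."""
        if not evidence:
            raise ValueError("cannot close without closure evidence")
        self.verification_date = verification_date
        self.closure_evidence = evidence

ca = CorrectiveAction(
    "CA-002", "Block SMB/RDP DMZ->L3 except allowlist", "OT Net Eng",
    date(2025, 3, 31), "Site A DMZ", True,
    "Only approved flows observed over 7 days; drift alert enabled",
)
ca.close(date(2025, 4, 7), "7-day flow report attached to change ticket")
```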
Root cause analysis in OT: how to find the real causes
Root cause analysis (RCA) is not a formality. It’s how you avoid repeating the same incident with different malware names.
Use a “multi-root” model (because OT incidents rarely have one cause)
A single incident often has multiple roots:
- identity control failures,
- pathway/segmentation failures,
- monitoring failures,
- recovery readiness failures,
- process/communication failures.
The goal is not to pick one—it’s to identify the minimum set that meaningfully reduces risk.
Practical RCA methods for OT
1) The “5 Whys” (useful, but don’t stop early)
Example:
- Why did ransomware affect OT visibility? Because the historian server was encrypted.
- Why was the historian encrypted? Because the attacker accessed it from the OT DMZ.
- Why could the attacker access it from the OT DMZ? Because SMB/RDP were allowed broadly between DMZ and Level 3.
- Why were those protocols allowed broadly? Because exceptions were granted for troubleshooting and never removed.
- Why were exceptions never removed? Because there's no rule expiry tracking and no quarterly conduit review.
Root causes: exception governance + conduit allowlisting + protocol reduction.
2) Fault Tree thinking (great for consequence-focused OT)
Start with the top event (“loss of control visibility”) and branch into conditions that enabled it (auth failures, network path, server dependency, backup failure).
3) Bowtie-style mapping (good for safety + security alignment)
- Left side: threats and pathways
- Center: incident event
- Right side: consequences and recovery
Then list barriers that failed or were missing (preventive and mitigative).
Common OT root causes (the ones that keep recurring)
- Remote access not controlled (no MFA, shared accounts, no approvals)
- OT DMZ acting as a bridge (not a true buffer)
- Flat networks and permissive conduits
- Over-privileged service accounts and shared local admin
- Lack of OT-aware detection (no visibility into controller writes/downloads)
- Untested backups and unclear recovery order
- Weak change control and missing maintenance window context in SOC triage
- Vendor pathways unmanaged (always-on tools, broad access, no session logs)
From findings to fixes: building a corrective action plan that ships
The output of lessons learned should look like a delivery plan, not an incident report appendix.
The corrective action plan (CAP) structure
Group actions into four categories so leadership can fund and schedule:
- Immediate (0–14 days): low-risk, high-value containment/hardening
- Near-term (15–60 days): changes requiring coordination but minimal outages
- Planned (61–180 days): segmentation projects, platform upgrades, fleet rollouts
- Strategic (180+ days): architecture modernization, identity redesign, vendor contract changes
Write actions at the right level of specificity
Each action needs:
- Statement: what will change
- Owner: one accountable person
- Scope: which sites/zones/assets
- Dependencies: vendor support, outage windows, procurement
- Acceptance criteria: how you’ll prove it worked
- Rollback plan: if it impacts operations
- Date: due and review checkpoints
Use “compensating controls” when patching isn’t realistic
In OT, patching is often delayed. Don’t accept “can’t patch” as “can’t improve.”
Compensating controls include:
- conduit allowlists,
- removing direct inbound routes,
- jump-host-only access,
- application allowlisting on Windows systems,
- least privilege,
- strict vendor access approvals,
- enhanced monitoring and alerting.
Close the loop: validate effectiveness
A corrective action isn’t done when implemented. It’s done when verified.
Examples of verification:
- test restore of a critical OT server from offline backup
- tabletop exercise that proves the new escalation path works
- network test confirming blocked SMB across DMZ-to-L3
- monitoring validation: new detection triggers on controller writes from non-EWS hosts
- audit confirming vendor access is time-boxed and logged
The “Top 10” OT improvements that usually matter most
Across OT incidents (ransomware, remote access abuse, suspicious controller activity), these improvements repeatedly deliver outsized risk reduction.
1) Lock down remote access (employee + vendor)
Do:
- require MFA for all OT access paths,
- enforce jump-host-only access to OT zones,
- time-box vendor sessions and restrict targets,
- record sessions where possible,
- remove shared accounts.
Don’t:
- allow persistent vendor tools with broad reach,
- allow direct VPN into Level 2 networks.
2) Tighten OT DMZ boundaries and conduits
Do:
- treat the DMZ as a buffer, not a highway,
- allowlist only necessary flows,
- block SMB/RDP by default across conduits,
- monitor for new talkers and new services.
Don’t:
- permit “temporary” exceptions with no expiry.
3) Reduce lateral movement on Windows-heavy OT layers
Do:
- remove shared local admin passwords,
- restrict admin shares,
- limit RDP/WinRM,
- implement least privilege and strong credential hygiene.
Don’t:
- rely on “air gap assumptions” while pathways exist.
4) Create recoverable, tested backups (and protect them)
Do:
- maintain offline or immutable backups for critical OT systems,
- test restores quarterly (at least top-tier assets),
- store known-good images for jump hosts and engineering workstations.
Don’t:
- assume backups work because jobs say “success.”
5) Make controller integrity verifiable
Do:
- maintain baselines (“golden” logic/config) for critical controllers,
- implement change detection or scheduled comparisons where feasible,
- log and review engineering downloads/changes.
Don’t:
- treat controller state as unknowable after an incident.
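A minimal sketch of the scheduled-comparison idea, assuming controller logic/config can be exported to files and compared against offline golden copies; the paths and file names are illustrative.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of an exported controller project/config file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def matches_baseline(current_export: Path, golden_export: Path) -> bool:
    """Return True if the current export matches the offline golden baseline."""
    return file_digest(current_export) == file_digest(golden_export)

# Illustrative run against exports gathered during a maintenance window.
pairs = [
    (Path("exports/plc-line1-current.export"), Path("golden/plc-line1.export")),
    (Path("exports/plc-line2-current.export"), Path("golden/plc-line2.export")),
]
for current, golden in pairs:
    if current.exists() and golden.exists() and not matches_baseline(current, golden):
        print(f"REVIEW: {current.name} differs from baseline; check change tickets")
```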
6) Implement OT-aware detections that reflect consequences
Do:
- alert on controller writes and downloads,
- detect new talkers to controllers,
- flag cross-zone communications,
- detect scanning inside OT zones,
- correlate with maintenance windows.
Don’t:
- run OT detection as generic IDS noise.
7) Improve asset context (role + criticality + zone)
Do:
- tag assets with role (PLC/HMI/EWS/historian/jump host),
- map to zones/cells,
- assign criticality tiers.
Don’t:
- investigate incidents with only IP addresses and hostnames.
8) Establish “least-disruptive containment” runbooks
Do:
- define containment ladders (remote access → conduits → host isolation → cell isolation),
- require OT approval for disruptive actions,
- implement rule expiry and rollback plans.
Don’t:
- allow automated quarantines to hit OT assets without review.
9) Strengthen incident communications and approvals
Do:
- establish who can authorize firewall changes, isolation, shutdown,
- build a shared escalation template,
- create a single incident commander role and decision log.
Don’t:
- let parallel teams act independently in the first hour.
10) Practice with tabletops that reflect real constraints
Do:
- include vendors, operations, and controls,
- run scenarios: ransomware near OT, unauthorized controller write, compromised vendor access,
- measure time-to-scope and time-to-contain.
Don’t:
- run tabletop exercises that only test SOC workflows.
Detection and monitoring upgrades after incidents
Incidents are the best detection requirements document you will ever get, because they show where you were blind.
Turn incident artifacts into detection use cases
For each incident, ask:
- What did we see first?
- What should we have seen earlier?
- Which signals existed but weren’t collected?
- Which alerts fired but lacked context?
Then create a use-case backlog:
- log sources to onboard,
- correlation rules to add,
- enrichment to implement (site/zone/maintenance window),
- alert routing changes.
The OT detection improvements that pay back fastest
- Correlate VPN login → jump host session → target asset (see the sketch after this list)
- Alert on new remote access paths into OT zones
- Alert on controller writes/downloads outside approved windows
- Detect new talkers to PLC networks
- Detect SMB/RDP movement across DMZ-to-L3 conduits
- Detect backup deletion attempts (ransomware precursor)
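As an illustration of the first correlation above, here is a minimal sketch that stitches VPN, jump host, and target logs into one access chain; the record fields assume already-normalized logs rather than any specific product's schema.

```python
def correlate_ot_access(vpn_logins, jump_sessions, target_connections, max_gap_s=900):
    """Link VPN login -> jump host session -> OT target by user and time proximity."""
    chains = []
    for vpn in vpn_logins:
        for js in jump_sessions:
            if js["user"] != vpn["user"] or not 0 <= js["ts"] - vpn["ts"] <= max_gap_s:
                continue
            for tc in target_connections:
                if tc["src_host"] == js["jump_host"] and 0 <= tc["ts"] - js["ts"] <= max_gap_s:
                    chains.append({"user": vpn["user"], "jump_host": js["jump_host"],
                                   "target": tc["dst_asset"], "start": vpn["ts"]})
    return chains

# Illustrative normalized records (timestamps in epoch seconds).
vpn = [{"user": "vendor-acme-01", "ts": 1000}]
jump = [{"user": "vendor-acme-01", "jump_host": "ot-jump-a", "ts": 1300}]
targets = [{"src_host": "ot-jump-a", "dst_asset": "historian-01", "ts": 1500}]
print(correlate_ot_access(vpn, jump, targets))
```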
Add operational context to reduce false positives
A common OT failure is alert fatigue caused by:
- planned maintenance,
- vendor support sessions,
- shift changes,
- engineering downloads.
Fix this with:
- maintenance window feeds (even if manual at first),
- change ticket references,
- allowlists for approved engineering stations,
- time-based expectations.
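A minimal sketch of maintenance-window enrichment, assuming the window feed starts as a manually maintained list; zone names and ticket IDs are illustrative.

```python
from datetime import datetime, timezone

# Manually maintained maintenance window feed (illustrative).
MAINTENANCE_WINDOWS = [
    {"zone": "Site A / Line 2", "ticket": "CHG-1042",
     "start": datetime(2025, 3, 2, 22, 0, tzinfo=timezone.utc),
     "end":   datetime(2025, 3, 3, 4, 0, tzinfo=timezone.utc)},
]

def enrich_alert(alert):
    """Mark the alert as expected if it falls inside an approved window for its zone."""
    for w in MAINTENANCE_WINDOWS:
        if alert["zone"] == w["zone"] and w["start"] <= alert["ts"] <= w["end"]:
            alert["expected"] = True
            alert["change_ticket"] = w["ticket"]
            return alert
    alert["expected"] = False  # route for immediate triage instead of suppression
    return alert

alert = {"rule": "controller write from non-EWS host", "zone": "Site A / Line 2",
         "ts": datetime(2025, 3, 2, 23, 30, tzinfo=timezone.utc)}
print(enrich_alert(alert)["expected"])  # True -> lower priority, but still logged
```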
Validate detection improvements with “replay”
After the incident, simulate:
- could we have detected the path earlier with the new telemetry?
- do we now get fewer alerts, but higher confidence?
- does the alert include asset role, zone, and recommended containment?
Architecture and segmentation upgrades (zones & conduits)
Segmentation is not a poster. It’s an enforceable set of conduits that limit blast radius.
Use incidents to identify “unnecessary trust”
After an incident, list:
- which zones communicated that shouldn’t,
- which protocols crossed boundaries that didn’t need to,
- which systems had access because “it was easier.”
Then convert into:
- updated conduit allowlists,
- protocol restrictions,
- and monitored exceptions with expiry.
A practical post-incident segmentation improvement plan
Phase 1 (fast): tighten conduits without moving assets
- block SMB/RDP where not required
- restrict inbound admin protocols to jump hosts only
- implement temporary deny rules during heightened monitoring windows
Phase 2 (medium): isolate high-risk services
- separate file transfer services
- separate vendor access brokers
- separate patch staging
- reduce shared infrastructure between IT and OT
Phase 3 (longer): restructure zones based on consequence
- separate safety-relevant areas
- segment by process cell/line
- implement true DMZ buffering (no direct IT-to-L3 routes)
Don’t forget “segmentation drift”
Your incident may have happened because:
- a temporary rule became permanent,
- a vendor demanded broad access,
- a troubleshooting change wasn’t rolled back.
Implement:
- quarterly conduit reviews,
- rule expiry,
- alerts on new talkers and new cross-zone flows,
- a lightweight exception register.
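A minimal sketch of a lightweight exception register with expiry and aging checks, suitable as input to a quarterly conduit review; entries and field names are illustrative.

```python
from datetime import date

# Lightweight exception register (illustrative entries).
EXCEPTIONS = [
    {"id": "EX-017", "rule": "Allow RDP vendor-laptop -> hist-01", "owner": "Site A OT",
     "granted": date(2024, 11, 5), "expires": date(2024, 12, 5)},
    {"id": "EX-021", "rule": "Allow SMB patch-stage -> ews-02", "owner": "Site B OT",
     "granted": date(2025, 1, 10), "expires": date(2025, 4, 10)},
]

def review_exceptions(today=None, aging_threshold_days=90):
    """Flag expired rules and 'temporary' exceptions that have quietly aged."""
    today = today or date.today()
    for ex in EXCEPTIONS:
        age = (today - ex["granted"]).days
        if ex["expires"] < today:
            print(f"{ex['id']}: EXPIRED on {ex['expires']}; remove or re-approve ({ex['owner']})")
        elif age > aging_threshold_days:
            print(f"{ex['id']}: temporary exception is {age} days old; confirm still needed")

review_exceptions(today=date(2025, 3, 1))
```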
Identity, remote access, and vendor pathway hardening
If segmentation is the “walls,” identity is the “keys.” Many OT incidents are key failures.
Post-incident identity improvements that reduce re-entry risk
- Rotate credentials used during the incident response (especially privileged)
- Remove shared accounts or strictly govern them with check-in/check-out
- Enforce MFA for OT access paths
- Reduce standing privileges (use time-bound access where possible)
- Audit privileged group membership and service accounts
Vendor access: design for support without permanent risk
A mature vendor access model includes:
- per-vendor named accounts,
- MFA and device restrictions,
- per-session approval,
- time windows,
- target allowlists (which systems they may access),
- session logging/recording,
- quarterly review of vendor accounts and usage.
Remote access acceptance criteria (what “fixed” looks like)
You can declare vendor/remote access hardening “done” when:
- every OT-reaching login is associated with a named identity,
- MFA is enforced,
- access is through a controlled jump host,
- sessions are logged (and ideally recorded),
- and emergency access has a documented approval flow.
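A minimal sketch of checking these acceptance criteria against exported access records; the record shape is an assumption, not a specific tool's log format.

```python
def access_gaps(login_record):
    """Return which acceptance criteria a single OT-reaching login fails."""
    gaps = []
    if not login_record.get("username") or login_record.get("shared_account"):
        gaps.append("named identity")
    if not login_record.get("mfa"):
        gaps.append("MFA")
    if login_record.get("entry_point") != "jump-host":
        gaps.append("jump host")
    if not login_record.get("session_log_ref"):
        gaps.append("session logged")
    return gaps

# Illustrative record pulled from remote access logs.
record = {"username": "vendor-acme-01", "shared_account": False, "mfa": True,
          "entry_point": "direct-vpn", "session_log_ref": None}
print(access_gaps(record))  # ['jump host', 'session logged'] -> not yet "done"
```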
Backup, restore, and recovery improvements that reduce downtime
Many OT incident “lessons learned” reveal that recovery was slow not because the malware was advanced, but because recovery was underprepared.
What to improve after an OT incident (in priority order)
1) Backup scope: know what you must restore
Identify and protect:
- jump hosts and remote access brokers,
- historians and key OT servers,
- engineering workstation images and project repositories,
- configuration backups for network devices and firewalls,
- critical recipes, batch data, and HMI configurations.
2) Backup protection: isolate backups from ransomware
Protect backups with:
- offline copies,
- immutable storage,
- separate admin credentials,
- restricted network reachability.
3) Restore sequencing: document the order
OT recovery order often matters more than raw restore speed.
Document:
- dependencies (identity, DNS, time sync),
- licensing requirements,
- vendor install media sources,
- validation steps.
4) Restore testing: prove it quarterly (at least top-tier)
Define success criteria:
- system boots,
- application runs,
- required communications flow,
- operators can view/control as expected,
- logs and monitoring are restored.
The “golden image” advantage
Having known-good images for:
- jump hosts,
- engineering workstations,
- critical OT servers,
turns rebuilds from days into hours—especially during ransomware events.
Process improvements: change control, training, exercises, and communications
Technology fixes reduce risk, but process fixes prevent chaos during the next incident.
Change control improvements that reduce false positives and mistakes
- Require change tickets for controller downloads and network conduit changes
- Add maintenance window schedules into SOC workflows
- Establish a quick “is this expected?” hotline between SOC and OT
Runbook upgrades (what should be added after incidents)
Every incident should update runbooks with:
- the exact log sources used,
- the best containment actions that worked,
- approvals required,
- rollback steps,
- and “do not do” guardrails.
Training that matters: role-based and scenario-based
Effective training targets:
- SOC: OT concepts, zones/cells, consequence-based severity
- OT engineers: evidence preservation, safe containment, remote access risks
- Operations: incident communication, manual mode procedures, escalation
- Leadership: decision criteria for shutdown, external communications
Tabletop exercises: measure, don’t just discuss
After each tabletop, record:
- time to declare an OT incident,
- time to identify the pivot path,
- time to contain at the boundary,
- time to produce a credible scope statement,
- and what information was missing.
Turn gaps into backlog items.
Metrics that prove improvement (and don’t incentivize bad behavior)
Poor metrics drive poor behavior (like closing incidents early to improve MTTR). OT metrics must reflect safe outcomes.
Core metrics for OT lessons learned programs
- Recurrence rate: same entry vector within 90 days
- Time to validate operational context: expected vs unexpected activity
- Time to scope: identify affected zones/assets with confidence
- Time to contain at boundary: remote access/conduit action time
- Time to restore safe operations: not just “systems up”
- Evidence completeness: % incidents with minimum evidence package captured
- Action closure rate: % corrective actions closed on time
- Exception aging: number of “temporary” exceptions older than 60/90 days
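A minimal sketch of two of these metrics (recurrence rate and on-time action closure), assuming incidents and corrective actions are tracked as simple records:

```python
from datetime import date

def recurrence_rate(incidents, window_days=90):
    """Share of incidents whose entry vector repeated within the window."""
    repeats = 0
    for i, inc in enumerate(incidents):
        for later in incidents[i + 1:]:
            if (later["entry_vector"] == inc["entry_vector"]
                    and (later["date"] - inc["date"]).days <= window_days):
                repeats += 1
                break
    return repeats / len(incidents) if incidents else 0.0

def on_time_closure_rate(actions):
    """Share of closed corrective actions that closed on or before their due date."""
    closed = [a for a in actions if a.get("closed_on")]
    on_time = [a for a in closed if a["closed_on"] <= a["due"]]
    return len(on_time) / len(closed) if closed else 0.0

incidents = [
    {"entry_vector": "vendor VPN", "date": date(2025, 1, 5)},
    {"entry_vector": "vendor VPN", "date": date(2025, 2, 20)},
    {"entry_vector": "phishing -> IT", "date": date(2025, 2, 1)},
]
actions = [{"due": date(2025, 3, 1), "closed_on": date(2025, 2, 25)},
           {"due": date(2025, 3, 1), "closed_on": date(2025, 3, 10)}]
print(recurrence_rate(incidents), on_time_closure_rate(actions))  # ~0.33, 0.5
```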
A simple “improvement scorecard”
Track quarterly:
- top 5 incident themes,
- top 10 corrective actions completed,
- top 5 detection improvements delivered,
- segmentation drift metrics,
- restore test success rate.
This gives executives a clear view of progress and helps secure funding.
Templates: agenda, report structure, action register, and executive brief
Use these templates to standardize across sites.
Template 1: Hotwash agenda (60–90 minutes)
- Purpose & rules: no blame, focus on facts
- Timeline recap: what happened, when
- What worked: detections, comms, containment, recovery steps
- What didn’t: delays, missing info, tooling gaps
- Decisions log: major decisions and outcomes
- Immediate fixes (0–14 days): pick owners now
- Evidence and follow-ups: who will complete deep dives, by when
Template 2: Root cause review agenda (2–3 hours)
- Confirm timeline and scope
- Identify primary root causes + contributing factors
- Map control gaps to corrective actions
- Prioritize by consequence and feasibility
- Assign owners, due dates, and acceptance criteria
- Confirm funding or escalation path
- Schedule verification and closure review
Template 3: Lessons learned report structure (reader-friendly)
- Executive summary (impact, status, key fixes)
- Incident overview (scope, affected assets by role/zone)
- Timeline (timezone, key events, decision points)
- Root causes and contributing factors
- What went well / what needs improvement
- Corrective action plan (categorized by time horizon)
- Detection and monitoring updates
- Architecture and identity changes
- Recovery and validation improvements
- Appendix: evidence register and references
Template 4: Corrective action register (copy/paste table)
Use a table like this:
| ID | Action | Owner | Scope (Site/Zone) | Priority | Due date | Change window needed | Acceptance criteria | Status |
|---|---|---|---|---|---|---|---|---|
| CA-001 | Enforce MFA for OT VPN + remove shared accounts | IAM Lead | All sites | P1 | YYYY-MM-DD | No | 100% OT access uses MFA; shared accounts removed or governed | Open |
| CA-002 | Block SMB/RDP DMZ→L3 except allowlist | OT Net Eng | Site A DMZ | P1 | YYYY-MM-DD | Yes | Only approved flows allowed; drift alert enabled | In progress |
| CA-003 | Quarterly restore test for historian | OT Ops | All sites | P2 | YYYY-MM-DD | Yes | Restore completes + functional validation checklist passed | Planned |
Template 5: Executive brief (one page)
- What happened (plain language)
- Impact (safety, downtime, cost)
- Current risk status (contained? eradicated? monitoring window?)
- Top 5 root causes (systemic)
- Top 10 corrective actions (with dates)
- What leadership must decide (funding, outages, vendor terms)
Incident-to-improvement playbooks (common OT scenarios)
Use these to turn specific incidents into concrete improvements.
Scenario A: Ransomware approached OT (DMZ or Level 3 impacted)
Common lessons learned themes
- remote access and identity weaknesses,
- backup exposure,
- OT DMZ acting as a bridge,
- restore sequence unclear.
High-impact improvements
- MFA + session control on all OT access paths
- isolate backups and test restores
- DMZ conduit allowlists; block SMB/RDP broadly
- golden images for jump hosts and key OT servers
- detection of ransomware precursors in the OT DMZ (e.g., backup deletion attempts)
Scenario B: Unauthorized PLC write/download alert (even if false positive)
Common lessons learned themes
- insufficient maintenance window context,
- lack of baselines for controller logic,
- engineering workstation sprawl,
- monitoring lacks protocol operation detail.
High-impact improvements
- baseline “golden” logic/config for critical controllers
- controller write/download detections with asset role context
- limit controller programming rights to approved engineering stations
- add change ticket integration or maintenance window tagging
- improve engineering workstation hardening and rebuild procedure
Scenario C: Vendor remote access abuse or ambiguity
Common lessons learned themes
- vendor accounts shared or unmanaged,
- unclear approvals and after-hours access,
- lack of session logs/recordings,
- overbroad access scope.
High-impact improvements
- per-vendor named accounts + MFA
- per-session approvals and time windows
- target allowlists (which assets vendor can reach)
- session recording + retention policy
- quarterly vendor access review and contract updates
Scenario D: Overreaction caused downtime (containment harmed operations)
Common lessons learned themes
- IT playbooks applied directly to OT,
- unclear authority and approvals,
- no rollback plans for firewall changes.
High-impact improvements
- least-disruptive containment ladder in runbooks
- clear approval matrix and decision log requirement
- emergency rule expiration and rollback procedures
- cross-training SOC on OT operational constraints
- tabletop exercises that simulate “containment vs uptime” tradeoffs
OT lessons learned checklists (copy/paste)
Checklist 1: Hotwash readiness (before the meeting)
- Incident scope defined (sites/zones/assets/time range)
- Initial timeline drafted with timezone
- Evidence locations recorded (SIEM case, OT NDR case, log exports)
- Attendees confirmed (SOC, OT controls, operations, OT network, IAM, vendor as needed)
- Decision log captured (what was approved, by whom)
- Immediate fixes identified (0–14 days)
Checklist 2: Root cause review readiness
- Timeline validated and clock skew noted
- Entry vector hypothesis supported with evidence
- Pivot path documented (remote access → DMZ → L3 → L2 if applicable)
- Asset roles and criticality mapped (what mattered most)
- Recovery steps documented (order and pain points)
- Draft root causes and contributing factors prepared
- Draft corrective actions written with acceptance criteria
Checklist 3: Corrective action quality control
For every action, confirm:
- Specific and testable (not vague)
- Operationally feasible (has a change window plan)
- Has one accountable owner
- Has a due date and milestones
- Includes verification method (how you prove success)
- Includes rollback plan if risky
- Captures site vs fleet scope clearly
Checklist 4: Closure verification (don’t skip)
- Control implemented (configuration or process change)
- Verification completed (test results, logs, screenshots)
- Runbooks updated
- Detections tuned and validated
- Exceptions tracked with expiry
- Metrics updated (baseline vs new performance)
FAQ
What’s the difference between a hotwash and a lessons learned review in OT?
A hotwash is a fast debrief within 24–72 hours to capture facts, decisions, and immediate fixes. The deeper lessons learned review (within 2–4 weeks) focuses on root causes and produces a corrective action plan with owners, deadlines, and verification.
How do you avoid blame-focused postmortems in OT?
Focus on decision points, evidence quality, and systemic control gaps (remote access, conduits, privilege, monitoring, recovery readiness). Use “what conditions made this likely?” instead of “who caused it?”
What OT security improvements usually reduce risk the most?
The biggest repeat winners are: hardening remote access (MFA, jump hosts, session approvals), tightening OT DMZ and conduit allowlists, improving identity and privilege hygiene, protecting and testing backups, and deploying OT-aware detection (controller writes/downloads, new talkers, cross-zone activity).
How do you prioritize OT corrective actions without endless debate?
Score actions by consequence reduction, likelihood reduction, feasibility, time-to-value, and blast radius. Then publish the action register with owners, due dates, and acceptance criteria so decisions become trackable commitments.
How do you prove your lessons learned program is working?
Track recurrence rate within 90 days, time to validate operational impact, time to contain at the boundary, restore test success rate, and on-time closure rate for corrective actions—along with reduced exception aging and better incident context completeness.
