
OT Forensics: Investigating Incidents Without Stopping Production

OT forensics is the practice of investigating cyber incidents in industrial environments while protecting safety and availability. The safest approach is network-first: capture and analyze traffic at zone conduits (IT/OT boundary, OT DMZ, Level 3-to-Level 2) and correlate it with identity and remote-access logs. For endpoint and controller investigation, use least-disruptive methods—collect logs, configuration snapshots, and engineering project metadata before any reboots or scans. A good OT forensics workflow preserves chain of custody, avoids active probing during production, and focuses on high-value evidence like jump-host sessions, firewall logs, protocol operations (reads vs writes vs downloads), and controller configuration integrity checks.

Why OT forensics is different (and why it’s harder)

OT forensics is not “IT forensics in a plant.” The constraints, evidence types, and risks are fundamentally different.

OT has physical consequences

In IT, a wrong action can cause downtime. In OT, a wrong action can cause:

  • loss of visibility for operators,
  • unexpected process state changes,
  • safety trips (or failure to trip when needed),
  • production losses that cascade for days.

That’s why OT investigation must prioritize least-disruptive evidence collection and require operational approval for actions that could affect the process.

Visibility is uneven

Industrial environments often have:

  • limited endpoint agents (EDR may be unavailable on legacy systems),
  • proprietary protocols and devices,
  • incomplete logs (PLC logs may be sparse; safety systems may be vendor-locked),
  • segmented networks with “air gaps” that aren’t actually air gaps.

So OT forensics often relies on network and identity telemetry more than disk images.

The “asset zoo” makes assumptions dangerous

OT has an unusual mix:

  • Windows HMIs, historians, jump hosts, and app servers,
  • embedded devices (PLCs, RTUs, gateways),
  • safety controllers and DCS components,
  • vendor laptops and temporary engineering stations.

A tool or action that is safe on one host type can be risky on another. Forensics procedures must be asset-role-aware.


What “forensics without stopping production” really means

The phrase sounds like a promise. In reality, it’s a discipline.

It does not mean “no impact ever”

It means you investigate in a way that:

  • avoids introducing new process risk,
  • limits changes to the environment,
  • prefers passive and reversible methods,
  • escalates to invasive steps only when the operational consequence is acceptable.

The operational goal: maintain safe operations while reducing uncertainty

OT forensics is successful when it helps the organization:

  • confirm whether an incident is real,
  • understand scope and entry path,
  • contain without overreacting,
  • validate integrity during recovery,
  • prevent recurrence.

The job is as much about decision support as it is about technical artifacts.


OT forensics goals: the questions you must answer

Most OT investigations fail because teams collect “everything” and answer nothing. Start with clear questions.

The core questions (in order)

  1. What happened?
    The observed events, mapped to a timeline.
  2. How did it happen? (initial access + pivot path)
    Remote access abuse? DMZ pivot? Compromised engineering workstation?
  3. What is affected? (scope)
    Which sites/zones/cells, which hosts, which credentials, which services.
  4. What did the attacker do? (actions on objectives)
    Lateral movement, credential theft, data exfiltration, controller interactions.
  5. Is the process integrity impacted?
    Any controller writes, logic downloads, configuration drift, setpoint changes.
  6. Is the attacker still present?
    Persistence mechanisms, backdoors, active sessions, stolen credentials.
  7. What must we fix so it doesn’t happen again?
    Remote access controls, segmentation, identity hardening, monitoring gaps.

OT-specific “must answer” items

In OT, you often need to answer extra questions:

  • Did anything touch Level 2 controller networks?
  • Were there any controller write/download operations?
  • Did we lose visibility because monitoring/logging was weak?
  • If we restore, how do we prove we restored correctly?


Principles of OT-safe investigation

These principles should be written into your OT IR/forensics SOP and reinforced in exercises.

1) Network-first, endpoint-second, controller-last

Start where evidence is high value and low risk:

  1. boundary and conduit logs/PCAP
  2. OT DMZ / Level 3 Windows servers
  3. engineering workstations and jump hosts
  4. controllers/safety systems (with OEM guidance)

2) Minimize change; maximize observability

Forensics should not be a matter of “trying things.” It should be:

  • capture evidence,
  • validate hypotheses,
  • contain safely,
  • preserve chain of custody.

3) Use the least disruptive method that answers the question

If firewall logs answer “who talked to whom,” don’t scan the subnet to find it.

4) Make time a first-class artifact

A single, accurate timeline across:

  • firewall events,
  • VPN/jump host sessions,
  • Windows logs,
  • OT protocol operations,

is often the most valuable forensic output.
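
As a concrete illustration, here is a minimal Python sketch that merges already-parsed events from several sources into one sorted timeline. The event records, hostnames, and addresses are hypothetical; in practice they would come from your firewall, VPN, Windows, and OT NDR exports.

```python
from datetime import datetime, timezone

# Hypothetical pre-parsed events from different evidence sources, each
# normalized to ISO-8601 timestamps with an explicit UTC offset.
events = [
    {"ts": "2025-12-20T02:14:05+00:00", "source": "vpn",      "detail": "user jsmith authenticated"},
    {"ts": "2025-12-20T02:15:40+00:00", "source": "jumphost", "detail": "RDP session to OT-DMZ-APP01"},
    {"ts": "2025-12-20T02:22:11+00:00", "source": "firewall", "detail": "allow 10.20.5.14 -> 172.16.8.10:502"},
    {"ts": "2025-12-20T02:22:30+00:00", "source": "ot_ndr",   "detail": "Modbus write to PLC-LINE3"},
]

def to_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalize it to UTC."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc)

# One sorted timeline across all sources is the core deliverable.
for event in sorted(events, key=lambda e: to_utc(e["ts"])):
    print(f'{to_utc(event["ts"]).isoformat()}  [{event["source"]:8}]  {event["detail"]}')
```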

5) Separate “reliability anomalies” from “security events”

OT networks have quirks: flaky links, retransmits, malformed frames. Distinguish:

  • network quality issues (route to the OT networking/reliability team), from
  • malicious behavior (route to incident response).

The OT forensics lifecycle: from alert to evidence-backed conclusions

A practical lifecycle keeps the investigation moving without forcing disruptive actions.

Phase 1: Intake and triage

Objective: confirm the alert represents a meaningful event and decide investigation scope.

Deliverables:

  • incident ticket with site/zone/asset role
  • initial hypothesis (e.g., “remote access abuse into OT DMZ”)
  • decision on containment priority (remote access, conduit firewall, host isolation)

Phase 2: Rapid scoping (passive sources first)

Objective: determine whether the event is isolated or spreading.

Deliverables:

  • list of affected hosts/accounts
  • list of impacted conduits (IT/OT boundary, DMZ, L3-to-L2)
  • evidence package v1 (logs + PCAP references + screenshots)

Phase 3: Deep investigation

Objective: establish entry vector, attacker path, and actions.

Deliverables:

  • reconstructed timeline
  • intrusion path diagram (textual or visual)
  • persistence findings
  • integrity findings (especially for engineering and controller interactions)

Phase 4: Recovery validation support

Objective: prove that restored systems are clean and the process is trustworthy.

Deliverables:

  • post-rebuild validation checklist results
  • controller config/logic integrity confirmation (where applicable)
  • “heightened monitoring window” parameters

Phase 5: Lessons learned and hardening requirements

Objective: convert findings into durable controls and detection improvements.

Deliverables:

  • root cause statement (technical + process)
  • prioritized remediation list
  • detection gaps and tuning plan

Evidence sources that don’t disrupt production (best-first)

When you want “forensics without downtime,” these are your highest leverage sources.

Tier 1 (best): Choke-point evidence

These typically provide broad scope with minimal risk:

  • IT/OT boundary firewall logs (allow/deny, NAT, rule hits)
  • OT DMZ firewall logs (DMZ ↔ Level 3 conduits)
  • VPN logs (auth, device posture, IP assignments)
  • Jump host / bastion logs (session start/stop, user, target, recordings)
  • Remote access gateways (session approvals, vendor access metadata)
  • NetFlow/IPFIX at key routers (who talked to whom and when)

Tier 2: Passive network visibility inside OT

  • OT NDR/IDS alerts with evidence (PCAP snippets, protocol decode)
  • SPAN/TAP captures at L3 and select L2 conduits
  • switch port mirroring for affected cells (with careful capacity planning)

Tier 3: Host artifacts on OT-adjacent Windows systems

  • Windows Event Logs (Security, System, PowerShell, Task Scheduler)
  • EDR telemetry (if deployed and OT-approved)
  • application logs (historian, SCADA server, remote support tools)
  • file share access logs (if enabled)

Tier 4 (highest risk): Direct device/controller interrogation

  • PLC/DCS project upload/compare
  • configuration exports
  • controller mode and program changes
  • safety system diagnostics (vendor-dependent)

Rule: Tier 4 steps typically require OT engineering involvement and often OEM guidance.


Network-first OT forensics: how to investigate using conduits

If you can only build one OT forensics capability, build this one. Conduits provide the safest, most scalable insight.

Step 1: Start at the boundary: “Did anything try to cross into OT?”

At the IT/OT boundary and OT DMZ conduits, look for:

  • new inbound sessions to OT DMZ jump hosts
  • unusual protocols (SMB/RDP/WinRM) crossing boundaries
  • spikes in denied traffic (scanning, brute force)
  • new destinations in Level 3 address ranges
  • unexpected egress from OT DMZ to the internet

Outcome: a quick “approach map” that tells you whether OT is being targeted or already affected.
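
A minimal sketch of this triage, assuming the boundary firewall log has been exported to CSV. The column names (timestamp, action, src_ip, dst_ip, dst_port), the file name, and the address ranges are hypothetical placeholders; adjust them to your own export format and address plan.

```python
import csv
import ipaddress

# Assumed address plan and "risky" service ports; adjust to your environment.
IT_NETS = [ipaddress.ip_network("10.0.0.0/8")]
OT_NETS = [ipaddress.ip_network("172.16.0.0/16")]   # OT DMZ / Level 3 ranges
RISKY_PORTS = {445, 3389, 5985, 5986}               # SMB, RDP, WinRM

def in_any(ip: str, nets) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)

# Assumes a firewall log exported as CSV with hypothetical column names.
with open("boundary_fw_export.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        crosses = in_any(row["src_ip"], IT_NETS) and in_any(row["dst_ip"], OT_NETS)
        risky = int(row["dst_port"]) in RISKY_PORTS
        if crosses and (risky or row["action"].lower() == "deny"):
            print(row["timestamp"], row["action"], row["src_ip"], "->",
                  f'{row["dst_ip"]}:{row["dst_port"]}')
```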

Step 2: Pivot on identity: “Who logged in, from where, to what?”

In OT investigations, identity is often the pivot:

  • VPN user → jump host session → target server
  • vendor account → remote tool session → engineering workstation
  • service account → lateral movement across OT servers

Correlate:

  • authentication logs (success + failures),
  • session recording metadata,
  • privileged group membership changes.

Outcome: you can often pinpoint the entry vector without touching production networks.
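
A minimal correlation sketch, assuming VPN and jump-host records have already been parsed into simple structures. The account names, hostnames, and the two-hour linking window are illustrative assumptions, not fixed values.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed records (UTC). Real inputs would be VPN and
# jump-host/bastion log exports.
vpn_logins = [
    {"user": "vendor-acme", "ts": datetime(2025, 12, 20, 2, 10), "src_ip": "203.0.113.50"},
]
jump_sessions = [
    {"user": "vendor-acme", "start": datetime(2025, 12, 20, 2, 14), "target": "EWS-LINE3"},
    {"user": "svc-hist",    "start": datetime(2025, 12, 20, 3, 5),  "target": "HISTORIAN01"},
]

WINDOW = timedelta(hours=2)  # how long after a VPN login a session is considered linked

# Correlate: VPN login -> jump-host session -> OT target.
for login in vpn_logins:
    for sess in jump_sessions:
        if sess["user"] == login["user"] and timedelta(0) <= sess["start"] - login["ts"] <= WINDOW:
            print(f'{login["user"]} from {login["src_ip"]} '
                  f'-> jump session at {sess["start"].isoformat()} -> {sess["target"]}')
```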

Step 3: Validate with protocol operations (OT-aware decoding)

Generic “port 502 traffic exists” is not enough. For OT safety and integrity, distinguish:

  • read vs write operations,
  • program download events,
  • controller mode changes,
  • unauthorized connections to engineering services.

Even when ransomware is the headline, these findings answer the core OT question:
“Did anyone touch control?”
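
As one concrete example, the sketch below uses scapy to classify Modbus/TCP operations in a conduit capture as reads or writes by function code. It is illustrative only: the capture filename is hypothetical, and real OT NDR tools decode many vendor protocols beyond Modbus.

```python
# Requires scapy (pip install scapy). A minimal read-vs-write classifier for
# Modbus/TCP only; proprietary protocols need vendor-aware decoders.
from scapy.all import rdpcap, IP, TCP, Raw

READ_CODES  = {1, 2, 3, 4}      # read coils / discrete inputs / registers
WRITE_CODES = {5, 6, 15, 16}    # write single/multiple coils and registers

for pkt in rdpcap("conduit_capture.pcap"):   # hypothetical capture file
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt[TCP].dport == 502 and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if len(payload) < 8:
            continue
        func = payload[7]    # Modbus function code follows the 7-byte MBAP header
        if func in WRITE_CODES:
            print(f"WRITE  {pkt[IP].src} -> {pkt[IP].dst}  function code {func}")
```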

Step 4: Build a timeline that aligns across systems

OT environments often have time drift. During investigations:

  • normalize timestamps to a single timezone (document it)
  • record known clock skew (e.g., jump host is 4 minutes ahead)
  • cross-check with time sources (NTP, domain controllers)

Deliverable: a timeline that can withstand executive review, legal review, and engineering scrutiny.
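
A minimal normalization sketch, assuming you have measured each source's clock skew against a trusted reference. The source names, offsets, and example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Documented clock skew per evidence source, measured against a trusted
# reference (e.g., the domain controller). Positive = that source runs ahead.
KNOWN_SKEW = {
    "jumphost01": timedelta(minutes=4),   # example: jump host is 4 minutes ahead
    "fw-boundary": timedelta(seconds=0),
}

def normalize(source: str, local_ts: str, tz_offset_hours: int) -> datetime:
    """Convert a source-local timestamp to UTC and remove its known skew."""
    naive = datetime.fromisoformat(local_ts)
    aware = naive.replace(tzinfo=timezone(timedelta(hours=tz_offset_hours)))
    return aware.astimezone(timezone.utc) - KNOWN_SKEW.get(source, timedelta(0))

# Example: a jump host event logged in local time (UTC+1) with 4 minutes of skew.
print(normalize("jumphost01", "2025-12-20 03:19:40", 1).isoformat())
# -> 2025-12-20T02:15:40+00:00
```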

Step 5: Decide containment based on paths, not panic

Network-first forensics supports targeted containment like:

  • terminating specific remote sessions,
  • blocking a single source-to-target pair,
  • tightening a conduit temporarily.

This is how you investigate and contain without shutting down a line.


Endpoint forensics in OT: HMIs, historians, jump hosts, engineering workstations

Endpoints matter in OT—especially Windows systems—but the approach must be cautious.

The OT endpoint priority order (why it’s not “all endpoints”)

In many incidents, the most valuable endpoints are:

  1. Jump hosts/bastions (they capture who accessed OT and when)
  2. OT DMZ servers (file transfer, patch staging, remote access brokers)
  3. Engineering workstations (high privilege, control changes)
  4. Historians/SCADA servers (visibility and operational continuity)
  5. Operator HMIs (important but often numerous and fragile)

What to collect from Windows hosts (low-disruption first)

Prefer artifact collection that does not require rebooting or heavy scanning:

  • Security log: logon events, privilege use, account changes
  • PowerShell logs (if enabled): script execution and command history
  • Task Scheduler: suspicious scheduled tasks
  • Services: new or modified services
  • Autoruns/persistence locations
  • RDP logs: inbound connections, session durations
  • Remote tool logs (AnyDesk, TeamViewer, vendor tools) where present
  • File system metadata for suspicious binaries (hash, timestamps)
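
A minimal collection sketch for the Windows artifacts above, using wevtutil (built into Windows) to export the Security log without a reboot and hashing the result for the evidence register. The destination path and host name are hypothetical.

```python
import hashlib
import subprocess
from datetime import datetime, timezone

# Hypothetical evidence destination; prefer a collector-controlled location
# over the suspect host's own disk when feasible.
dest = r"E:\evidence\OT-INC-2025-12-20\JUMPHOST01_Security.evtx"

# wevtutil epl (export-log) ships with Windows and does not require a reboot.
subprocess.run(["wevtutil", "epl", "Security", dest], check=True)

# Hash immediately so the evidence register entry can reference it.
sha256 = hashlib.sha256()
with open(dest, "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        sha256.update(chunk)

print(dest)
print("sha256:", sha256.hexdigest())
print("collected_utc:", datetime.now(timezone.utc).isoformat())
```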

If EDR exists and is OT-approved, use it to:

  • pull process trees,
  • capture memory (when safe),
  • isolate only if OT approves and the host is not critical to live operations.

Imaging vs triage collection: what’s realistic in OT?

Full disk imaging is the gold standard in IT forensics, but OT constraints often make it impractical.

A pragmatic approach:

  • Triage first (logs + targeted artifact collection)
  • Image selectively (jump hosts, OT DMZ servers, engineering workstations)
  • Avoid imaging on hosts that cannot tolerate performance impact during production

If you must image, schedule during a safe window or use a standby/replica approach.

The engineering workstation (EWS) special case

Treat the EWS as “Tier 0” for OT integrity.

Collect:

  • engineering software logs (project open/download events if available)
  • project file hashes and last modified times
  • USB usage history (common infection vector)
  • recent network connections to controllers
  • local admin group changes
  • any remote access tool activity

Do not “clean it and put it back” if compromise is credible. Rebuild from a known-good image and restore projects from verified backups.
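
To support later comparison against a known-good copy, here is a minimal sketch that inventories an engineering project folder with SHA-256 hashes and modification times. The project path and output filename are hypothetical.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

PROJECT_DIR = Path(r"D:\Projects\Line3")          # hypothetical EWS project folder
OUT_CSV = Path("ews_line3_project_inventory.csv")

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with OUT_CSV.open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["relative_path", "sha256", "modified_utc", "size_bytes"])
    for path in sorted(PROJECT_DIR.rglob("*")):
        if path.is_file():
            stat = path.stat()
            writer.writerow([
                path.relative_to(PROJECT_DIR).as_posix(),
                sha256_file(path),
                datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                stat.st_size,
            ])
```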


PLC/DCS/Safety forensics: what you can collect safely

Controller forensics is possible, but it’s not like endpoint forensics. Evidence is often sparse and vendor-specific, and the safety consequences of mistakes can be severe.

What “PLC forensics” usually means in practice

Most organizations focus on integrity verification rather than “disk artifacts,” because PLCs don’t store evidence the same way PCs do.

Typical safe objectives:

  • confirm current logic/config matches known-good
  • identify last program download time (if available)
  • identify which engineering station performed changes (if logs exist)
  • verify controller mode (run/program) history (vendor-dependent)

Safe controller evidence collection (common patterns)

With OT engineering involvement:

  • export or snapshot controller configuration (read-only)
  • compare running program to a baseline (golden version)
  • capture controller metadata: firmware version, module list, mode state
  • export alarm/event logs from SCADA/DCS that reflect control changes
  • validate setpoints and critical parameters against approved values
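
Where the vendor tool can export configuration or logic as text, a baseline comparison can be as simple as the sketch below. The file paths are hypothetical, and export formats are vendor-specific, so treat this as the generic pattern rather than a universal procedure.

```python
import difflib
from pathlib import Path

# Assumes the engineering tool can export configuration/logic as text.
baseline = Path("golden/plc_line3_config.txt").read_text().splitlines()
current  = Path("export/plc_line3_config_2025-12-20.txt").read_text().splitlines()

diff = list(difflib.unified_diff(baseline, current,
                                 fromfile="golden_baseline",
                                 tofile="current_export",
                                 lineterm=""))
if diff:
    print("\n".join(diff))   # drift to review with OT engineering
else:
    print("No differences between current export and golden baseline.")
```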

Safety systems require OEM-led discipline

Safety PLCs and SIS components often require:

  • strict change control,
  • vendor procedures for diagnostics,
  • documented integrity checks.

In incident conditions, the goal is not to “poke around.” The goal is to:

  • verify safety functionality remains intact,
  • preserve evidence through approved methods,
  • avoid actions that invalidate certifications or safety cases.

Active techniques: what to avoid and how to do them safely

Active probing is where OT investigations go wrong. You can still do active techniques—but only with safeguards.

Avoid during production (unless safety demands it)

  • vulnerability scanning across Level 2 networks
  • aggressive port scanning and service enumeration
  • packet flooding tests
  • ad-hoc agent deployments
  • rebooting OT servers “to see what happens”
  • changing firewall rules without rollback plans

If you must do active checks, use OT-safe constraints

Design rules:

  • scope to a single host or small allowlist, not subnets
  • rate-limit aggressively
  • run during a maintenance window when possible
  • coordinate with controls and operations
  • monitor process indicators during the activity
  • keep rollback ready

Prefer “passive discovery” and “controlled verification”

Instead of scanning, use:

  • switch CAM tables and ARP caches,
  • firewall session tables,
  • OT NDR passive asset discovery,
  • change management records and CMDB.
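
A minimal passive-discovery sketch that reads the local ARP cache (no probes are sent) and extracts IP/MAC pairs for review. Output formats differ across Windows and Linux where the arp utility is available, so the parsing here is deliberately best-effort.

```python
import re
import subprocess

# Reads the local ARP cache only; it does not send packets onto the network.
raw = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout

pair_re = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3}).*?((?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2})"
)
for ip, mac in pair_re.findall(raw):
    print(f"{ip:<15}  {mac.lower()}")
```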

Chain of custody in OT: simple, defensible, and practical

You don’t need courtroom-level formality for every event, but you do need defensibility—especially when incidents impact production, vendors, or regulatory reporting.

The minimum viable chain-of-custody record

For each evidence item:

  • Unique ID: e.g., OT-INC-2025-12-20-EV-001
  • Collector: name + role
  • Date/time collected: include timezone
  • Source: hostname/IP, system role, location/site
  • Method: export tool, command, screenshot, log pull, PCAP
  • Hash: SHA-256 for files/images when feasible
  • Storage location: case vault path, access controls
  • Transfers/access: who accessed it later and why
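
A minimal sketch of an evidence-register entry with these fields, written as JSON lines into the case vault. The function name, paths, and ID format are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def register_evidence(item_path: str, item_id: str, collector: str,
                      source: str, method: str, vault_dir: str) -> dict:
    """Create a minimal chain-of-custody record for one evidence item."""
    data = Path(item_path).read_bytes()
    record = {
        "id": item_id,                                 # e.g., OT-INC-2025-12-20-EV-001
        "collector": collector,
        "collected_utc": datetime.now(timezone.utc).isoformat(),
        "source": source,                              # hostname/IP, role, site
        "method": method,                              # export tool, command, PCAP, ...
        "sha256": hashlib.sha256(data).hexdigest(),
        "storage_location": str(Path(vault_dir) / Path(item_path).name),
        "access_log": [],                              # later transfers/access appended here
    }
    # Append to a register file kept alongside the case vault.
    with open(Path(vault_dir) / "evidence_register.jsonl", "a") as reg:
        reg.write(json.dumps(record) + "\n")
    return record
```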

Why chain of custody matters even if you “just want to fix it”

Because later you will need to answer:

  • How do we know this timeline is correct?
  • How do we know the backup was not altered?
  • How do we justify that a vendor session caused a change (or didn’t)?
  • How do we support insurance, legal, or compliance requirements?

Building an OT-ready evidence pipeline (architecture)

Forensics without downtime works best when evidence is already flowing—before the incident.

Reference architecture (vendor-neutral)

Goal: high-fidelity evidence at choke points, with safe access and retention.

Core components

  • Conduit firewalls with logging enabled and forwarded to SIEM
  • VPN and jump host with strong authentication + session logging
  • OT NDR/IDS sensors at:
    • IT/OT boundary,
    • OT DMZ,
    • key L3-to-L2 conduits (selective, high-value)
  • Central log collection with:
    • retention policies suitable for investigations,
    • time sync verification,
    • role-based access control
  • Evidence vault:
    • immutable storage for critical cases,
    • standardized case folders and naming

Where to place sensors (practical guidance)

You don’t need sensors everywhere. You need them where they answer the most questions.

High ROI placements:

  • IT/OT boundary (detect approach and attempted pivots)
  • OT DMZ internal segments (detect lateral movement within DMZ)
  • L3-to-L2 conduits for critical cells/lines (detect new talkers and risky ops)
  • remote access/jump host (identity truth source)

Retention: the common “we can’t investigate” failure mode

OT incidents are often discovered late (over a weekend, or only after slow lateral movement). If logs roll over every 7 days, you’re blind.

Practical targets:

  • boundary/firewall logs: 30–180 days depending on capacity
  • VPN/jump logs: 90–180 days (longer if feasible)
  • OT NDR metadata: 30–90 days minimum
  • PCAP: selective retention (ring buffer) plus on-demand capture for incidents

Playbooks: common OT incident investigations

Below are investigations you can run mostly passively, without stopping production.

Playbook 1: Suspected ransomware near OT

Primary question: is it contained to IT/DMZ, or is OT at risk?

Evidence-first steps

  1. Boundary firewall logs: new sessions from IT to OT DMZ? denied spikes?
  2. VPN/jump logs: suspicious logins, new devices, off-hours access
  3. OT DMZ server telemetry: file encryption patterns, service disruptions
  4. OT NDR: any scanning/new talkers toward L3/L2?
  5. Confirm with operations: any loss of HMI/SCADA visibility?

Containment guidance

  • contain remote access and boundary pathways first
  • protect backups
  • avoid Level 2 disruption unless evidence shows spread

Key output

  • “Approach map” and “blast radius map” with confidence levels

Playbook 2: Unauthorized PLC write or logic download alert

Primary question: did an unauthorized change occur, and from where?

Evidence-first steps

  1. OT NDR decode: confirm write/download operation vs normal reads
  2. Identify source host: is it a known EWS? vendor laptop? unknown node?
  3. Jump host session correlation: did that user session exist at that time?
  4. Controller integrity check: compare running program/config to baseline
  5. Change management check: was there an approved work order?

Containment guidance

  • terminate suspicious sessions and block source-to-controller at conduit firewall
  • do not reboot controllers as a “reset”
  • rebuild compromised engineering endpoints rather than cleaning them

Key output

  • a defensible timeline and “source attribution” at the session level

Playbook 3: Compromised vendor remote access suspicion

Primary question: was vendor access abused, and what did it touch?

Evidence-first steps

  1. Vendor access logs: who, when, from where (device/IP), MFA status
  2. Jump host recordings/session metadata: targets accessed, commands if recorded
  3. Firewall logs: session destinations beyond approved targets?
  4. OT NDR: any controller operations during the session?
  5. Verify with OT: was access scheduled/approved?

Containment guidance

  • time-box and re-approve vendor access per session
  • enforce MFA and restrict vendor accounts to specific targets/hours
  • rotate credentials if compromise suspected

Key output

  • a clear answer to “what the vendor account did and did not do”

Playbook 4: “New device discovered” or suspected rogue asset in a cell

Primary question: is it real and is it risky?

Evidence-first steps

  1. Switchport evidence: MAC address on which port/VLAN?
  2. DHCP logs (if used) or ARP tables for IP mapping
  3. OT NDR fingerprints: device type inference
  4. Communication pattern: who is it talking to? any controllers?
  5. Physical verification: coordinate with OT to locate device safely

Containment guidance

  • avoid aggressive scanning
  • isolate at the switchport if confirmed rogue and safe to do so
  • document as an asset management/control gap if legitimate

Key output

  • identity confidence score and risk-based recommendation

Reporting: how to write findings OT and executives will trust

OT stakeholders and executives judge investigations by clarity and consequence, not by tool output.

Structure your report for decision-making

A good OT forensics report includes:

  1. Executive summary
    • what happened
    • whether production/safety was impacted
    • what was contained and how
    • current risk status (attacker present or removed)
  2. Operational impact section (plain language)
    • affected sites/lines
    • downtime or near-miss impacts
    • safety implications and validation performed
  3. Timeline (single source of truth)
    • timestamps + timezone
    • correlated events across VPN/jump/firewall/EDR/OT NDR
  4. Intrusion path narrative
    • initial access vector
    • pivot points (IT → DMZ → L3 → L2)
    • credentials used and privilege changes
  5. Technical findings and evidence list
    • artifact IDs and storage location
    • hashes where applicable
    • screenshot references
  6. Containment and recovery actions
    • who approved what
    • reversibility and expiration of emergency rules
  7. Root cause and contributing factors
    • technical cause (e.g., stolen credentials, exposed service)
    • process cause (e.g., vendor access not time-boxed, weak segmentation)
  8. Remediation plan
    • prioritized actions with owners and dates
    • compensating controls where patching isn’t possible

Avoid two common reporting failures

  • Tool dumps: raw logs without conclusions
  • Over-certainty: claiming attribution or integrity guarantees without evidence

Use confidence language:

  • Confirmed / Likely / Possible / Unconfirmed

Metrics: proving your OT forensics program works

Your goal isn’t “collect more data.” It’s “answer critical questions faster and safer.”

High-value OT forensics KPIs

  • Time to scope (from alert to affected assets/conduits identified)
  • Time to build an initial timeline (first reliable timeline draft)
  • Time to validate operational context (expected vs unexpected change)
  • % incidents with session-level attribution (user/session → target mapping)
  • % incidents with controller integrity validation when applicable
  • Evidence completeness rate (minimum viable evidence package collected)
  • Production disruption incidents caused by investigation (should trend to zero)
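
A minimal sketch of how two of these KPIs could be computed from milestone timestamps recorded in your ticketing system. The field names and sample records are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records with the milestone timestamps your SOP
# requires responders to record.
incidents = [
    {"alerted": "2025-10-02T06:10", "scoped": "2025-10-02T08:40", "timeline_v1": "2025-10-02T13:05"},
    {"alerted": "2025-11-15T22:30", "scoped": "2025-11-16T01:15", "timeline_v1": "2025-11-16T09:50"},
]

def hours_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

time_to_scope    = [hours_between(i["alerted"], i["scoped"]) for i in incidents]
time_to_timeline = [hours_between(i["alerted"], i["timeline_v1"]) for i in incidents]

print(f"median time to scope:            {median(time_to_scope):.1f} h")
print(f"median time to initial timeline: {median(time_to_timeline):.1f} h")
```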

A practical maturity indicator

A mature OT forensics capability can answer, within hours:

  • whether the incident reached Level 2,
  • whether any controller writes/downloads occurred,
  • which remote sessions were involved,
  • what to contain without shutting down production.

OT forensics checklists (copy/paste)

Checklist A: First 30 minutes (OT-safe forensic triage)

  •  Confirm site/zone/cell and asset roles involved
  •  Preserve boundary firewall logs for the time window
  •  Pull VPN and jump host auth/session logs (export + store)
  •  Capture OT NDR/IDS evidence for the alert (PCAP summary, decode)
  •  Identify suspected source host and user/session
  •  Confirm maintenance window / approved work order status
  •  Decide containment at the least disruptive layer (remote access, conduit firewall)
  •  Start a unified incident timeline with timezone and known clock drift

Checklist B: Minimum viable evidence package (per incident)

  •  Firewall logs: IT/OT boundary + OT DMZ conduits
  •  VPN logs: login attempts, device/IP assignment, MFA events
  •  Jump host logs: session start/stop, target mapping, recording IDs
  •  OT NDR evidence: protocol ops (read/write/download), new talkers, scanning
  •  Windows logs from key servers (jump host, OT DMZ servers, OT AD if present)
  •  EDR timeline exports (if available and safe)
  •  Backup system logs (especially for ransomware suspicion)
  •  Evidence register: IDs, hashes, storage locations, access log

Checklist C: Controller integrity validation (when needed)

  •  Identify controller(s) by role and criticality (PLC vs safety PLC)
  •  Confirm process state with operations (safe to validate now?)
  •  Capture controller metadata (firmware, mode, module list)
  •  Compare running logic/config to baseline (golden version)
  •  Verify setpoints/interlocks relevant to the process area
  •  Document OEM procedures followed and who performed them
  •  Record findings with confidence level and supporting artifacts

Checklist D: “What NOT to do” guardrails

  •  No active scanning in Level 2 during production without OT approval
  •  No reboots of controllers/safety systems without operations plan
  •  No mass isolation/quarantine automation targeting OT assets
  •  No wiping/reimaging before minimum evidence package is captured
  •  No emergency firewall changes without rollback + expiration tracking

FAQ

Can you do digital forensics in OT without taking systems offline?

Often yes—especially by using network-first forensics (firewall logs, VPN/jump host sessions, OT NDR protocol evidence). Deep endpoint imaging and controller-level validation may require maintenance windows, but many investigations can scope and contain incidents without downtime.

What’s the safest starting point for an OT investigation?

Start at choke points: IT/OT boundary and OT DMZ firewall logs, plus VPN/jump host session logs. These provide high-value scope and attribution with minimal operational risk.

Is active scanning safe in OT networks?

Not by default. Active scanning can disrupt fragile devices and time-sensitive communications. If scanning is required, it should be tightly scoped, rate-limited, coordinated with OT, and preferably performed during a maintenance window.

What evidence is most important for OT ransomware investigations?

Session and pathway evidence: VPN logs, jump host session metadata/recordings, boundary firewall logs, and OT DMZ server telemetry—plus backup system logs to confirm whether backup destruction was attempted.

How do you prove controller integrity after an incident?

By comparing running controller logic/configuration against a known-good baseline, validating critical setpoints and interlocks, and documenting OEM-approved procedures. The goal is defensible integrity validation, not “guessing it’s fine.”
