Everything you need to understand how organizations detect, contain, investigate, and recover from cyberattacks — the frameworks, tools, team structures, playbooks, and metrics that define the modern IR discipline.
Starting from zero — no prior experience assumed.
An incident is any event that actually or potentially compromises the confidentiality, integrity, or availability of information or systems. A successful ransomware attack: incident. A phishing email an employee clicked, resulting in stolen credentials: incident. A misconfigured cloud storage bucket that exposed customer data: incident. A failed login attempt: not an incident — just an event. The IR team's first job is always to determine: event or incident?
When a breach happens, every minute of delay costs money. The IBM Cost of a Data Breach report finds that organizations with tested IR plans contain breaches 54% faster and spend significantly less on recovery. Without a plan, security teams reinvent the wheel under extreme pressure — missing evidence, taking the wrong containment steps, failing to notify the right people on time, and violating regulatory notification deadlines.
Reactive IR is what most people picture — responding to an active attack. Proactive IR is the growing discipline of not waiting for the alarm to sound. It includes threat hunting (actively searching for signs of compromise that haven't triggered alerts), purple teaming (simulating attacks to test response), and tabletop exercises (practicing IR decisions in a safe scenario). Mature organizations invest in both equally.
Two frameworks define how most organizations structure their incident response. Understanding both is essential for any security professional.
The US government's gold standard IR framework. Revised in April 2025 to align with NIST CSF 2.0. Structured as four broad phases that are explicitly cyclical — you may loop back to Detection & Analysis mid-containment if new information surfaces. Widely adopted in federal agencies, regulated industries, and enterprise organizations worldwide.
The practitioner-focused framework from SANS Institute — favored by hands-on IR teams for its more granular, step-by-step breakdown. The difference from NIST is largely notational: SANS splits NIST's combined "Containment, Eradication & Recovery" into three distinct phases, making the steps explicit. The underlying concepts are identical.
| Aspect | NIST SP 800-61 | SANS PICERL |
|---|---|---|
| Origin | US Government (NIST) | SANS Institute (private training org) |
| Number of phases | 4 phases | 6 phases |
| C/E/R handling | Combined into one phase | Three separate distinct phases |
| Cyclical nature | Explicitly cyclical — can loop back | Sequential but iterative in practice |
| Best for | Policy documentation, compliance, federal use | Operational teams, hands-on checklists |
| 2025 update | Rev. 3 aligned with NIST CSF 2.0 | Stable — no major recent update |
| Mandatory for | US federal agencies, many regulated sectors | Not mandated — widely adopted voluntarily |
The four NIST phases — what actually happens in each one and what the team produces.
Everything you do before an incident occurs. Build and train the CSIRT (Computer Security Incident Response Team). Write the Incident Response Plan (IRP) and associated playbooks. Deploy and tune detection tools (SIEM, EDR, NDR). Establish communication trees — who calls whom, when, via which channels. Create a jump kit: a pre-staged collection of tools, forensic media, spare hardware, and documentation an analyst can grab and go. Define incident severity classifications so everyone knows what P1 vs. P4 means without debate during the chaos. Conduct tabletop exercises quarterly and full simulations annually. Preparation quality is the single biggest determinant of how well every other phase goes.
An alert fires. The team's first job: is this a real incident or a false positive? If real: what type, what scope, what severity? Detection sources include SIEM alerts, EDR detections, NDR anomalies, user reports ("something weird happened"), threat intelligence matches, and third-party notifications (law enforcement, vendors, security researchers). Analysis involves collecting indicators (IOCs), examining logs and telemetry, determining the attack vector, scoping how many systems are affected, and classifying the incident type and severity. This phase also demands documentation from the first moment — everything observed, every action taken, every decision made, with timestamps. This log becomes the incident record.
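The documentation discipline described above can be as lightweight as an append-only, timestamped incident log. Here is a minimal sketch in Python — the class and field names are illustrative, not taken from any particular case-management tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LogEntry:
    timestamp: str
    analyst: str
    entry_type: str   # "observation", "action", or "decision"
    detail: str

@dataclass
class IncidentRecord:
    incident_id: str
    severity: str
    entries: list = field(default_factory=list)

    def log(self, analyst, entry_type, detail):
        # Every observation, action, and decision gets a UTC timestamp
        # the moment it is recorded -- never reconstructed from memory later.
        ts = datetime.now(timezone.utc).isoformat()
        self.entries.append(LogEntry(ts, analyst, entry_type, detail))

record = IncidentRecord("IR-2025-0042", "P2")
record.log("analyst1", "observation", "EDR alert: suspicious powershell.exe on HOST-17")
record.log("analyst1", "action", "Pulled process tree and network connections from EDR")
```

Even a log this simple becomes the authoritative incident record: the timeline for the after-action report, the evidence trail for legal, and the basis for computing response metrics afterward.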
Containment stops the bleeding: isolate affected endpoints from the network, block malicious IPs and domains at the firewall, disable compromised accounts, revoke stolen credentials, sinkhole C2 domains. Two containment strategies exist — short-term (fast, possibly disruptive, like pulling a server off the network) and long-term (sustainable, allowing business to continue during extended incidents). Eradication removes the threat: delete malware, patch the exploited vulnerability, close the initial access vector, remove persistence mechanisms (scheduled tasks, registry run keys, backdoors). Recovery restores operations: rebuild affected systems from known-clean images or backups, restore data, reconnect systems to the network with enhanced monitoring, and verify normal operation before declaring recovery complete.
Within two weeks of the incident resolving: hold a structured lessons-learned meeting with everyone who was involved. Answer the key questions: What happened? How was it detected? What was done well? What was slow or missed? What tools were missing? What playbooks were inadequate? Produce an After-Action Report (AAR) documenting the full incident timeline, business impact, root cause, and concrete improvement recommendations. Update playbooks based on gaps found. Feed discovered IOCs and TTPs back into detection rules. Use the findings to justify budget requests. This phase is where organizations get better — treating it as a checkbox is how they get breached the same way twice.
SANS splits C/E/R into three explicit steps, which many hands-on IR teams find clearer during active response.
Building the team, plan, tools, and muscle memory before anything bad happens. SANS emphasizes checklists — the Incident Handler's Handbook includes explicit preparation checklists for Windows and Unix environments covering what to document, what tools to have ready, and what policies to establish in advance.
Detecting that something happened and confirming it's a real incident. Key questions: What triggered the alert? Is this a true positive? What systems are involved? How long has the attacker been present (dwell time)? What data may have been accessed? Identify the initial access vector. This maps to the "Detection & Analysis" phase in NIST.
Limit the damage from expanding further. Isolate affected systems, block attacker access paths, preserve evidence before taking action. SANS explicitly distinguishes short-term containment (immediate emergency actions that might be disruptive) from long-term containment (sustainable controls that allow business to continue while the full eradication is being planned).
Find and eliminate everything the attacker left behind: malware, backdoors, modified files, compromised credentials, persistence mechanisms, rogue accounts. Patch the vulnerability that was exploited. Validate that the threat is completely removed before moving to recovery — incomplete eradication is one of the most common causes of re-infection.
Bring systems back online safely. Restore from known-clean backups. Rebuild compromised systems from scratch where trust cannot be re-established. Monitor restored systems intensively for signs of recurrence for 30–90 days. Validate that business functions are working normally before declaring full recovery. Communicate the "all clear" to stakeholders.
SANS explicitly requires the lessons-learned meeting within two weeks of the incident closing. The output is a formal report. SANS recommends comparing response metrics against previous incidents — did MTTD and MTTR improve? Did previously identified gaps get fixed? Were training investments paying off? This structured reflection is what separates maturing IR programs from stagnant ones.
A CSIRT (Computer Security Incident Response Team) is the organizational structure behind incident response. Here are the key roles.
Owns the incident. Makes the final call on containment strategies, escalation decisions, and external communications. Not necessarily the most technical person — the IC must coordinate across teams, manage communications, and keep the response from becoming chaotic. In large incidents, the IC may not touch a keyboard at all.
Directs the technical investigation. Assigns analysis tasks to team members, interprets findings, builds the attack timeline, and recommends containment and eradication actions to the IC. Needs deep knowledge of the kill chain, attacker TTPs, forensics, and the organization's environment.
L1 analysts handle initial triage — validating alerts and escalating real incidents. L2 analysts dig deeper into confirmed incidents. L3 analysts (senior or threat hunters) handle the most complex cases and build detection improvements. During active incidents, multiple analysts are typically assigned in parallel — one on network logs, one on endpoint forensics, one on timeline reconstruction.
Specializes in collecting and analyzing digital evidence in a forensically sound manner — disk images, memory captures, network captures, log preservation. Ensures chain of custody is maintained so evidence is admissible if law enforcement gets involved. May also lead malware analysis and reverse engineering efforts.
Enriches the investigation with external threat context — which threat actor group uses these TTPs, what other victims have been seen, are these IOCs linked to a known campaign. Feeds IOCs into detection tools and updates threat intel platforms. Helps analysts understand what the attacker was trying to achieve and what they might do next.
Manages internal and external communications during the incident — briefing executives (in non-technical language), coordinating with legal and PR teams on breach notification requirements, liaising with law enforcement if needed, and managing vendor communications. During a major breach this person is one of the most critical — poor communication during an incident can cause as much damage as the breach itself.
Ensures the organization meets its legal notification obligations — GDPR requires notification within 72 hours of becoming aware of a breach; many US states have their own deadlines. Advises on evidence preservation requirements and attorney-client privilege considerations for sensitive IR communications. Engages law enforcement if criminal activity is involved.
Not always formally part of the CSIRT but critical participants during response. They know the environment — where the critical servers live, how the network is segmented, what "normal" looks like in the SIEM. They execute containment actions: disabling accounts in AD, isolating VLANs, deploying firewall rules, and rebuilding systems during recovery.
Many organizations maintain a retainer with an external IR firm (CrowdStrike Services, Mandiant, Unit 42, Secureworks) for surge capacity during major incidents. The retainer means pre-negotiated rates, faster engagement, and pre-established data sharing agreements. For smaller organizations, the external firm may be the primary IR capability rather than a supplement.
Modern incident response is tool-intensive. Here's every major category and what role it plays in the response workflow.
Collects and correlates logs from all sources. The central investigation database. Analysts query the SIEM to trace attacker activity across systems, reconstruct attack timelines, and find the scope of compromise.
Deep endpoint telemetry — every process, file, network connection, registry change. During IR, the EDR provides the attack timeline on each endpoint and enables remote isolation, live terminal access, and file quarantine without a physical visit.
Security Orchestration, Automation and Response. Runs automated response playbooks triggered by SIEM alerts or analyst actions — blocking IPs, isolating endpoints, resetting passwords, creating tickets. Reduces response time from hours to seconds for routine actions.
Acquire and analyze disk images, memory dumps, and network captures. Used to recover deleted files, extract malware from memory, reconstruct user activity, and build evidence chains for legal proceedings.
Aggregate, manage, and operationalize threat intelligence. Match IOCs from the incident against known threat actor infrastructure. Provide context: which group, which campaign, what TTPs they use next, and what other organizations have seen.
Network Detection and Response. Captures and analyzes network traffic — C2 communications, lateral movement, data exfiltration. During IR, full packet captures (pcaps) are invaluable for reconstructing exactly what data left the network and where it went.
Documents every action, finding, and decision throughout the incident lifecycle. Provides the official record for post-incident review, legal proceedings, and insurance claims. Should be accessible to all team members and timestamped automatically.
Safely execute suspicious files to observe their behavior. Static analysis examines the file without running it. Dynamic analysis watches what it does when it runs. Reverse engineering disassembles the code to understand its full capability.
During IR, the VM platform answers: what vulnerabilities exist on the affected systems, which were potentially exploited, and what's the fastest path to patching the initial access vector before the attacker returns.
A playbook is a documented, step-by-step response procedure for a specific type of incident. The difference between a chaotic response and a smooth one is almost entirely whether a good playbook existed and was followed.
Mature IR programs maintain separate playbooks for: ransomware, phishing, business email compromise (BEC), data exfiltration, insider threat, account takeover, DDoS, supply chain compromise, and cloud misconfiguration. Each has unique detection signals, containment steps, evidence to collect, and notification requirements — a one-size-fits-all approach fails under pressure.
Modern SOAR platforms automate the mechanical parts of playbooks — enriching alerts with threat intelligence, isolating endpoints via EDR API, blocking IPs at the firewall, resetting accounts in Active Directory, creating case records, and paging the right people. What took a human analyst 45 minutes of manual steps takes an automated playbook under 60 seconds — critically important during fast-moving ransomware or worm incidents.
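The shape of an automated playbook can be sketched in a few lines. Everything below is hypothetical: the connector classes and their methods (`EDRClient.isolate_host`, `Firewall.block_ip`, `Directory.reset_password`) are stand-ins for whatever vendor APIs a real SOAR platform wraps — the point is the orchestration pattern, not any specific product:

```python
# Hypothetical phishing-response playbook. Connector classes are mocked;
# in production each method would call a vendor API (EDR, firewall, AD).

class EDRClient:
    def isolate_host(self, host):
        return f"isolated {host}"

class Firewall:
    def block_ip(self, ip):
        return f"blocked {ip}"

class Directory:
    def reset_password(self, user):
        return f"reset {user}"

def phishing_playbook(alert, edr=EDRClient(), fw=Firewall(), ad=Directory()):
    """Automate the mechanical steps; leave judgment calls to the analyst."""
    actions = []
    actions.append(fw.block_ip(alert["sender_ip"]))           # cut off the source
    actions.append(ad.reset_password(alert["clicked_user"]))  # invalidate stolen creds
    if alert.get("payload_executed"):
        actions.append(edr.isolate_host(alert["host"]))       # contain the endpoint
    return actions

alert = {"sender_ip": "203.0.113.7", "clicked_user": "jdoe",
         "payload_executed": True, "host": "WS-1138"}
print(phishing_playbook(alert))
```

Each step here maps to a manual action an analyst would otherwise perform by hand — which is exactly where the minutes-to-seconds speedup comes from.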
A playbook is a high-level, strategic guide for handling an incident type — the overall process flow, decision points, escalation paths. A runbook is a low-level, technical procedure for a specific task within that playbook — step-by-step instructions for how to isolate a Linux server, or how to extract memory from a Windows machine. Playbooks reference runbooks; runbooks are the how-to details.
Digital forensics is the discipline of collecting, preserving, and analyzing digital evidence in a way that maintains its integrity — so it's usable in court, in regulatory proceedings, or in insurance claims.
Every piece of evidence must have a documented, unbroken chain of custody — who collected it, when, how, where it's been stored, and who has accessed it. If the chain of custody is broken, the evidence may be inadmissible in legal proceedings. IR teams must log every evidence handling event from the moment of collection through final disposition.
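Chain of custody pairs naturally with cryptographic hashing: fingerprint the evidence at collection time, then re-verify the hash at every handover. A minimal sketch (the class is illustrative; real evidence-management systems add signatures, storage locations, and case linkage):

```python
import hashlib
from datetime import datetime, timezone

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class CustodyLog:
    """Append-only record of every handling event for one evidence item."""
    def __init__(self, evidence_id, data: bytes, collected_by):
        self.evidence_id = evidence_id
        self.original_hash = sha256_of(data)   # fingerprint at collection time
        self.events = []
        self.record("collected", collected_by)

    def record(self, action, handler):
        self.events.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "handler": handler,
        })

    def verify(self, data: bytes) -> bool:
        # If the hash no longer matches, integrity -- and admissibility -- is gone.
        return sha256_of(data) == self.original_hash

image = b"...disk image bytes..."
log = CustodyLog("EV-001", image, "forensic_analyst")
log.record("transferred to evidence locker", "forensic_analyst")
print(log.verify(image))         # unchanged since collection
print(log.verify(image + b"x"))  # any modification is detectable
```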
Digital evidence exists on a spectrum of volatility — some disappears in seconds, some lasts years. The forensic rule: always collect the most volatile evidence first. The order: (1) CPU registers & cache, (2) RAM / memory, (3) Network state & connections, (4) Running processes, (5) Open files, (6) Disk contents, (7) Logs and configuration files. Memory especially — it's gone the moment the machine reboots.
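The volatility ordering above can be encoded directly, so a collection checklist always comes out most-volatile-first. A small sketch — the source names are this example's own shorthand:

```python
# The order of volatility from the text, most volatile first.
VOLATILITY_ORDER = [
    "cpu_registers_cache",
    "ram",
    "network_state",
    "running_processes",
    "open_files",
    "disk",
    "logs_and_config",
]

def collection_plan(requested_sources):
    """Return the requested evidence sources sorted most-volatile-first."""
    rank = {src: i for i, src in enumerate(VOLATILITY_ORDER)}
    return sorted(requested_sources, key=lambda s: rank[s])

print(collection_plan(["disk", "ram", "network_state"]))
# -> ['ram', 'network_state', 'disk']
```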
Forensic disk acquisition creates a bit-for-bit copy (image) of a storage device without modifying the original. Analysts examine the image — finding deleted files (which often aren't truly deleted), browsing history, artifact files (prefetch, registry hives, event logs, shellbags), and malware that was installed. Tools: FTK Imager, dd, Autopsy, EnCase, X-Ways.
RAM contains evidence that never touches the disk — encryption keys, plaintext passwords, injected shellcode, process hollowing artifacts, network connections, and running malware that only exists in memory. Memory forensics has become essential as attackers increasingly use fileless malware. The gold standard tool is Volatility — an open-source Python framework for analyzing memory dumps from any OS.
Full packet capture (pcap) allows analysts to reconstruct exactly what data was transmitted — what the attacker downloaded, what data was exfiltrated, what C2 commands were issued. Zeek (formerly Bro) generates rich network metadata logs. Wireshark analyzes individual packet captures. During IR, network forensics often proves data exfiltration occurred (or didn't) — critical for breach notification decisions.
Logs are the most commonly used forensic source — Windows Event Logs, Linux auth logs, web server logs, firewall logs, and VPN logs. Key Windows events every IR analyst must know: 4624 (successful logon), 4625 (failed logon), 4648 (logon with explicit credentials), 4688 (process creation), 4720 (account created), 7045 (new service installed). The SIEM is the primary log forensics tool, but raw log access is often needed for deep analysis.
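As a concrete example of working with those event IDs, here is a toy triage script that counts failed logons (4625) per account — a burst of failures against one account is a classic brute-force or password-spray signal. The records are drastically simplified; real Windows Security events carry dozens of fields:

```python
from collections import Counter

# Simplified Windows Security log records (illustrative data).
events = [
    {"event_id": 4625, "account": "svc-backup"},
    {"event_id": 4625, "account": "svc-backup"},
    {"event_id": 4625, "account": "svc-backup"},
    {"event_id": 4624, "account": "jdoe"},
    {"event_id": 4625, "account": "jdoe"},
]

def failed_logon_counts(events):
    """Count 4625 (failed logon) events per account."""
    return Counter(e["account"] for e in events if e["event_id"] == 4625)

def flag_bruteforce(events, threshold=3):
    """Accounts whose failure count meets the threshold deserve a closer look."""
    return [acct for acct, n in failed_logon_counts(events).items() if n >= threshold]

print(flag_bruteforce(events))  # -> ['svc-backup']
```

The same count-and-threshold pattern generalizes to 4688 process-creation spikes or 7045 service installs on hosts where new services are rare.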
Threat intelligence transforms a raw incident from "something bad happened" to "a specific threat actor used this technique to achieve this objective — and here's what they do next."
IOCs are specific, observable artifacts that indicate a system was involved in malicious activity. They include: IP addresses (C2 servers, attacker infrastructure), domain names (malicious domains, C2 domains), file hashes (MD5, SHA-256 of malicious files), URLs (phishing pages, payload delivery), email addresses (attacker accounts), registry keys (persistence locations), and mutex names (malware behavioral markers). IOCs are operationalized by loading them into the SIEM and EDR as detection rules — so if any system in the environment communicates with a known-bad IP, an alert fires.
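Operationalizing IOCs boils down to set membership checks against telemetry. A minimal sweep, using documentation addresses (TEST-NET IPs, example.com) and the EICAR test-file MD5 rather than real threat data:

```python
# Minimal IOC sweep: match telemetry against a known-bad indicator set.
# In production this runs continuously inside the SIEM/EDR.
iocs = {
    "ips": {"198.51.100.23"},
    "domains": {"c2.example.com"},
    "hashes": {"44d88612fea8a8f36de82e1278abb02f"},  # EICAR test file (MD5)
}

telemetry = [
    {"host": "WS-04", "dst_ip": "198.51.100.23", "domain": None, "file_hash": None},
    {"host": "WS-09", "dst_ip": "10.0.0.5", "domain": "c2.example.com", "file_hash": None},
    {"host": "WS-12", "dst_ip": "10.0.0.9", "domain": "intranet.local", "file_hash": None},
]

def sweep(telemetry, iocs):
    """Return hosts whose telemetry matches any indicator."""
    hits = []
    for t in telemetry:
        if (t["dst_ip"] in iocs["ips"]
                or t["domain"] in iocs["domains"]
                or t["file_hash"] in iocs["hashes"]):
            hits.append(t["host"])
    return hits

print(sweep(telemetry, iocs))  # -> ['WS-04', 'WS-09']
```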
TTPs describe how an attacker operates — the sequence of techniques they use, not just the specific artifacts they leave. MITRE ATT&CK is the universal vocabulary for TTPs — a knowledge base of 14 tactics (Initial Access, Execution, Persistence, Privilege Escalation, etc.) with hundreds of specific techniques under each. Knowing the TTPs of the threat actor in your environment tells you: what they've done, what they're likely to do next, and what detection rules to build. TTPs are much harder for attackers to change than IOCs.
MITRE ATT&CK is the most important knowledge base in modern IR. It catalogs real-world attacker behaviors across 14 tactics and 200+ techniques, with sub-techniques. During an incident, analysts map observed behaviors to ATT&CK techniques — creating a visual "ATT&CK Navigator" heatmap of what the attacker did. This shows what they haven't done yet, enabling proactive hunting for the next phase. Every major security vendor maps their detections to ATT&CK.
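The heatmap idea can be illustrated with a toy mapping. The technique IDs below are real ATT&CK identifiers, but the mapping is deliberately simplified — several of these techniques span multiple tactics, and a real Navigator layer covers the full matrix:

```python
# Toy ATT&CK mapping: a few real technique IDs with one tactic each
# (simplified -- e.g. T1053 also appears under Execution and Privilege
# Escalation in the full matrix).
TECHNIQUE_TACTIC = {
    "T1566": "Initial Access",    # Phishing
    "T1059": "Execution",         # Command and Scripting Interpreter
    "T1053": "Persistence",       # Scheduled Task/Job
    "T1021": "Lateral Movement",  # Remote Services
    "T1041": "Exfiltration",      # Exfiltration Over C2 Channel
}

observed = ["T1566", "T1059", "T1053"]

def tactic_coverage(observed):
    """Tactics the attacker has demonstrably reached -- and, by omission,
    where to hunt for the next phase of the attack."""
    seen = {TECHNIQUE_TACTIC[t] for t in observed}
    not_yet = [t for t in TECHNIQUE_TACTIC.values() if t not in seen]
    return sorted(seen), not_yet

seen, not_yet = tactic_coverage(observed)
print(seen)     # tactics confirmed in this incident
print(not_yet)  # tactics not yet observed -- candidate hunting targets
```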
The Cyber Kill Chain (Lockheed Martin, 2011) models attacks as seven sequential stages: Reconnaissance → Weaponization → Delivery → Exploitation → Installation → Command & Control → Actions on Objectives. Understanding which kill chain stage the attacker is in during an active incident tells the IR team: how far they've progressed, what evidence to look for, and what containment actions matter most right now. Disrupting any stage stops the attack.
Open source (OSINT): VirusTotal, AlienVault OTX, Abuse.ch, MISP. Commercial feeds: Recorded Future, Mandiant, CrowdStrike, Palo Alto Unit 42. Government: CISA Alerts, FBI Flash Reports, ISAC sharing communities (FS-ISAC for finance, H-ISAC for healthcare). Internal: IOCs discovered in your own previous incidents — often the most actionable intelligence of all.
You can't improve what you don't measure. These are the KPIs every mature IR program tracks.
How long from the first moment of compromise until the organization knows an incident occurred. The global average is 194 days. Organizations with mature detection capabilities reduce this to hours or days.
How long from detection to full remediation of the threat — through eradication and recovery. The faster this is, the less data is lost, the fewer systems are compromised, and the lower the overall cost. SOAR automation dramatically reduces MTTR.
How long from detection to the point where the attacker can no longer spread or exfiltrate data. Distinct from MTTR — containment stops the bleeding; full response includes eradication and recovery.
How long an attacker was present in the environment before being detected. The longer the dwell time, the more damage done. Shorter dwell time is the primary goal of proactive threat hunting programs.
What percentage of alerts are false positives. A high FPR means analysts waste time on noise. Good SIEM tuning, risk-based alerting (RBA), and ML-based alerting reduce FPR without reducing true positive detection.
How long from initial access until the attacker begins moving to other systems. CrowdStrike's adversary intelligence sets a "1-10-60" benchmark: detect in 1 min, investigate in 10 min, contain in 60 min to beat average breakout times.
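The metrics above reduce to arithmetic on incident timestamps. A sketch using one hypothetical incident (averaging these intervals across many incidents gives the "mean" in MTTD, MTTC, and MTTR):

```python
from datetime import datetime

# Illustrative incident timestamps (UTC). Per this guide's definitions:
# dwell time runs from compromise to detection; containment and full
# response times run from detection onward.
t_compromise = datetime(2025, 3, 1, 8, 0)
t_detected   = datetime(2025, 3, 4, 14, 30)
t_contained  = datetime(2025, 3, 4, 18, 0)
t_recovered  = datetime(2025, 3, 6, 9, 0)

def hours(delta):
    return round(delta.total_seconds() / 3600, 1)

dwell_time      = hours(t_detected - t_compromise)   # compromise -> detection
time_to_contain = hours(t_contained - t_detected)    # detection -> containment
time_to_respond = hours(t_recovered - t_detected)    # detection -> full recovery

print(dwell_time, time_to_contain, time_to_respond)  # 78.5 3.5 42.5
```

The hard part in practice is not the subtraction — it's pinning down `t_compromise`, which usually requires forensic timeline reconstruction after the fact.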
Each incident type has its own patterns, indicators, playbook requirements, and IR priorities. Here are the most common categories every IR professional must know.
Malware that encrypts files and demands payment for the decryption key. Modern ransomware operations (RaaS — Ransomware as a Service) involve human operators who spend weeks inside the network before deploying encryption — mapping backups, stealing data for double extortion, and maximizing impact. Key IR priorities: isolate immediately, determine dwell time, check backup integrity, identify initial access vector. Recovery without paying requires clean, tested, offline backups.
Phishing delivers malicious links or attachments. Business Email Compromise (BEC) impersonates executives or vendors to trigger fraudulent wire transfers or sensitive data disclosure — costing organizations $50B+ globally. IR focuses on: identifying who clicked, what credentials were harvested, whether email accounts were accessed, whether any financial transactions were initiated, and notifying affected users. Immediate password resets and MFA enforcement are critical containment steps.
An attacker uses stolen, guessed, or phished credentials to access legitimate accounts — bypassing most perimeter defenses since they look like the real user. Signs: logins from new countries or devices, impossible travel, access to unusual resources, large data downloads. IR involves: identifying all sessions active under the compromised account, reviewing all actions taken, revoking all tokens, resetting credentials, and investigating how credentials were obtained.
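"Impossible travel" detection is one of the few signals here that is easy to show concretely: compute the great-circle distance between two logins and check whether covering it would require superhuman speed. A self-contained sketch (the 900 km/h airliner-speed threshold is a common rule of thumb, not a standard):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def impossible_travel(login_a, login_b, max_kmh=900):
    """Flag two logins on the same account if the implied travel speed
    exceeds roughly a commercial airliner's (~900 km/h)."""
    dist = haversine_km(login_a["lat"], login_a["lon"], login_b["lat"], login_b["lon"])
    elapsed_h = abs(login_b["t"] - login_a["t"]) / 3600
    if elapsed_h == 0:
        return dist > 0   # simultaneous logins from different places
    return dist / elapsed_h > max_kmh

# Same account: New York, then Moscow one hour later -- thousands of km in 1 h.
ny  = {"lat": 40.7, "lon": -74.0, "t": 0}
msk = {"lat": 55.8, "lon": 37.6, "t": 3600}
print(impossible_travel(ny, msk))  # True
```

Real implementations add GeoIP uncertainty, VPN/proxy allowlists, and corporate egress points to keep the false-positive rate manageable.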
A current or former employee, contractor, or partner intentionally misuses access — exfiltrating data, sabotaging systems, or facilitating external attackers. IR is complicated by the attacker using legitimate credentials and authorized access paths. UEBA (User and Entity Behavior Analytics) is the primary detection mechanism. IR must balance speed (stopping damage) with sensitivity (avoiding wrongful accusations before evidence is confirmed), and typically involves HR and legal from the first moment.
Data being copied and sent outside the organization — customer records, IP, financial data, credentials. May occur as part of a ransomware attack (double extortion) or as a standalone espionage operation. Key evidence: unusually large outbound transfers, connections to cloud storage services (Mega, Dropbox), DNS tunneling, staged archive files (.zip, .rar) created in unusual locations. Network forensics (pcap analysis) is essential to quantify what was taken.
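The "unusually large outbound transfer" signal can be sketched as a simple filter over flow records. Everything here is illustrative — the destinations, the trusted-list approach, and the 1 GB threshold all need tuning against an environment's real baseline:

```python
# Toy flow records: bytes sent outbound per connection (illustrative data).
flows = [
    {"src": "WS-07", "dst": "mega.example", "bytes_out": 4_800_000_000},
    {"src": "WS-07", "dst": "intranet.local", "bytes_out": 12_000},
    {"src": "DB-01", "dst": "backup.local", "bytes_out": 900_000_000},
]

TRUSTED = {"intranet.local", "backup.local"}

def exfil_candidates(flows, threshold_bytes=1_000_000_000):
    """Flag large outbound transfers to untrusted destinations --
    a starting point for pcap-level review, not a verdict."""
    return [f for f in flows
            if f["dst"] not in TRUSTED and f["bytes_out"] > threshold_bytes]

for f in exfil_candidates(flows):
    print(f["src"], "->", f["dst"], f["bytes_out"])
```

Flagged flows are where the network forensics from the previous section comes in: the pcaps for those connections are what let you quantify exactly what left the network.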
Attackers compromise a trusted third party (software vendor, MSP, hardware supplier) to reach the real target. SolarWinds (2020) is the defining example — attackers inserted malicious code into a software update delivered to 18,000+ organizations. IR is especially difficult because the initial access vector appears completely legitimate — trusted software from a trusted vendor. Requires broad hunting across all systems that received the compromised update, not just suspicious endpoints.
An IR plan that has never been tested is just a document. These exercises are how organizations build real muscle memory — and find the gaps before attackers do.
A facilitated discussion where the IR team, leadership, legal, comms, and IT walk through a hypothetical incident scenario step by step — without touching any systems. The facilitator presents a scenario ("your SIEM just fired a ransomware alert on three servers in the finance department — what do you do next?") and the team talks through their response. Goals: test decision-making, validate communication paths, identify playbook gaps, and surface misunderstandings about roles. Low-cost, high-value. Should be run quarterly for high-maturity teams.
A collaborative exercise where a Red Team (attackers) simulates a realistic attack while the Blue Team (defenders) attempts to detect and respond. Unlike traditional red teaming (where the red team hides), purple teaming is transparent — both teams share TTPs, detection results, and gaps in real time. The output: concrete detection rule improvements and validated playbooks. Purple teaming is the fastest way to improve detection coverage against specific threat scenarios.
A hands-on simulation where the IR team responds to a simulated incident injected into a test environment (or even the live environment with prior approval). Tests not just decision-making but actual tool usage, response speed (MTTD and MTTR are measured), and the mechanics of containment and eradication. Should include surprise elements that weren't in the original scenario — because real incidents never follow the script. Run annually at minimum.
Every exercise — tabletop, purple team, or real incident — must produce a concrete action list with owners and deadlines. Common outputs: update playbook step 4 to include isolating the print server, add a detection rule for T1566.001 phishing, acquire a memory forensics tool, establish a pre-agreed retainer with an external IR firm, test backup restoration quarterly. Without this loop, exercises are theater. With it, they're how organizations measurably improve.
Incident Response is one of the most in-demand and well-compensated specializations in cybersecurity. Here's the certification landscape and how to build toward it.
Every important IR term from this guide, defined in plain English.