AI & NLP for Threat Intelligence (2025): Automate IOC/TTP Extraction, Summaries & ATT&CK Mapping

By CyberDudeBivash • September 21, 2025 (IST)

TL;DR 

  • What you’ll build: an end-to-end CTI pipeline that ingests reports/feeds → extracts IOCs & TTPs → normalizes/dedupes → maps to MITRE ATT&CK → publishes STIX 2.1 to your TIP (MISP/OpenCTI) and pushes detections to SIEM/SOAR. ATT&CK is your lingua franca for adversary behavior. MITRE ATT&CK+1
  • Why now: mature building blocks exist—spaCy/HuggingFace for NER, STIX/TAXII 2.1 for exchange, MISP/OpenCTI for knowledge graphs, ATT&CK Navigator for coverage views. MITRE ATT&CK+5spacy.io+5Hugging Face+5
  • Business win: shrink report-to-detection from days to minutes; measure precision/recall on extractions and coverage deltas per ATT&CK technique. (Use CISA’s mapping practices to keep analysts honest.) CISA

1) What problems AI actually solves in CTI

  • Speed: OCR/PDF → clean text → IOC/TTP extraction and entity linking at stream speed.
  • Normalization: inconsistent formats → STIX 2.1 objects (Indicator, Malware, Intrusion Set, Relationship). OASIS Open+1
  • Prioritization: summarize long reports; rank IOCs by observed-in and confidence; map to your detection gaps using ATT&CK. MITRE ATT&CK
  • Distribution: auto-publish via TAXII 2.1 to TIPs and subscribers. docs.oasis-open.org+1

2) Reference pipeline 

Ingest → Parse → NER/IOC extract → Validate → Normalize & De-dup → TTP extraction → ATT&CK mapping → STIX 2.1 pack → TAXII publish → SIEM/SOAR actions

2.1 Ingest & parsing

  • Accept PDF/HTML/blog/TWITTER/X feeds. Strip boilerplate; preserve line breaks for pattern-based cues (e.g., command blocks).

2.2 IOC extraction (NER + rules)

  • Use spaCy (fast, customizable) + Hugging Face token-classification models for domain/IP/hash/URL/CVE tags; backstop with regex/heuristics for high-precision patterns. spacy.io+1
  • Validate with shape checks (IPv4/IPv6 syntax, known-TLD list), flag typo-squats and known sinkhole addresses, and run any active DNS lookups only from a quarantined resolver.
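
The shape checks above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the author's implementation: the function names and the deliberately tiny KNOWN_TLDS set are hypothetical, and a real pipeline would load the full IANA or public-suffix list instead.

```python
import ipaddress
import re

# Hypothetical, deliberately tiny allow-list; load the full IANA TLD list in practice.
KNOWN_TLDS = {"com", "net", "org", "io", "ru", "cn", "info"}

# One DNS label: 1-63 characters, alphanumeric, hyphens only in the middle.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

def is_valid_ip(value: str) -> bool:
    """True for syntactically valid IPv4 or IPv6 addresses."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def is_routable_ip(value: str) -> bool:
    """True only for valid, globally routable addresses (drops 10.x, 127.x, ...)."""
    return is_valid_ip(value) and ipaddress.ip_address(value).is_global

def is_plausible_domain(value: str) -> bool:
    """Reject strings whose last label is not a known TLD (e.g. file names)."""
    labels = value.lower().rstrip(".").split(".")
    if len(labels) < 2 or labels[-1] not in KNOWN_TLDS:
        return False
    return all(_LABEL.match(label) for label in labels)
```

The TLD check is what keeps file names such as setup.exe out of the domain set; the routability check keeps private and loopback addresses out of blocklists.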

2.3 TTP extraction (behavior → techniques)

  • Pattern library for common textual cues → ATT&CK techniques; e.g., “mimikatz/lsass dump” → OS Credential Dumping (T1003); “regsvr32 /s /u /i:http” → System Binary Proxy Execution: Regsvr32 (T1218.010). Use ATT&CK technique pages as your source of truth. MITRE ATT&CK
  • Apply weak/medium/strong mapping rules and keep analyst review in the loop (see §5).

2.4 Normalize & de-dup

  • Canonicalize domains (evil[.]com → evil.com), hashes, and CVEs; merge by observable keys; attach source and confidence.
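
One way to sketch the canonicalize-and-merge step, assuming each raw extraction arrives as a (kind, value, source, confidence) tuple. The Observable class and the refang, canonical, and merge helpers are illustrative names, not part of any library, and the defang forms handled are only the most common ones.

```python
import re
from dataclasses import dataclass, field

def refang(value: str) -> str:
    """Undo common defanging: hxxp -> http, [.] / (.) / [dot] -> ., [:] -> :"""
    value = re.sub(r"hxxp", "http", value, flags=re.I)
    value = re.sub(r"\[\.\]|\(\.\)|\[dot\]", ".", value, flags=re.I)
    return value.replace("[:]", ":").strip()

def canonical(kind: str, value: str) -> str:
    """Canonical form used as the de-dup key for each observable kind."""
    value = refang(value)
    if kind in ("domain", "hash"):
        return value.lower().rstrip(".")
    if kind == "cve":
        return value.upper()
    return value

@dataclass
class Observable:
    kind: str
    value: str
    sources: set = field(default_factory=set)
    confidence: int = 0

def merge(records):
    """records: iterable of (kind, raw_value, source, confidence) tuples."""
    merged = {}
    for kind, raw, source, conf in records:
        key = (kind, canonical(kind, raw))
        obs = merged.setdefault(key, Observable(kind, key[1]))
        obs.sources.add(source)                          # keep provenance
        obs.confidence = max(obs.confidence, conf)       # keep strongest claim
    return list(merged.values())
```

Taking the maximum confidence is one possible policy; averaging or source-weighted scoring would fit the same structure.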

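2.5 Package & publish

  • Wrap the normalized indicators in STIX 2.1 objects and publish them over TAXII 2.1 to your TIP (MISP/OpenCTI); see §3.3 and §3.4.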

2.6 Close the loop

  • Use ATT&CK Navigator layers to visualize what techniques the intel covers vs your detections. Feed gaps to your SIEM/SOAR backlog. MITRE ATT&CK
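
A coverage layer of this kind can be generated as plain JSON. The sketch below assumes the Navigator layer fields name, domain, versions, and techniques (with techniqueID, score, and comment); the layer-format version string is an assumption to check against your Navigator release, and navigator_layer and gaps are hypothetical helper names.

```python
import json

def navigator_layer(name, intel_techniques, detected_techniques):
    """Score 1 = technique seen in intel and already detected, 0 = gap."""
    detected = set(detected_techniques)
    techniques = [
        {
            "techniqueID": tid,
            "score": 1 if tid in detected else 0,
            "comment": "covered" if tid in detected else "gap: no detection",
        }
        for tid in sorted(set(intel_techniques))
    ]
    return {
        "name": name,
        "domain": "enterprise-attack",
        # Assumed layer-format version; match it to your Navigator release.
        "versions": {"layer": "4.5"},
        "techniques": techniques,
    }

def gaps(layer):
    """Technique IDs that the intel mentions but no detection covers."""
    return [t["techniqueID"] for t in layer["techniques"] if t["score"] == 0]

# layer = navigator_layer("Report X coverage", ["T1003", "T1218.010"], ["T1003"])
# with open("coverage_layer.json", "w") as fh:
#     json.dump(layer, fh, indent=2)
```

The gaps list is what would feed the SIEM/SOAR backlog.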

3) Minimal working example (Python)

3.1 Extract IOCs with spaCy + Transformers

# pip install transformers tldextract
import re
import tldextract
from transformers import pipeline

# General-purpose HF NER model (PER/ORG/LOC/MISC). It tags actor and vendor names,
# not IOCs; swap in a CTI-tuned token-classification model if you have one.
ner = pipeline("token-classification", model="dslim/bert-base-NER")  # HF example

_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IOC_PATTERNS = {
    "ip": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
    "sha256": re.compile(r"\b[A-Fa-f0-9]{64}\b"),
    "md5": re.compile(r"\b[A-Fa-f0-9]{32}\b"),
    "url": re.compile(r"\bhttps?://[^\s)]+"),
}

def extract_iocs(text: str):
    out = {"ip": set(), "hash": set(), "url": set(), "domain": set(), "cve": set()}
    # Rule-based matches (patterns have no capturing groups, so findall returns full matches)
    for k, pat in IOC_PATTERNS.items():
        for m in pat.findall(text):
            if k == "ip":
                out["ip"].add(m)
            elif k in ("sha256", "md5"):
                out["hash"].add(m.lower())
            elif k == "url":
                out["url"].add(m.rstrip(".,;"))  # drop trailing sentence punctuation
    # Domains derived from the extracted URLs (URLs whose host is an IP are skipped)
    for u in out["url"]:
        ext = tldextract.extract(u)
        if ext.domain and ext.suffix:
            out["domain"].add(f"{ext.domain}.{ext.suffix}".lower())
    # Lightweight CVE match, normalized to upper case
    out["cve"].update(c.upper() for c in re.findall(r"CVE-\d{4}-\d{4,7}", text, flags=re.I))
    return {k: sorted(v) for k, v in out.items()}

(Hugging Face “token-classification”/NER pipeline & docs shown for reference.) Hugging Face+1

3.2 Map text snippets to ATT&CK techniques (heuristics)

import re

ATTACK_RULES = [
  (r"mimikatz|sekurlsa|lsass", "T1003"),          # OS Credential Dumping
  (r"regsvr32.*https?", "T1218.010"),             # System Binary Proxy Execution: Regsvr32
  (r"powershell.*-enc", "T1059.001"),             # PowerShell
  (r"(?:rundll32|dllhost).*url", "T1218"),        # System Binary Proxy Execution
  (r"certutil.*-urlcache", "T1105"),              # Ingress Tool Transfer
  (r"certutil.*-decode", "T1140"),                # Deobfuscate/Decode Files or Information
]

def map_ttps(text: str):
    hits = {}
    for pat, tech in ATTACK_RULES:
        m = re.search(pat, text, flags=re.I)
        if m:
            # Keep the matched text so analysts can verify each mapping
            hits.setdefault(tech, []).append(m.group(0))
    return [{"technique": t, "evidence": ev} for t, ev in hits.items()]

(Use MITRE ATT&CK technique catalog to validate mappings & keep rules refreshed.) MITRE ATT&CK

3.3 Emit a STIX 2.1 Indicator bundle (simplified)

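import json, uuid
from datetime import datetime, timezone

def _now():
    # STIX timestamps are RFC 3339, UTC, with a trailing "Z" (millisecond precision here)
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

def _quote(value: str) -> str:
    # Escape backslashes and single quotes inside a STIX pattern string literal
    return value.replace("\\", "\\\\").replace("'", "\\'")

def stix_indicator(ioc: str, object_path: str, label="malicious-activity", conf=70):
    ts = _now()  # created and modified are identical on a first version
    return {
      "type": "indicator",
      "spec_version": "2.1",
      "id": f"indicator--{uuid.uuid4()}",
      "created": ts,
      "modified": ts,
      "name": f"{object_path} = {ioc}",
      "pattern_type": "stix",
      "pattern": f"[{object_path} = '{_quote(ioc)}']",
      "valid_from": ts,  # required on STIX 2.1 Indicators
      "confidence": conf,
      "indicator_types": [label]
    }

# Hex digest length -> STIX hash key
HASH_KEYS = {32: "MD5", 40: "SHA-1", 64: "SHA-256"}

def stix_bundle(iocs):
    objs = []
    for ip in iocs["ip"]:
        objs.append(stix_indicator(ip, "ipv4-addr:value"))
    for d in iocs["domain"]:
        objs.append(stix_indicator(d, "domain-name:value"))
    for u in iocs["url"]:
        objs.append(stix_indicator(u, "url:value"))
    for h in iocs["hash"]:
        key = HASH_KEYS.get(len(h))  # skip digests of unknown length
        if key:
            objs.append(stix_indicator(h, f"file:hashes.'{key}'"))
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objs}

# Example
# text = open("report.txt").read()
# bundle = stix_bundle(extract_iocs(text))
# print(json.dumps(bundle, indent=2))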

(STIX 2.1 is the current exchange standard for CTI; see the OASIS spec & examples.) OASIS Open+1

3.4 TAXII 2.1 publish (conceptual)

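  • POST your objects to the TAXII 2.1 {api-root}/collections/{id}/objects/ endpoint with an API token and the application/taxii+json;version=2.1 media type. TAXII 2.1 expects an envelope ({"objects": [...]}) rather than a STIX bundle, so send the bundle's objects array. (See OASIS TAXII 2.1 for REST details.) docs.oasis-open.org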
  • On the receiving end, MISP or OpenCTI ingests and enriches (sightings, relationships, graph). misp-project.org+1
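
The publish call can be sketched with the standard library. This builds the request without sending it; the API root URL, collection ID, and bearer-token scheme are assumptions (TAXII servers differ in how they authenticate), and build_taxii_request is a hypothetical helper name.

```python
import json
import urllib.request

TAXII_MEDIA_TYPE = "application/taxii+json;version=2.1"

def build_taxii_request(api_root, collection_id, stix_objects, token):
    """Build (but do not send) a TAXII 2.1 Add Objects request."""
    url = f"{api_root.rstrip('/')}/collections/{collection_id}/objects/"
    envelope = {"objects": list(stix_objects)}  # TAXII 2.1 envelope, not a bundle
    return urllib.request.Request(
        url,
        data=json.dumps(envelope).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": TAXII_MEDIA_TYPE,
            "Accept": TAXII_MEDIA_TYPE,
            # Assumption: bearer token; your server may want Basic auth or an API-key header.
            "Authorization": f"Bearer {token}",
        },
    )

# Hypothetical endpoint and collection ID, shown for shape only:
# req = build_taxii_request("https://taxii.example.org/api1", "my-collection-id",
#                           bundle["objects"], "API_TOKEN")
# with urllib.request.urlopen(req) as resp:
#     status = json.load(resp)  # the server answers with a status resource
```

Separating request construction from sending keeps the payload inspectable in tests and dry runs before anything reaches a production collection.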

4) Integrations that matter (and why)

Layer | Tooling | Why it helps
TIP | OpenCTI, MISP | Knowledge graphing, STIX in/out, connectors, collaboration. docs.opencti.io+1
Exchange | STIX 2.1 / TAXII 2.1 | Vendor-neutral, standards-based sharing/publishing. OASIS Open+1
Mapping/coverage | MITRE ATT&CK + Navigator | Normalized TTPs and visualization of detection gaps. MITRE ATT&CK+1
Extraction | spaCy, Transformers (HF) | Production-grade NER + customizable models. spacy.io+1

5) Human-in-the-loop (HITL) keeps you honest

  • Analyst review gates: promote items to “published” only after a short check of precision (especially TTP mappings).
  • CISA’s ATT&CK mapping guidance: avoid “wishful mapping” and biases; require evidence strings linking text to technique IDs. CISA
  • Feedback loops: false positives go back to training (regex tweaks, prompt updates, model fine-tuning).
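
The review gate can be sketched as a small routing function. This is an illustration under assumptions, not a prescribed workflow: TechniqueMapping, route, review_queue, and the three tier labels are hypothetical names, and the rule encoded is simply "no evidence string, no mapping; no analyst approval, no publish".

```python
from dataclasses import dataclass

TIER_PRIORITY = {"weak": 0, "medium": 1, "strong": 2}

@dataclass
class TechniqueMapping:
    technique: str          # ATT&CK ID, e.g. "T1003"
    evidence: str           # verbatim report text supporting the mapping
    tier: str               # "weak", "medium", or "strong"
    approved: bool = False  # set by an analyst, never by the extractor

def route(mapping: TechniqueMapping) -> str:
    """Return 'rejected', 'published', or 'review' for one mapping."""
    if not mapping.evidence.strip() or mapping.tier not in TIER_PRIORITY:
        return "rejected"  # mappings without evidence never reach an analyst
    return "published" if mapping.approved else "review"

def review_queue(mappings):
    """Pending mappings, strongest evidence first."""
    pending = [m for m in mappings if route(m) == "review"]
    return sorted(pending, key=lambda m: TIER_PRIORITY[m.tier], reverse=True)
```

Rejected and analyst-overturned mappings are the natural input for the feedback loop, since each one points at a rule or model output that needs tuning.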

6) Quality & ROI: measure these, or it didn’t happen

  • Extraction P/R/F1 for IOCs & TTPs (label 200–500 sentences; update quarterly).
  • Latency: ingest→publish p50/p95.
  • Coverage delta: techniques with active detections before vs after intel import (Navigator layer diff). MITRE ATT&CK
  • SOC impact: time saved per case, auto-enrichment hit rate, ratio of auto-closed low-risk alerts.
  • Cost to value: GPU/CPU time vs analyst hours saved.
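
The first three metrics reduce to a few lines of set arithmetic. A minimal sketch, assuming extractions and golden labels are compared as sets of exact strings and latencies are recorded in seconds; the function names are illustrative.

```python
from statistics import quantiles

def prf1(predicted, gold):
    """Precision, recall, and F1 of extracted items against a golden set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def coverage_delta(before, after):
    """Technique IDs newly covered by detections after the intel import."""
    return sorted(set(after) - set(before))

def latency_percentiles(seconds):
    """p50/p95 of ingest-to-publish latency; needs at least two samples."""
    cuts = quantiles(seconds, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}
```

Exact-match scoring is the strictest choice; partial-span credit for TTP sentences would need a different comparison.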

7) Production safeguards

  • Confidence scoring & source weighting (vendor reputation, age, sightings).
  • De-dup & decay: older IOCs auto-downgrade unless re-sighted.
  • Toxic data filters: block “copy-pasted” attack chains from Reddit/unknown gists without corroboration.
  • Tenant-aware exports: separate workforce vs customer intel where licensing requires it.
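
Source weighting and decay can be combined in one scoring function. The sketch below assumes exponential decay with a 30-day half-life and a 0 to 1 source weight; both values and the decayed_confidence name are placeholders to tune, not recommendations from any standard.

```python
from datetime import datetime, timezone

def decayed_confidence(base, last_seen, now=None,
                       half_life_days=30.0, source_weight=1.0):
    """Confidence (0-100) that halves every half_life_days since the last sighting."""
    now = now or datetime.now(timezone.utc)
    age_days = max((now - last_seen).total_seconds() / 86400, 0.0)
    score = base * source_weight * 0.5 ** (age_days / half_life_days)
    return max(0, min(100, round(score)))

# A re-sighting resets last_seen, which restores the score on the next evaluation.
```

An indicator whose score drops below a chosen threshold can then be expired from blocklists while staying in the TIP for historical lookups.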

8) 30/60/90-day rollout

Days 1–30 (Pilot)

  • Stand up OpenCTI or MISP; wire TAXII input, attach a small set of trusted sources. docs.opencti.io+1
  • Ship IOC extraction + basic ATT&CK heuristics; publish STIX 2.1 to a sandbox collection. OASIS Open
  • Start a 200-sentence golden set for evaluation.

Days 31–60 (Harden)

  • Add HITL UI, confidence tiers, and auto-dedup; enrich with WHOIS/passive DNS; auto-create Navigator layers for coverage reviews. MITRE ATT&CK
  • Begin SIEM/SOAR wiring: blocklists for high-confidence IOCs; analytics for common techniques.

Days 61–90 (Operate)

  • Expand TTP rules; add model fine-tuning for domain-specific jargon; schedule weekly metrics; open TAXII to internal consumers. docs.oasis-open.org

9) Playbooks 

IOC → Action (high-confidence)

  1. Publish STIX Indicator (+ Sighting if seen).
  2. Create SOAR task to block (URL/IP/hash) and hunt last 30 days.
  3. Expire after N days without sightings.

TTP → Action

  1. Add ATT&CK technique to Navigator; check detection gap. MITRE ATT&CK
  2. If gap exists: create SIEM rule/sigma/JEA script task.
  3. Backfill search & case.

10) Build vs buy (fast guidance)

  • Buy platform; build extractors. Most teams win with a commercial/open TIP + custom NLP on top.
  • Red flags: no STIX/TAXII, no ATT&CK alignment, black-box ML without feedback loops, no export to SIEM/SOAR.

FAQs

Is LLM summarization safe for CTI?
Yes, with prompt constraints, source citations, and a human approval step for high-impact summaries.

Why not rely only on regex?
Rules give precision; ML adds recall and generalizes to unseen formats. Use both.

Can we auto-map techniques?
Use weak/strong evidence tiers + analyst review. CISA’s paper highlights common mapping errors—treat it as policy. CISA


#CyberDudeBivash #ThreatIntelligence #NLP #AI #CTI #IOC #TTP #MITREATTACK #STIX #TAXII #MISP #OpenCTI #SIEM #SOAR #XDR #SOCAutomation #OSINT #Summarization #EntityRecognition
