
TL;DR
- What you’ll build: an end-to-end CTI pipeline that ingests reports/feeds → extracts IOCs & TTPs → normalizes/dedupes → maps to MITRE ATT&CK → publishes STIX 2.1 to your TIP (MISP/OpenCTI) and pushes detections to SIEM/SOAR. ATT&CK is your lingua franca for adversary behavior.
- Why now: mature building blocks exist: spaCy/Hugging Face for NER, STIX/TAXII 2.1 for exchange, MISP/OpenCTI for knowledge graphs, ATT&CK Navigator for coverage views.
- Business win: shrink report-to-detection from days to minutes; measure precision/recall on extractions and coverage deltas per ATT&CK technique. (Use CISA’s mapping practices to keep analysts honest.)
1) What problems AI actually solves in CTI
- Speed: OCR/PDF → clean text → IOC/TTP extraction and entity linking at stream speed.
- Normalization: inconsistent formats → STIX 2.1 objects (Indicator, Malware, Intrusion Set, Relationship).
- Prioritization: summarize long reports; rank IOCs by where they were observed and by confidence; map to your detection gaps using ATT&CK.
- Distribution: auto-publish via TAXII 2.1 to TIPs and subscribers.
2) Reference pipeline
Ingest → Parse → NER/IOC extract → Validate → Normalize & De-dup → TTP extraction → ATT&CK mapping → STIX 2.1 pack → TAXII publish → SIEM/SOAR actions
2.1 Ingest & parsing
- Accept PDF/HTML/blog/Twitter/X feeds. Strip boilerplate; preserve line breaks for pattern-based cues (e.g., command blocks).
2.2 IOC extraction (NER + rules)
- Use spaCy (fast, customizable) + Hugging Face token-classification models for domain/IP/hash/URL/CVE tags; backstop with regex/heuristics for high-precision patterns.
- Validate with shape checks (IPv4/6, TLD list), flag sinkholed and typosquatted domains, and run active DNS lookups only from a quarantined resolver.
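The shape checks above can be sketched with the standard library alone. This is a minimal illustration, not a complete validator: `KNOWN_SUFFIXES` is a hypothetical stand-in for a full TLD/public-suffix list, the function names are invented for this example, and the sinkhole and DNS checks are left out because they need network access.

```python
import ipaddress
import re

# Hypothetical, tiny suffix allow-list; a real pipeline would load a full TLD list.
KNOWN_SUFFIXES = {"com", "net", "org", "io", "info", "ru", "cn"}

# One or more DNS labels followed by an alphabetic TLD, max 253 characters overall.
DOMAIN_SHAPE = re.compile(
    r"^(?=.{1,253}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)

def is_routable_ip(value: str) -> bool:
    """True for syntactically valid IPv4/IPv6 addresses outside special-use ranges."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return False
    return not (
        ip.is_private or ip.is_loopback or ip.is_reserved
        or ip.is_multicast or ip.is_link_local or ip.is_unspecified
    )

def is_plausible_domain(value: str) -> bool:
    """True when the domain has a valid shape and a suffix from the allow-list."""
    candidate = value.strip().lower().rstrip(".")
    if not DOMAIN_SHAPE.match(candidate):
        return False
    return candidate.rsplit(".", 1)[-1] in KNOWN_SUFFIXES

print(is_routable_ip("8.8.8.8"), is_plausible_domain("evil.com"))
```

Dropping private and reserved addresses early keeps internal lab IPs quoted in reports from being published as indicators.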
2.3 TTP extraction (behavior → techniques)
- Pattern library for common textual cues → ATT&CK techniques; e.g., “mimikatz/lsass dump” → OS Credential Dumping (T1003); “regsvr32 /s /u /i:http” → System Binary Proxy Execution (T1218, sub-technique T1218.010 Regsvr32). Use ATT&CK technique pages as your source of truth.
- Apply weak/medium/strong mapping rules and keep analyst review in the loop (see §5).
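One way to express weak/medium/strong mapping rules is to attach a tier to each pattern and keep only the strongest evidence per technique. The sketch below is an assumption about how such tiers could look: the `Rule` class, the tier assignments, and the `needs_review` flag are illustrative, not part of ATT&CK or any CISA guidance.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    pattern: str    # regex applied case-insensitively
    technique: str  # ATT&CK technique ID
    tier: str       # "weak", "medium" or "strong" (hypothetical scheme)

RULES = [
    Rule(r"sekurlsa::logonpasswords", "T1003.001", "strong"),  # exact command
    Rule(r"\bmimikatz\b", "T1003", "medium"),                  # tool name only
    Rule(r"\blsass\b", "T1003", "weak"),                       # generic keyword
]
TIER_RANK = {"weak": 1, "medium": 2, "strong": 3}

def tiered_mappings(text: str) -> list:
    """Return the strongest matching rule per technique, with its evidence string."""
    best = {}
    for rule in RULES:
        match = re.search(rule.pattern, text, flags=re.I)
        if not match:
            continue
        current = best.get(rule.technique)
        if current is None or TIER_RANK[rule.tier] > TIER_RANK[current["tier"]]:
            best[rule.technique] = {
                "technique": rule.technique,
                "tier": rule.tier,
                "evidence": match.group(0),
                # Assumption: anything below "strong" always goes to an analyst.
                "needs_review": rule.tier != "strong",
            }
    return sorted(best.values(), key=lambda item: item["technique"])

print(tiered_mappings("The actor ran mimikatz against lsass."))
```

Keeping the matched text as evidence gives the reviewer the exact string that triggered the mapping.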
2.4 Normalize & de-dup
- Canonicalize domains (evil[.]com → evil.com), hashes, and CVEs; merge by observable keys; attach source and confidence.
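The canonicalize-then-merge step can be sketched as follows. The observation fields (`kind`, `value`, `source`, `confidence`) and the rule of keeping the highest confidence are assumptions made for this example; the refang patterns cover only a few common defanging styles.

```python
import re

def refang(value: str) -> str:
    """Undo common defanging: evil[.]com, evil(.)com, evil[dot]com, hxxp://, [:]."""
    v = value.strip()
    v = re.sub(r"\[\.\]|\(\.\)|\[dot\]", ".", v, flags=re.I)
    v = re.sub(r"^hxxp", "http", v, flags=re.I)
    return v.replace("[:]", ":")

def canonical_key(kind: str, value: str) -> tuple:
    """Build the key used to merge duplicates of the same observable."""
    v = refang(value)
    if kind in ("domain", "hash"):
        v = v.lower().rstrip(".")
    elif kind == "cve":
        v = v.upper()
    return (kind, v)

def merge(observations: list) -> list:
    """Merge observations by canonical key, collecting sources and max confidence."""
    merged = {}
    for obs in observations:
        key = canonical_key(obs["kind"], obs["value"])
        entry = merged.setdefault(
            key, {"kind": key[0], "value": key[1], "sources": set(), "confidence": 0}
        )
        entry["sources"].add(obs["source"])
        entry["confidence"] = max(entry["confidence"], obs["confidence"])
    return list(merged.values())

rows = merge([
    {"kind": "domain", "value": "evil[.]com", "source": "vendor-a", "confidence": 60},
    {"kind": "domain", "value": "EVIL.com", "source": "vendor-b", "confidence": 80},
])
print(rows)
```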
2.5 Package & publish
- Emit STIX 2.1 Indicator + Sighting + Relationship objects; push via TAXII 2.1 to MISP or OpenCTI; both speak STIX and have broad integrations.
2.6 Close the loop
- Use ATT&CK Navigator layers to visualize what techniques the intel covers vs your detections. Feed gaps to your SIEM/SOAR backlog.
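A Navigator layer is a JSON document, so a coverage view can be generated from two sets of technique IDs. The sketch below uses the `techniqueID`, `score`, and `comment` fields of the layer format; the version numbers in `versions`, the scoring convention (1 = gap, 0 = covered), and the function name are assumptions to check against the Navigator release you run.

```python
import json

def build_layer(name: str, intel_techniques: set, detected_techniques: set) -> dict:
    """Build a Navigator-style layer marking intel techniques as covered or gap."""
    techniques = []
    for tid in sorted(intel_techniques):
        covered = tid in detected_techniques
        techniques.append({
            "techniqueID": tid,
            "score": 0 if covered else 1,  # assumed convention: 1 marks a gap
            "comment": "covered" if covered else "gap",
        })
    return {
        "name": name,
        "domain": "enterprise-attack",
        # Assumed version values; align them with your Navigator instance.
        "versions": {"attack": "14", "navigator": "4.9.1", "layer": "4.5"},
        "techniques": techniques,
    }

layer = build_layer("Intel vs detections", {"T1003", "T1105"}, {"T1105"})
gaps = [t["techniqueID"] for t in layer["techniques"] if t["comment"] == "gap"]
print(json.dumps(layer, indent=2))
print("Backlog candidates:", gaps)
```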
3) Minimal working example (Python)
3.1 Extract IOCs with spaCy + Transformers

    # pip install transformers tldextract
    import re
    import tldextract
    from transformers import pipeline

    # General-purpose NER (PER/ORG/LOC/MISC): useful for actor and vendor names, but it
    # does not tag IOCs. Fine-tune a token-classification model for IOC labels;
    # extract_iocs() below relies on rules only.
    ner = pipeline("token-classification", model="dslim/bert-base-NER")  # HF example

    OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
    IOC_PATTERNS = {
        "ip": re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"),
        "sha256": re.compile(r"\b[A-Fa-f0-9]{64}\b"),
        "md5": re.compile(r"\b[A-Fa-f0-9]{32}\b"),
        "url": re.compile(r"\bhttps?://[^\s)]+"),
    }

    def extract_iocs(text: str):
        out = {"ip": set(), "hash": set(), "url": set(), "domain": set(), "cve": set()}
        # Rule-based extraction
        for k, pat in IOC_PATTERNS.items():
            for m in pat.findall(text):
                if k == "ip":
                    out["ip"].add(m)
                elif k in ("sha256", "md5"):
                    out["hash"].add(m.lower())
                elif k == "url":
                    out["url"].add(m.rstrip(".,;"))  # drop trailing sentence punctuation
        # Registered domains derived from the extracted URLs
        for u in out["url"]:
            ext = tldextract.extract(u)
            if ext.domain and ext.suffix:
                out["domain"].add(f"{ext.domain}.{ext.suffix}".lower())
        # Lightweight CVE match, normalized to upper case
        out["cve"].update(c.upper() for c in re.findall(r"CVE-\d{4}-\d{4,7}", text, flags=re.I))
        return {k: sorted(v) for k, v in out.items()}

(Hugging Face “token-classification”/NER pipeline & docs shown for reference.)
3.2 Map text snippets to ATT&CK techniques (heuristics)

    import re

    ATTACK_RULES = [
        (r"mimikatz|sekurlsa|lsass", "T1003"),      # OS Credential Dumping
        (r"regsvr32.*https?", "T1218.010"),         # Regsvr32 proxy exec
        (r"powershell.*-enc", "T1059.001"),         # PowerShell
        (r"(?:rundll32|dllhost).*url", "T1218"),    # System Binary Proxy Execution
        (r"certutil.*-urlcache", "T1105"),          # Ingress Tool Transfer
        (r"certutil.*-decode", "T1140"),            # Deobfuscate/Decode Files or Information
    ]

    def map_ttps(text: str):
        hits = {}
        for pat, tech in ATTACK_RULES:
            m = re.search(pat, text, flags=re.I)
            if m:
                # Keep the matched text so reviewers can see why the rule fired
                hits.setdefault(tech, []).append(m.group(0))
        return [{"technique": t, "evidence": ev} for t, ev in hits.items()]

(Use MITRE ATT&CK technique catalog to validate mappings & keep rules refreshed.)
3.3 Emit a STIX 2.1 Indicator bundle (simplified)

    import json
    import uuid
    import datetime as dt

    def stix_indicator(ioc: str, object_path: str, indicator_type="malicious-activity", conf=70):
        # One timestamp for created/modified/valid_from (valid_from is required in STIX 2.1)
        now = dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
        return {
            "type": "indicator",
            "spec_version": "2.1",
            "id": f"indicator--{uuid.uuid4()}",
            "created": now,
            "modified": now,
            "valid_from": now,
            "name": f"{object_path.split(':')[0]}:{ioc}",
            "pattern_type": "stix",
            "pattern": f"[{object_path} = '{ioc}']",
            "confidence": conf,
            "indicator_types": [indicator_type],
        }

    def stix_bundle(iocs):
        objs = []
        for ip in iocs["ip"]:
            objs.append(stix_indicator(ip, "ipv4-addr:value"))
        for d in iocs["domain"]:
            objs.append(stix_indicator(d, "domain-name:value"))
        for h in iocs["hash"]:
            # Pick the hash key by digest length: 64 hex chars = SHA-256, 32 = MD5
            path = "file:hashes.'SHA-256'" if len(h) == 64 else "file:hashes.MD5"
            objs.append(stix_indicator(h, path))
        return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objs}

    # Example
    # text = open("report.txt").read()
    # bundle = stix_bundle(extract_iocs(text))
    # print(json.dumps(bundle, indent=2))

(STIX 2.1 is the current exchange standard for CTI; see the OASIS spec & examples.)
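Section 2.5 also calls for Sighting and Relationship objects. The sketch below shows one plausible shape for them: an Attack Pattern that references an ATT&CK technique ID, an "indicates" Relationship from an Indicator to it, and a Sighting of the Indicator. The helper names are invented for this example, and in practice you would probably reuse the attack-pattern IDs from MITRE's published ATT&CK STIX data rather than minting new ones.

```python
import uuid
import datetime as dt

def _now() -> str:
    """UTC timestamp in the RFC 3339 form STIX expects."""
    return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")

def _base(stix_type: str) -> dict:
    """Common properties shared by the objects built below."""
    now = _now()
    return {
        "type": stix_type,
        "spec_version": "2.1",
        "id": f"{stix_type}--{uuid.uuid4()}",
        "created": now,
        "modified": now,
    }

def attack_pattern(technique_id: str, name: str) -> dict:
    obj = _base("attack-pattern")
    obj["name"] = name
    obj["external_references"] = [
        {"source_name": "mitre-attack", "external_id": technique_id}
    ]
    return obj

def relationship(source_id: str, relationship_type: str, target_id: str) -> dict:
    obj = _base("relationship")
    obj.update(
        relationship_type=relationship_type, source_ref=source_id, target_ref=target_id
    )
    return obj

def sighting(indicator_id: str, count: int = 1) -> dict:
    obj = _base("sighting")
    obj.update(sighting_of_ref=indicator_id, count=count)
    return obj

# Placeholder indicator ID; normally taken from an Indicator you already emitted.
indicator_id = f"indicator--{uuid.uuid4()}"
technique = attack_pattern("T1003", "OS Credential Dumping")
link = relationship(indicator_id, "indicates", technique["id"])
seen = sighting(indicator_id, count=3)
print(link["relationship_type"], seen["count"])
```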
3.4 TAXII 2.1 publish (conceptual)
- POST the bundle's objects, wrapped in a TAXII envelope, to the TAXII 2.1 collections/{id}/objects/ endpoint with an API token. (See OASIS TAXII 2.1 for REST details.)
- On the receiving end, MISP or OpenCTI ingests and enriches (sightings, relationships, graph).
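The publish step can be sketched with the standard library. TAXII 2.1 expects the objects inside an envelope (`{"objects": [...]}`) sent with the `application/taxii+json;version=2.1` media type. Everything else here is an assumption: the API root URL and collection ID are placeholders, and the bearer-token header depends on how your server handles authentication. The function only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

TAXII_MEDIA_TYPE = "application/taxii+json;version=2.1"

def build_taxii_request(api_root: str, collection_id: str,
                        stix_objects: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that adds objects to a TAXII 2.1 collection."""
    url = f"{api_root.rstrip('/')}/collections/{collection_id}/objects/"
    envelope = json.dumps({"objects": stix_objects}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=envelope,
        method="POST",
        headers={
            "Accept": TAXII_MEDIA_TYPE,
            "Content-Type": TAXII_MEDIA_TYPE,
            # Assumption: the server accepts bearer tokens.
            "Authorization": f"Bearer {token}",
        },
    )

# Hypothetical server and collection; replace with your own values.
req = build_taxii_request(
    "https://taxii.example.org/api1/", "sandbox-collection",
    [{"type": "indicator", "spec_version": "2.1"}], "REPLACE_ME",
)
print(req.get_method(), req.full_url)
# To send: urllib.request.urlopen(req) and read the returned status resource.
```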
4) Integrations that matter (and why)
| Layer | Tooling | Why it helps |
|---|---|---|
| TIP | OpenCTI, MISP | Knowledge graphing, STIX in/out, connectors, collaboration. |
| Exchange | STIX 2.1 / TAXII 2.1 | Vendor-neutral, standards-based sharing/publishing. |
| Mapping/coverage | MITRE ATT&CK + Navigator | Normalized TTPs and visualization of detection gaps. |
| Extraction | spaCy, Transformers (HF) | Production-grade NER + customizable models. |
5) Human-in-the-loop (HITL) keeps you honest
- Analyst review gates: promote items to “published” only after a short check of precision (especially TTP mappings).
- CISA’s ATT&CK mapping guidance: avoid “wishful mapping” and biases; require evidence strings linking text to technique IDs.
- Feedback loops: false positives go back to training (regex tweaks, prompt updates, model fine-tuning).
6) Quality & ROI: measure these, or it didn’t happen
- Extraction P/R/F1 for IOCs & TTPs (label 200–500 sentences; update quarterly).
- Latency: ingest→publish p50/p95.
- Coverage delta: techniques with active detections before vs after intel import (Navigator layer diff).
- SOC impact: time saved per case, auto-enrichment hit rate, ratio of auto-closed low-risk alerts.
- Cost to value: GPU/CPU time vs analyst hours saved.
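Extraction P/R/F1 and the coverage delta reduce to set arithmetic once you have a labeled golden set. A minimal sketch, assuming predictions and gold labels are sets of normalized strings (IOC values or technique IDs); the function names are illustrative.

```python
def prf1(predicted: set, gold: set) -> dict:
    """Precision, recall and F1 of predicted items against a labeled gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def coverage_delta(before: set, after: set) -> list:
    """Techniques that gained detection coverage between two snapshots."""
    return sorted(after - before)

scores = prf1({"evil.com", "1.2.3.4", "bad.net"}, {"1.2.3.4", "bad.net", "worse.org"})
print(scores)
print(coverage_delta({"T1003"}, {"T1003", "T1105", "T1059.001"}))
```

Here two of three predictions are correct and two of three gold items are found, so precision, recall, and F1 all come out to 2/3.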
7) Production safeguards
- Confidence scoring & source weighting (vendor reputation, age, sightings).
- De-dup & decay: older IOCs auto-downgrade unless re-sighted.
- Toxic data filters: block “copy-pasted” attack chains from Reddit/unknown gists without corroboration.
- Tenant-aware exports: separate workforce vs customer intel where licensing requires it.
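Confidence decay and source weighting can be combined in one scoring function. The sketch below assumes exponential decay with a 30-day half-life and a per-source weight between 0 and 1; both are illustrative defaults, not recommendations from any standard. Re-sighting an IOC resets `last_seen` and therefore restores the score.

```python
import datetime as dt

def decayed_confidence(base: int, last_seen: dt.datetime, now: dt.datetime,
                       half_life_days: float = 30.0, source_weight: float = 1.0) -> int:
    """Halve the weighted confidence for every half-life elapsed since last sighting."""
    age_days = max((now - last_seen).total_seconds() / 86400, 0.0)
    score = base * source_weight * 0.5 ** (age_days / half_life_days)
    return max(0, min(100, round(score)))

now = dt.datetime(2024, 3, 1, tzinfo=dt.timezone.utc)
fresh = decayed_confidence(80, now, now)
month_old = decayed_confidence(80, now - dt.timedelta(days=30), now)
stale = decayed_confidence(80, now - dt.timedelta(days=60), now)
print(fresh, month_old, stale)
```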
8) 30/60/90-day rollout
Days 1–30 (Pilot)
- Stand up OpenCTI or MISP; wire TAXII input, attach a small set of trusted sources.
- Ship IOC extraction + basic ATT&CK heuristics; publish STIX 2.1 to a sandbox collection.
- Start a 200-sentence golden set for evaluation.
Days 31–60 (Harden)
- Add HITL UI, confidence tiers, and auto-dedup; enrich with WHOIS/passive DNS; auto-create Navigator layers for coverage reviews.
- Begin SIEM/SOAR wiring: blocklists for high-confidence IOCs; analytics for common techniques.
Days 61–90 (Operate)
- Expand TTP rules; add model fine-tuning for domain-specific jargon; schedule weekly metrics; open TAXII to internal consumers.
9) Playbooks
IOC → Action (high-confidence)
- Publish STIX Indicator (+ Sighting if seen).
- Create SOAR task to block (URL/IP/hash) and hunt last 30 days.
- Expire after N days without sightings.
TTP → Action
- Add ATT&CK technique to Navigator; check detection gap.
- If gap exists: create a task for a SIEM rule, Sigma rule, or JEA script.
- Backfill search & case.
10) Build vs buy (fast guidance)
- Buy platform; build extractors. Most teams win with a commercial/open TIP + custom NLP on top.
- Red flags: no STIX/TAXII, no ATT&CK alignment, black-box ML without feedback loops, no export to SIEM/SOAR.
FAQs
Is LLM summarization safe for CTI?
Yes—with prompt constraints, source citations, and a human approval step for high-impact summaries.
Why not rely only on regex?
Rules give precision; ML adds recall and generalizes to unseen formats. Use both.
Can we auto-map techniques?
Use weak/strong evidence tiers + analyst review. CISA’s paper highlights common mapping errors; treat it as policy.
Sources & primers
- MITRE ATT&CK enterprise matrix, techniques & tools (Navigator).
- STIX 2.1 spec & examples; TAXII 2.1 spec & intro docs.
- MISP project docs; OpenCTI docs & repo.
- spaCy NER API & 101; Hugging Face token-classification pipelines.
- CISA: Best Practices for ATT&CK Mapping (analyst bias & evidence).