Executive summary
Detection and hunting are how defenders turn telemetry into decisions. Detection engineering codifies known bad (TTPs) into reliable, low-noise analytics. Threat hunting is the hypothesis-driven search for the unknown using context, anomaly signals, and analyst intuition. This playbook gives you a production-ready approach: what data to collect, how to shape it, patterns to detect, practical queries for Windows/Linux/Cloud, hunting workflows, quality gates, and KPIs.
1) Foundations: detection vs hunting
- Detection engineering: repeatable analytics (“detections-as-code”) with tests, owners, deployment pipelines, and SLAs. Output = alerts.
- Threat hunting: iterative investigations without waiting for alerts. Output = new detections, intel, hardening tasks.
Both share the same raw materials: telemetry → normalization → enrichment → analytics → action.
2) Telemetry strategy (Minimal Viable Telemetry)
Endpoint
- Process: parent/child, full command line, integrity level, hashes, signer, image path.
- File: create/rename/delete, entropy, extension/type mismatch.
- Registry (Win): run keys, services, LSA providers.
- Network: per-process flows (dst, port, bytes, JA3/JA4, SNI/host).
- Memory: module loads, injection indicators (RWX, VirtualAllocEx, CreateRemoteThread).
Identity & SaaS
- Auth logs: success/fail, MFA, geo, device posture, risk flags.
- OAuth: consent grants, new app registrations, token lifetimes/scopes.
- Mail/Drive/Share: sharing changes, mass downloads/deletes.
Cloud
- Control plane: IAM changes, policy updates, key usage.
- Data plane: object access, egress byte deltas.
- Compute: metadata service access, container exec/priv-esc, unusual images.
Network (sensor or cloud PCAP/flow)
- DNS (query name, NXDomain rate, TTL), HTTP (host, path, UA), TLS (SNI, JA3/JA4), NetFlow.
Normalize with a schema (ECS/OSSEM) and time-sync everything (NTP). Enrich with asset/owner tags, GeoIP/ASN, threat intel, process reputation.
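As a rough illustration of the normalize-then-enrich step, the sketch below maps a vendor-style process event onto ECS-style field names and attaches asset tags. The raw field names (`epoch`, `hostname`, `image`, `cmdline`, `parent`) and the `ASSET_OWNERS` lookup are hypothetical stand-ins for whatever your sensors and CMDB actually provide.

```python
from datetime import datetime, timezone

# Hypothetical asset inventory used for enrichment; replace with a real CMDB lookup.
ASSET_OWNERS = {"ws-042": {"owner": "alice", "criticality": "high"}}


def normalize_process_event(raw: dict) -> dict:
    """Map a vendor-specific process event onto ECS-style field names."""
    # Store timestamps as UTC ISO 8601 so events from different sources line up.
    ts = datetime.fromtimestamp(raw["epoch"], tz=timezone.utc)
    host = raw["hostname"].lower()
    event = {
        "@timestamp": ts.isoformat(),
        "host.name": host,
        "process.executable": raw["image"],
        "process.command_line": raw["cmdline"],
        "process.parent.name": raw.get("parent"),
    }
    # Enrichment: attach owner and criticality tags, defaulting to "unknown".
    asset = ASSET_OWNERS.get(host, {})
    event["labels.owner"] = asset.get("owner", "unknown")
    event["labels.criticality"] = asset.get("criticality", "unknown")
    return event
```

Keeping enrichment in the same step means every downstream analytic can filter on owner or criticality without a second lookup.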
3) Detection engineering lifecycle
- Hypothesis/TTP (map to ATT&CK sub-technique).
- Data contract (fields required, sources).
- Rule/analytic (KQL/SPL/Sigma/EQL).
- Tests: unit (synthetic logs), replay (pcaps/evt), adversary emulation (Atomic Red Team).
- Quality gates: data freshness, field completeness, cardinality limits, false-positive review.
- Deploy with detections-as-code (Git + CI/CD). Track owner, SLA, MTTD/PPV (precision).
Tip: write detections around behaviors, not hashes. Hashes rot; TTPs persist.
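The data-freshness and field-completeness gates from the lifecycle above could be checked with something like the following. It is a minimal sketch: the 15-minute lag limit, the 95% completeness floor, and the `timestamp` key are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta, timezone


def quality_gate(events, required_fields, now,
                 max_lag=timedelta(minutes=15), min_completeness=0.95):
    """Return (passed, report) for data freshness and field completeness.

    Each event is a dict with a timezone-aware 'timestamp' (assumed key).
    """
    if not events:
        return False, {"reason": "no events"}
    # Freshness: how far behind 'now' is the newest event?
    newest = max(e["timestamp"] for e in events)
    lag = now - newest
    # Completeness: share of events with a non-empty value per required field.
    completeness = {
        f: sum(1 for e in events if e.get(f) not in (None, "")) / len(events)
        for f in required_fields
    }
    passed = lag <= max_lag and all(
        c >= min_completeness for c in completeness.values()
    )
    return passed, {"lag_seconds": lag.total_seconds(), "completeness": completeness}
```

Running a gate like this in CI against a recent sample of each source catches silent field loss before a rule goes blind.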
4) Core behavioral patterns (with ready-to-use analytics)
A) Initial access & execution (Windows)
Suspicious PowerShell (download/execution, AMSI bypass attempts)
DeviceProcessEvents
| where FileName =~ "powershell.exe" or FileName =~ "pwsh.exe"
| where ProcessCommandLine has_any ("IEX","DownloadString","FromBase64String","-enc","AMSI","Add-MpPreference","Bypass")
| extend Parent=InitiatingProcessFileName
| project Timestamp, DeviceName, AccountName, Parent, FileName, ProcessCommandLine
LOLBin abuse
rundll32, mshta, wmic, certutil, bitsadmin launching network connections or scripts.
B) Persistence & privilege escalation
New auto-start extensibility points
DeviceRegistryEvents
| where ActionType in ("RegistryValueSet","RegistryKeyCreated")
| where RegistryKey has_any (
@"\Software\Microsoft\Windows\CurrentVersion\Run",
@"\Services", @"\Policies\System\Shell"
)
Service installs from user-writable paths
DeviceProcessEvents
| where FileName in ("sc.exe","powershell.exe")
| where ProcessCommandLine has "create" and ProcessCommandLine has " binPath="
| where ProcessCommandLine has_any ("\\AppData\\","\\Temp\\",".\\")
C) Credential access & discovery
Suspicious LSASS access
DeviceProcessEvents
| where ProcessCommandLine has "lsass"
or (ProcessIntegrityLevel != "System" and
InitiatingProcessFileName !in ("procexp64.exe") and
ProcessCommandLine has_any ("ReadProcessMemory","MiniDump"))
Kerberoasting prep
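SecurityEvent
| where EventID == 4769  // Kerberos service ticket request, logged on domain controllers
| parse EventData with * 'TicketEncryptionType">' TicketEncryptionType "<" *
| parse EventData with * 'ServiceName">' ServiceName "<" *
| parse EventData with * 'TargetUserName">' TargetUserName "<" *
| where TicketEncryptionType == "0x17" and ServiceName !endswith "$"  // RC4 tickets for non-machine SPNs
| summarize count() by TargetUserName, bin(TimeGenerated, 15m)
| where count_ > 50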
D) Lateral movement
WMI/PSRemoting from workstations
DeviceNetworkEvents
| where InitiatingProcessFileName in~ ("wmic.exe","winrs.exe","powershell.exe")
| where RemotePort in (135, 5985, 5986)  // RPC endpoint mapper (WMI) and WinRM
// Restrict DeviceName to workstations using your own asset tags.
E) Command & control / beaconing
Flow periodicity & low-and-slow
| tstats `summariesonly` count from datamodel=Network_Traffic by _time, All_Traffic.src, All_Traffic.dest, All_Traffic.dest_port span=1m
| stats avg(count) as avg_c, stdev(count) as sd_c by All_Traffic.src, All_Traffic.dest, All_Traffic.dest_port
| eval jitter=sd_c/avg_c
| where jitter < 0.15 AND avg_c < 2
(Flag regular intervals with small jitter + small volume → beacon suspects.)
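The same idea can be expressed on inter-arrival times instead of per-minute counts. The sketch below computes the coefficient of variation (standard deviation divided by mean) of the gaps between connections for one source/destination pair. The 0.15 jitter cut-off mirrors the query above; `min_events` is an illustrative assumption.

```python
from statistics import mean, pstdev


def beacon_score(timestamps, max_jitter=0.15, min_events=10):
    """Score one src/dest pair from its connection timestamps (epoch seconds).

    Returns None when there is too little data, else a dict with the mean
    interval, the jitter (stdev / mean of inter-arrival gaps), and a flag.
    """
    ts = sorted(timestamps)
    if len(ts) < min_events:
        return None
    # Gaps between consecutive connections.
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    avg = mean(deltas)
    if avg == 0:
        return None
    jitter = pstdev(deltas) / avg
    return {"interval": avg, "jitter": jitter, "suspect": jitter < max_jitter}
```

A low jitter only says the traffic is regular; software updaters and health checks are regular too, so pair the score with destination rarity before alerting.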
F) Exfiltration
DNS tunneling heuristic
| eval qlen=len(query)
| stats count, avg(qlen) as avglen, values(rcode) as r, dc(query) as uniq by src_ip
| where avglen > 40 OR uniq > 500
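For offline hunting over exported DNS logs, the same heuristic can be sketched in a few lines of Python. The input shape, an iterable of `(src_ip, query)` pairs, is an assumption; the thresholds are the ones used in the query above.

```python
from collections import defaultdict


def dns_tunnel_suspects(records, max_avg_len=40, max_unique=500):
    """Flag sources whose DNS queries are unusually long or unusually varied.

    records: iterable of (src_ip, query_name) pairs (assumed input shape).
    """
    per_src = defaultdict(list)
    for src_ip, query in records:
        per_src[src_ip].append(query)

    suspects = {}
    for src, queries in per_src.items():
        avg_len = sum(len(q) for q in queries) / len(queries)
        unique = len(set(queries))
        # Long names suggest encoded payloads; many unique names suggest
        # data being chunked across subdomains.
        if avg_len > max_avg_len or unique > max_unique:
            suspects[src] = {"avg_len": avg_len, "unique": unique}
    return suspects
```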
Sudden egress spike to new ASN
// Assumes flow records enriched with ReportBytesSent and ASN columns.
let flows = DeviceNetworkEvents
| summarize bytes=sum(ReportBytesSent) by DeviceId, RemoteIP, ASN, bin(Timestamp, 10m);
flows
| summarize avg_bytes=avg(bytes), sd_bytes=stdev(bytes) by DeviceId  // per-device baseline over the query window (e.g. 6h)
| join kind=inner flows on DeviceId
| extend z = (bytes - avg_bytes) / sd_bytes
| where z > 6 and ASN !in ("YourTrustedCDN","CorpProxy")
G) Ransomware staging
- Rapid file rename/write with high entropy, shadow copy deletion, suspicious backup/defender tampering.
DeviceProcessEvents
| where ProcessCommandLine has_any ("vssadmin delete shadows","wbadmin delete","bcdedit /set recoveryenabled No")
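The "high entropy" signal in the staging pattern above is usually Shannon entropy over file bytes: encrypted or compressed content approaches 8 bits per byte, while plain text sits well below that. A minimal sketch follows; any alert threshold (often somewhere above 7.5) is an assumption you would tune per file type.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    # H = -sum(p * log2(p)) over the observed byte values.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Entropy alone also fires on archives and media files, so combine it with the rename/write rate and extension changes rather than using it as a standalone rule.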
5) Linux & macOS essentials
Linux: new listener by an unusual binary
-- osquery
SELECT p.pid, p.path, l.port, l.address
FROM processes p JOIN listening_ports l ON p.pid=l.pid
WHERE p.path NOT LIKE '/usr/%' AND p.path NOT LIKE '/bin/%';
Linux: privilege escalation surfaces
sudoers edits, setuid bit changes, unprivileged eBPF use, ld.so.preload, cron entries in user-writable dirs.
macOS: persistence
- LaunchAgents/LaunchDaemons from ~/Library/LaunchAgents/ with network reach-outs; unsigned binaries allowed via user click → hunt Gatekeeper bypass traces.
6) Cloud detections that matter
Azure AD risky impossible travel
SigninLogs
| where ResultType == "0"  // successful sign-ins only
| summarize firstSeen=min(TimeGenerated), lastSeen=max(TimeGenerated), countries=make_set(Location) by UserPrincipalName, bin(TimeGenerated, 1h)
| where array_length(countries) > 1  // more than one country inside the same 1h bucket
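Country sets per hour are a coarse signal. If sign-in records carry coordinates, a speed check between consecutive sign-ins is more precise. The sketch below uses the haversine great-circle distance; the 900 km/h ceiling (roughly airliner speed) and the `(epoch_seconds, lat, lon)` tuple shape are assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def impossible_travel(a, b, max_kmh=900.0):
    """a, b: (epoch_seconds, lat, lon) for two sign-ins by the same user."""
    hours = abs(b[0] - a[0]) / 3600
    dist = haversine_km(a[1], a[2], b[1], b[2])
    if hours == 0:
        # Simultaneous sign-ins from two different places.
        return dist > 0
    return dist / hours > max_kmh
```

VPN exits and mobile carriers distort GeoIP, so treat a hit as a triage lead and check device ID and user agent before acting.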
AWS key misuse
- AccessKey used from new ASN + S3 List/Get flood + CloudTrail DeleteTrail attempts → high-risk triage.
-- Athena/CloudTrail Lake (pseudo)
SELECT userIdentity.accessKeyId, COUNT(*) AS c
FROM cloudtrail
WHERE eventSource = 's3.amazonaws.com' AND eventName IN ('GetObject','ListObjects')
AND sourceIPAddress NOT IN (SELECT ip FROM allowlist)
GROUP BY 1
HAVING COUNT(*) > 5000;
GCP service account drift
- New key creation followed by BigQuery export → egress monitor around the key’s first use.
7) Threat hunting workflow (4-hour cycle)
- Choose a seed: a TTP (e.g., DLL sideloading), an anomaly (new JA3), or new intel (domain set).
- State a hypothesis: “We will find unsigned binaries sideloaded by Office spawning rundll32 with network egress.”
- Scoping queries: broad → narrow. Save notebooks (Jupyter + MSTICPy/SQL/SPL).
- Pivot: by parent process, user, host, ASN, signer, hash cluster, JA3 cluster.
- Document leads: promote to detection if repeatable; file hardening/IR tickets if real risk.
- Retro hunt (30–90 days) for newly found IOCs/TTPs.
Add a hunt register: hypothesis, coverage, queries, outcomes, follow-ups.
8) Machine learning that actually helps
- Outlier/anomaly: z-scores/Isolation Forest on per-host command counts, child-process trees, DNS lengths.
- Beacon detection: spectral analysis (FFT) on inter-arrival times.
- Clustering: group command lines (TF-IDF + HDBSCAN) to surface “weird” exec strings.
- Graph features: user–host–process graphs; detect unusual edges.
Guardrails: strict explainability, feedback loops to analysts, and feature drift monitors. Use ML to prioritize and suggest pivots, not to auto-close cases.
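As a small, explainable example of the outlier bullet, the sketch below scores per-host counts with a robust z-score built on the median and the median absolute deviation (MAD). This is a swapped-in variant of the plain z-score mentioned above, chosen because a single extreme host does not inflate the baseline. The 3.5 cut-off is a common rule of thumb, not a tuned value.

```python
from statistics import median


def mad_outliers(values_by_host, threshold=3.5):
    """Return {host: robust_z} for hosts whose value deviates strongly.

    values_by_host: e.g. count of distinct command lines per host per day.
    """
    vals = list(values_by_host.values())
    med = median(vals)
    mad = median(abs(v - med) for v in vals)
    if mad == 0:
        # No spread in the data; a ratio-based score is undefined.
        return {}
    outliers = {}
    for host, v in values_by_host.items():
        # 0.6745 scales MAD to be comparable to a standard deviation
        # under a normal distribution.
        score = 0.6745 * (v - med) / mad
        if abs(score) > threshold:
            outliers[host] = score
    return outliers
```

Because the score is just "distance from the fleet median", an analyst can verify any flagged host by hand, which fits the explainability guardrail.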
9) Detections-as-code: quality & testing
- Repo layout: /detections/<domain>/<technique>/<rule>.yml (Sigma), with test fixtures, sample logs, owners.
- Pre-merge CI: schema lint, data-contract checks, simulated log replays, expected FP rate vs baseline.
- Post-deploy canary: 1–5% of fleet; compare alert precision; auto-rollback if PPV < threshold.
Coverage KPIs
- % ATT&CK TTPs with at least one high-confidence analytic.
- Alert PPV (precision) per rule, MTTD, MTTR, time-to-contain.
- Data completeness (non-null critical fields) and ingest latency.
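Per-rule PPV is simply true positives divided by all triaged alerts for that rule. The sketch below computes it from analyst dispositions and lists rules that fall under a threshold, which could feed a rollback or tuning queue. The `(rule_id, verdict)` input shape and the 0.5 threshold are assumptions.

```python
from collections import Counter


def ppv_per_rule(dispositions):
    """dispositions: iterable of (rule_id, verdict), verdict is 'tp' or 'fp'."""
    tp, total = Counter(), Counter()
    for rule, verdict in dispositions:
        total[rule] += 1
        if verdict == "tp":
            tp[rule] += 1
    # Precision per rule: TP / (TP + FP).
    return {rule: tp[rule] / total[rule] for rule in total}


def rules_below_threshold(ppv, threshold=0.5):
    """Rules whose precision is under the threshold, sorted by name."""
    return sorted(rule for rule, p in ppv.items() if p < threshold)
```

Rules with only a handful of triaged alerts produce noisy PPV values, so a minimum sample size is worth adding before automating any action on the result.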
10) Triage cheatsheet (first 10 minutes)
- Confirm behavior: execution + network + persistence? (Need ≥2 to escalate.)
- Scope blast radius: same user, same signer, same JA3, same ASN.
- Kill-chain phase: access, discovery, C2, actions on objectives → match controls.
- Decide action: isolate host / block token / revoke OAuth consent / disable key / block domain.
- Create feedback: If benign, write suppression rule with rationale and expiry.
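Two of the cheatsheet rules are easy to encode so they are applied consistently: the "at least two behaviors" escalation test, and suppressions that always carry a rationale and an expiry. The sketch below is one possible shape; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


def should_escalate(execution: bool, network: bool, persistence: bool) -> bool:
    """Escalate when at least two of the three behaviors are confirmed."""
    return sum([execution, network, persistence]) >= 2


@dataclass
class Suppression:
    """A suppression that cannot exist without a rationale and an expiry."""
    rule_id: str
    rationale: str
    expires: date

    def active(self, today: date) -> bool:
        # Expired suppressions stop applying, forcing a fresh review.
        return today <= self.expires
```

Forcing an expiry keeps benign-today exceptions from silently hiding a real intrusion months later.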
Appendices
A) Sigma example — Suspicious CertUtil
title: Suspicious CertUtil With URL Download
id: 3b9d2a23-2f68-4c3f-86e4-certexample
status: stable
logsource:
  product: windows
  category: process_creation
detection:
  selection:
    Image|endswith: '\certutil.exe'
    CommandLine|contains:
      - ' -urlcache '
      - ' -split '
      - 'http'
  condition: selection
level: high
tags: [attack.command-and-control, attack.t1105]
B) Zeek hunting cues
- weird.log: excessive trunc/rexmit → C2/bad middleboxes.
- conn.log: periodic orig_pkts=1 resp_pkts=1 pairs.
- dns.log: long labels, high NXDomain ratio.
Final word
Great security teams ship detections and iterate through hunts. Make telemetry trustworthy, codify behaviors, test relentlessly, and measure outcomes. Everything else is noise.
— CyberDudeBivash