Executive summary
Containers and Kubernetes move fast, and so do attackers. Most real incidents come from misconfiguration, weak identity controls, and supply-chain compromise rather than “exotic kernel 0-days”, but container escapes and API-server flaws still happen. This guide maps the attack surface for Docker/containerd/runc and Kubernetes control/worker planes, walks through common exploits, and gives copy-paste hardening, detection, and response you can implement today.
1) How the stack fits together (threat model)
Container runtime path: image → containerd/CRI-O → runc (creates Linux namespaces/cgroups) → kernel.
Kubernetes control plane: etcd (state) ⇄ API Server ⇄ Controller Manager & Scheduler (only the API Server talks to etcd directly); nodes run kubelet + CNI (network) + CSI (storage) + runtime.
Trust boundaries (red zones):
- Container ↔ host kernel (escapes, privilege escalation).
- Kubelet / API Server (authn/authz mistakes, open ports).
- etcd (cluster secrets).
- Admission & Supply chain (malicious images, mutable tags).
- CNI/Network (flat east-west traffic).
- Volumes / hostPath (symlink & subPath tricks).
2) Container vulnerabilities & exploit patterns
2.1 Runtime escapes (runc/containerd)
- runc “/proc/self/exe” & FD leaks (e.g., CVE-2019-5736; later variants like CVE-2024-21626): attacker overwrites or abuses the runc binary/FD during exec to gain host execution.
- containerd CRI bugs (e.g., CVE-2022-23648): crafted image/manifests or mounts leading to unexpected host access.
- Kernel bugs reachable from inside namespaces: use-after-free (UAF) or overlayfs flaws that let a container write to the host.
Exploit flow (typical): run a malicious image → trigger runc/containerd bug during docker exec or pod start → obtain host shell → pivot to node credentials → cluster admin via kubelet API or cloud IAM.
2.2 Capability & privilege abuse
- Containers running as root with CAP_SYS_ADMIN (or --privileged) can mount filesystems, manipulate cgroups, or access /dev to break isolation.
- No seccomp/AppArmor/SELinux → dangerous syscalls (e.g., ptrace, bpf) allowed.
2.3 hostPath & subPath volume attacks
- hostPath mounts expose host directories; combined with symlink races → arbitrary host file write.
- subPath volume handling historically hit symlink/TOCTOU issues (e.g., CVE-2021-25741 class).
- Kubernetes Windows hostProcess containers can reach host services if misused.
2.4 Image & registry supply-chain
- Typosquatted/malicious images on public registries.
- Mutable tags (:latest) pull different bits over time; poisoned base images or Dockerfile FROM chains.
- Embedded secrets in layers (forgotten .npmrc, SSH keys).
- Build system takeover (CI tokens → push trojaned images).
3) Kubernetes vulnerabilities & exploit paths
3.1 API server & aggregated APIs
- Proxy/upgrade request mishandling can forward attacker traffic to backends (e.g., CVE-2018-1002105 class), yielding privilege escalation.
- Over-permissive RBAC (e.g., create pods/secrets in prod) converts any workload compromise into cluster admin.
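As a rough illustration of the RBAC point, the sketch below scans rule entries shaped like the rules list of a Role or ClusterRole and reports the verb/resource pairs the text calls out. The RISKY table is an illustrative assumption, not a complete list, and subresources such as pods/exec are not handled.

```python
from typing import Iterable

# Illustrative (not exhaustive) resource -> verbs that tend to turn a single
# workload compromise into cluster-wide access. This table is an assumption.
RISKY = {
    "pods": {"create"},
    "secrets": {"get", "list", "watch"},
}


def risky_rbac_rules(rules: Iterable[dict]) -> list[str]:
    """Return findings such as "create pods" for over-permissive rules.

    Each rule mirrors an entry of a Role/ClusterRole `rules` list:
    {"apiGroups": [...], "resources": [...], "verbs": [...]}.
    """
    findings = []
    for rule in rules:
        resources = set(rule.get("resources", []))
        verbs = set(rule.get("verbs", []))
        for resource, bad_verbs in RISKY.items():
            # "*" in resources or verbs matches everything.
            if resource in resources or "*" in resources:
                hit = bad_verbs if "*" in verbs else verbs & bad_verbs
                findings.extend(f"{verb} {resource}" for verb in sorted(hit))
    return findings
```

Feeding it the rules of every role bound to a workload's service account gives a quick first pass; an empty result does not mean the role is safe.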
3.2 etcd exposure
- etcd stores Secrets. If reachable without TLS+auth, one GET = full cluster compromise.
3.3 Kubelet issues
- Legacy read-only port (10255) exposure leaks metrics/pods; misconfigured kubelet credentials allow exec/cp into pods.
3.4 CNI / Network
- Flat networks let a compromised pod scan and reach every service.
- DNAT/externalIPs misconfig (e.g., CVE-2020-8554 class) can enable MITM within cluster.
3.5 Admission & policy gaps
- No Pod Security Admission (or other replacement for the legacy PodSecurityPolicy) → pods request privileged, hostPID, hostNetwork, or unsafe capabilities and get them.
- Missing image signature policy lets untrusted images in.
4) Realistic attacker playbooks
Playbook A — Malicious public image → node → cluster
- Pull image company/app:latest (poisoned).
- Container phones home, drops toolkit, enumerates service account token.
- If pod has list/create pods or secrets, attacker spawns privileged pod / steals cluster secrets.
- Uses cloud metadata IAM via kube-node role to escalate in cloud.
Playbook B — Exposed Docker API (2375) or kubelet creds
- Internet scan finds unauth Docker/kubelet; attacker runs docker run --privileged -v /:/host or kubectl exec.
- Writes SSH key to /host/root/.ssh/authorized_keys; persistence achieved.
- Mass-deploy crypto-miners or exfil data.
Playbook C — hostPath/subPath write
- Dev grants hostPath: /var/lib/kubelet/pods to sidecar for debugging.
- Attacker symlinks to /etc/shadow or kubelet cert dir → writes controlled content.
- Replaces node service or steals kubelet cert → cluster admin.
5) Hardening that actually works (copy-paste)
5.1 Container run options (least privilege)
    # pod-security.yaml (snippet)
    # Container-level securityContext
    securityContext:
      runAsNonRoot: true
      runAsUser: 10000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault # or Localhost
    # Avoid host namespaces & devices unless absolutely necessary
    # (pod-level fields, set directly under the pod spec)
    hostNetwork: false
    hostPID: false
    hostIPC: false
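The settings above can also be checked mechanically before a manifest is applied. The following is a minimal sketch, assuming the pod spec has already been parsed into a Python dict (for example from YAML); the function name and messages are illustrative, and init or ephemeral containers are not inspected.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """List ways a pod spec departs from the least-privilege settings above."""
    problems = []
    # Host namespaces are pod-level fields.
    for field in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(field):
            problems.append(f"{field} is enabled")
    pod_sc = pod_spec.get("securityContext", {})
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext", {})
        if sc.get("privileged"):
            problems.append(f"{name}: privileged")
        # runAsNonRoot may be set on the container or inherited from the pod.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            problems.append(f"{name}: runAsNonRoot not set")
        # Treated as enabled unless explicitly set to false.
        if sc.get("allowPrivilegeEscalation", True):
            problems.append(f"{name}: allowPrivilegeEscalation not disabled")
        if not sc.get("readOnlyRootFilesystem", False):
            problems.append(f"{name}: root filesystem is writable")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities not dropped")
        # seccompProfile may also be inherited from the pod.
        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: no seccomp profile")
    return problems
```

A check like this fits in CI as an early warning; it complements admission control rather than replacing it.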
5.2 Block dangerous mounts
- Avoid hostPath. If required, use type: DirectoryOrCreate, read-only, and path-allowlists.
- Do not mount /var/run/docker.sock, kubelet dirs, /proc, or /sys into pods.
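One way to enforce the mount rules in review tooling is sketched below. It assumes a parsed pod spec dict and approximates "kubelet dirs" with /var/lib/kubelet, which is an assumption about the node layout. Paths are compared lexically, so symlinks and ".." segments are not resolved.

```python
from pathlib import PurePosixPath

# Paths named in the text; "/var/lib/kubelet" stands in for "kubelet dirs"
# (assumption about the node layout).
SENSITIVE = ("/var/run/docker.sock", "/var/lib/kubelet", "/proc", "/sys")


def dangerous_host_paths(pod_spec: dict) -> list[str]:
    """Return "volume: path" for hostPath volumes that expose sensitive paths."""
    found = []
    for vol in pod_spec.get("volumes", []):
        host_path = vol.get("hostPath")
        if not host_path:
            continue  # not a hostPath volume
        path = PurePosixPath(host_path.get("path", "/"))
        for sensitive in SENSITIVE:
            s = PurePosixPath(sensitive)
            # Flag the sensitive path itself, anything beneath it, and any
            # ancestor of it (mounting "/" or "/var" exposes it as well).
            if path == s or s in path.parents or path in s.parents:
                found.append(f"{vol.get('name', '<unnamed>')}: {path}")
                break
    return found
```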
5.3 Admission control: Pod Security + policy engines
Pod Security Admission (v1.25+): label namespaces to baseline/restricted:
    metadata:
      labels:
        pod-security.kubernetes.io/enforce: "restricted"
        pod-security.kubernetes.io/audit: "restricted"
        pod-security.kubernetes.io/warn: "restricted"
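To find namespaces that still lack the enforce label, something like the following could be run over Namespace objects exported as dicts. The exempt tuple is a hypothetical default; which namespaces you exempt is a local decision.

```python
ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def namespaces_not_restricted(namespaces: list[dict],
                              exempt: tuple = ("kube-system",)) -> list[str]:
    """Return names of namespaces whose enforce label is not "restricted".

    `exempt` is a hypothetical allow-list of namespaces to skip.
    """
    names = []
    for ns in namespaces:
        meta = ns.get("metadata", {})
        if meta.get("name") in exempt:
            continue
        if meta.get("labels", {}).get(ENFORCE_LABEL) != "restricted":
            names.append(meta.get("name"))
    return names
```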
Kyverno/Gatekeeper examples:
- Deny privileged: true, hostNetwork: true, hostPID: true.
- Require runAsNonRoot, readOnlyRootFilesystem, seccompProfile: RuntimeDefault.
- Enforce image: registry.example.com/* and signature required (Sigstore/cosign).
5.4 Image supply-chain controls
- Build SBOMs (Syft) and scan (Trivy/Grype).
- Sign images: cosign sign --key kms://… <image>; admission webhook rejects unsigned.
- Pin immutable digests:
    image: registry.example.com/api@sha256:3c1f... # not :latest
- Block root in Dockerfiles: USER 10000:10000.
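Digest pinning is easy to verify in a pre-merge check. The sketch below treats an image reference as pinned only if it ends in a full sha256 digest (64 hex characters); the function names are illustrative and the pod spec is assumed to be a parsed dict.

```python
import re

# A pinned reference ends with "@sha256:" followed by 64 lowercase hex chars.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image: str) -> bool:
    """True if the image reference is pinned to a full sha256 digest."""
    return bool(_DIGEST.search(image))


def unpinned_images(pod_spec: dict) -> list[str]:
    """Return image references in a pod spec that rely on a tag or default."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    return [c["image"] for c in containers if not is_pinned(c["image"])]
```

A tag plus digest (name:tag@sha256:...) also passes, since the digest is what the runtime resolves.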
5.5 Node & runtime hardening
- Keep runc/containerd current; enable AppArmor/SELinux; lock kernel with LSMs.
- Enable seccomp default (RuntimeDefault).
- Isolate node IAM roles; disable cloud metadata access from non-system pods (IMDSv2 hop-limit, firewall).
5.6 Network policies (CNI)
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata: { name: default-deny, namespace: prod }
    spec:
      podSelector: {}
      policyTypes: ["Ingress","Egress"]
Then add allow policies per app (db only from app, egress only to needed APIs).
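To confirm that a namespace actually carries a default-deny policy like the one above, a small check over exported NetworkPolicy objects might look like this. It is a sketch that recognizes only the exact shape shown (empty podSelector, both policy types, no rules) and assumes the policies have been parsed into dicts.

```python
def has_default_deny(policies: list[dict], namespace: str) -> bool:
    """True if `namespace` has a policy that selects all pods, covers both
    Ingress and Egress, and allows nothing."""
    for p in policies:
        if p.get("metadata", {}).get("namespace") != namespace:
            continue
        spec = p.get("spec", {})
        selects_all = spec.get("podSelector") == {}
        both_directions = {"Ingress", "Egress"} <= set(spec.get("policyTypes", []))
        no_rules = not spec.get("ingress") and not spec.get("egress")
        if selects_all and both_directions and no_rules:
            return True
    return False
```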
5.7 etcd & control plane
- etcd TLS/mTLS only; isolate on control-plane network; encryption at rest.
- API server: audit logging, admission webhooks mTLS, restrict anonymous auth, rate-limit.
6) Detection & response (ready-to-use)
6.1 Falco rules (container breakout attempts)
    - rule: Write below root
      desc: Container writing to sensitive host paths
      # write_to_known_sensitive_file is assumed to be a macro defined in your ruleset
      condition: write_to_known_sensitive_file
      output: "Write to sensitive file (user=%user.name proc=%proc.name file=%fd.name)"
      priority: CRITICAL
6.2 K8s audit log – privileged pod creation
    {"stage":"ResponseComplete","verb":"create","objectRef":{"resource":"pods"},"requestObject":{"spec":{"containers":[{"securityContext":{"privileged":true}}]}}}
Alert on any such event outside break-glass namespaces.
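The alert can be prototyped as a filter over a JSON-lines audit log before it is ported to a SIEM query. This sketch assumes the audit policy records request bodies for pod creation (otherwise requestObject is absent), and the namespace name break-glass is hypothetical.

```python
import json

BREAK_GLASS = {"break-glass"}  # hypothetical namespace name


def privileged_pod_creations(lines, allowed_namespaces=BREAK_GLASS):
    """Yield (username, namespace) for each audited privileged pod creation
    outside the allowed namespaces. `lines` is an iterable of JSON strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef", {})
        if (event.get("stage") != "ResponseComplete"
                or event.get("verb") != "create"
                or ref.get("resource") != "pods"
                or ref.get("subresource")):  # skip pods/exec, pods/attach, ...
            continue
        if ref.get("namespace") in allowed_namespaces:
            continue
        spec = (event.get("requestObject") or {}).get("spec", {})
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        if any((c.get("securityContext") or {}).get("privileged") for c in containers):
            yield event.get("user", {}).get("username"), ref.get("namespace")
```

Passing an open log file works directly, since a file object iterates line by line.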
6.3 Hunt queries
- Container spawning shell from web svc:
  - Parent: nginx/httpd/java/w3wp → Child: bash/sh/powershell/cmd
- Kubelet exec storms: audit exec count per user/node > baseline.
- Image drift: deployment digest ≠ last approved digest.
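The three hunts above reduce to simple set and count comparisons. The sketch below shows one possible shape for each; the event keys "parent" and "child" and the baseline and digest mappings are hypothetical stand-ins for whatever your sensor and inventory actually export.

```python
from collections import Counter

WEB_PARENTS = {"nginx", "httpd", "java", "w3wp"}
SHELLS = {"bash", "sh", "powershell", "cmd"}


def shells_from_web(process_events):
    """Return events where a web-service process spawned a shell.

    Each event is a dict with hypothetical keys "parent" and "child"
    holding process names.
    """
    return [e for e in process_events
            if e["parent"] in WEB_PARENTS and e["child"] in SHELLS]


def exec_storms(exec_users, baseline):
    """Return {user: count} for users whose exec count exceeds their baseline.

    exec_users: iterable of usernames, one per audited exec call.
    baseline: dict of username -> expected count for the same window;
    users without a baseline are treated as 0, so any exec is flagged.
    """
    counts = Counter(exec_users)
    return {u: n for u, n in counts.items() if n > baseline.get(u, 0)}


def drifted(deployed: dict, approved: dict) -> list[str]:
    """Return workload names whose deployed digest differs from the approved one."""
    return sorted(name for name, digest in deployed.items()
                  if approved.get(name) != digest)
```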
6.4 Incident playbook (short)
- Contain: cordon node or isolate namespace; block egress via NetworkPolicy; revoke service-account tokens.
- Triage: fetch pod describe, container logs, node journalctl, Falco events, kube-audit trail, image digest & SBOM.
- Scope: search for similar pods, suspicious admissions, unsigned images.
- Recover: redeploy from signed images; rotate secrets; re-issue kubelet cert if node compromised.
- Lessons: add/adjust admission policies; write regression tests.
7) Program plan (30–60–90)
Days 1–30 – Baseline
- Enforce Pod Security restricted in all namespaces.
- Default-deny NetworkPolicy + egress allow-lists.
- Block :latest; require cosign signatures for prod.
- Patch runc/containerd; enable seccomp RuntimeDefault.
Days 31–60 – Supply-chain & policy
- SBOMs + image scanning in CI; fail builds on critical vulns.
- Kyverno/Gatekeeper policies (no privileged/host namespaces; require non-root).
- etcd TLS/mTLS; secrets encryption at rest; rotate root certs.
Days 61–90 – Detection & drills
- Deploy Falco/eBPF sensor; wire to SIEM; create dashboards for admissions, exec, privileged attempts.
- Red-team: hostPath abuse, kubelet exec misuse, unsigned image admission.
- Practice node compromise → cluster recovery.
8) Quick checklist
- No privileged pods, no host namespaces, non-root users
- Seccomp/AppArmor/SELinux enforced
- NetworkPolicy default-deny + egress controls
- Signed images, immutable digests, SBOM + scan
- Admission control (Kyverno/Gatekeeper) enforcing baseline
- etcd mTLS + encryption at rest; API audit logs on
- Runtime patched (runc/containerd); kernel up-to-date
- Falco/eBPF + SIEM detections; incident playbooks tested
Closing
Most “container hacks” are preventable: kill privileged pods, enforce policy at admission, sign what you run, and observe what runs. Do that, and even if a new runc/K8s CVE appears, your blast radius is small and your recovery is fast.