Module 4.4: seccomp Profiles
Linux Security | Complexity:
[MEDIUM]| Time: 25-30 min
Prerequisites
Section titled “Prerequisites”Before starting this module:
- Required: Module 2.3: Capabilities & LSMs
- Required: Module 1.1: Kernel & Architecture (system calls)
- Helpful: Understanding of BPF concepts
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After this module, you will be able to:
- Write custom seccomp profiles that allow only the system calls a container needs
- Apply seccomp profiles to Kubernetes pods using the security context
- Debug seccomp violations by tracing blocked syscalls with strace and audit logs
- Evaluate the default Docker/containerd seccomp profile and decide when a custom profile is needed
Why This Module Matters
Section titled “Why This Module Matters”seccomp (Secure Computing Mode) filters system calls at the kernel level. Unlike AppArmor or SELinux which control resource access, seccomp controls which kernel functions a process can call at all.
Stop and think: If AppArmor is already restricting a container from accessing
/etc/shadow, what additional security value does blocking theopensystem call via seccomp provide? Consider the difference between restricting targets versus restricting actions.
Understanding seccomp helps you:
- Reduce attack surface — Block dangerous syscalls entirely
- Harden containers — Default Docker/K8s profiles block ~44 syscalls
- Pass CKS exam — seccomp profiles are directly tested
- Debug “operation not permitted” — When even capabilities aren’t the issue
When a container gets “operation not permitted” for something root should be able to do, seccomp might be blocking the underlying syscall.
Did You Know?
Section titled “Did You Know?”-
Linux has over 300 system calls — seccomp can filter any of them. Most applications use fewer than 100.
-
seccomp-bpf uses Berkeley Packet Filter — The same technology used for network packet filtering is used to filter syscalls. It’s incredibly efficient.
-
Chrome was an early seccomp adopter — Google implemented seccomp in Chrome’s sandbox in 2012 to protect against renderer exploits.
-
A seccomp violation is fatal by default — Unlike AppArmor (returns error), seccomp typically kills the process. This makes debugging harder but security tighter.
How seccomp Works
Section titled “How seccomp Works”System Call Filtering
Section titled “System Call Filtering”┌─────────────────────────────────────────────────────────────────┐│ SECCOMP FILTERING ││ ││ Application ││ │ ││ │ write(fd, buf, size) ││ │ ││ ▼ ││ ┌─────────────────────────────────────────────────────────┐ ││ │ glibc wrapper │ ││ └─────────────────────────────────────────────────────────┘ ││ │ ││ │ syscall(1, ...) // SYS_write = 1 ││ │ ││ ▼ ││ ┌─────────────────────────────────────────────────────────┐ ││ │ SECCOMP BPF FILTER │ ││ │ │ ││ │ if (syscall_nr == SYS_write) → ALLOW │ ││ │ if (syscall_nr == SYS_ptrace) → KILL │ ││ │ if (syscall_nr == SYS_reboot) → ERRNO(EPERM) │ ││ │ │ ││ └─────────────────────────────────────────────────────────┘ ││ │ ││ ▼ ││ Kernel executes syscall (if allowed) │└─────────────────────────────────────────────────────────────────┘Pause and predict: If a seccomp filter evaluates a syscall and encounters multiple matching rules with different actions (e.g., one rule says
SCMP_ACT_ALLOWand another saysSCMP_ACT_KILL), how should the kernel resolve the conflict to prioritize security?
seccomp Actions
Section titled “seccomp Actions”| Action | Effect | Return |
|---|---|---|
SCMP_ACT_ALLOW | Permit syscall | Normal execution |
SCMP_ACT_ERRNO | Deny with error | Error code (e.g., EPERM) |
SCMP_ACT_KILL | Kill thread | SIGSYS |
SCMP_ACT_KILL_PROCESS | Kill process | SIGSYS |
SCMP_ACT_TRAP | Send SIGSYS | Can be handled |
SCMP_ACT_LOG | Allow but log | Normal execution |
SCMP_ACT_TRACE | Notify tracer | For debugging |
seccomp Modes
Section titled “seccomp Modes”Strict Mode
Section titled “Strict Mode”Original seccomp—extremely restrictive:
- Only
read,write,exit,sigreturnallowed - Any other syscall kills the process
- Rarely used directly
Filter Mode (seccomp-bpf)
Section titled “Filter Mode (seccomp-bpf)”Flexible filtering with BPF programs:
- Define custom allow/deny rules
- Check syscall arguments
- What Docker/Kubernetes use
# Check if seccomp is enabledgrep SECCOMP /boot/config-$(uname -r)# CONFIG_SECCOMP=y# CONFIG_SECCOMP_FILTER=y
# Check process seccomp statusgrep Seccomp /proc/$$/status# Seccomp: 0 (0=disabled, 1=strict, 2=filter)Profile Format
Section titled “Profile Format”Docker/Kubernetes JSON Format
Section titled “Docker/Kubernetes JSON Format”{ "defaultAction": "SCMP_ACT_ERRNO", "architectures": [ "SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_AARCH64" ], "syscalls": [ { "names": [ "read", "write", "open", "close", "stat", "fstat", "lseek", "mmap", "mprotect", "munmap" ], "action": "SCMP_ACT_ALLOW" }, { "names": ["ptrace"], "action": "SCMP_ACT_ERRNO", "args": [], "errnoRet": 1 } ]}Profile Structure
Section titled “Profile Structure”| Field | Purpose |
|---|---|
defaultAction | What to do if no rule matches |
architectures | CPU architectures to support |
syscalls | List of rules |
syscalls[].names | System call names |
syscalls[].action | What to do |
syscalls[].args | Argument conditions |
Default Profiles
Section titled “Default Profiles”Docker Default Profile
Section titled “Docker Default Profile”Docker blocks ~44 dangerous syscalls:
# Syscalls blocked by Docker default profile:# - acct - kernel accounting# - add_key - kernel keyring# - bpf - BPF programs# - clock_adjtime - adjust system clock# - clock_settime - set system clock# - clone - with dangerous flags# - create_module - load kernel modules# - delete_module - unload kernel modules# - finit_module - load kernel modules# - get_kernel_syms - kernel symbols# - init_module - load kernel modules# - ioperm - I/O port permissions# - iopl - I/O privilege level# - kcmp - compare kernel objects# - kexec_load - load new kernel# - keyctl - kernel keyring# - lookup_dcookie - kernel profiling# - mount - mount filesystems# - move_pages - move memory pages# - open_by_handle_at - bypass permissions# - perf_event_open - kernel profiling# - pivot_root - change root filesystem# - process_vm_* - access other process memory# - ptrace - debug other processes# - query_module - query kernel modules# - reboot - reboot system# - request_key - kernel keyring# - set_mempolicy - NUMA policy# - setns - enter namespaces# - settimeofday - set system time# - swapoff/swapon - swap management# - sysfs - deprecated# - umount - unmount filesystems# - unshare - create namespaces# - userfaultfd - page fault handlingChecking Container Profile
Section titled “Checking Container Profile”# Dockerdocker inspect <container> | jq '.[0].HostConfig.SecurityOpt'
# See if seccomp is active in containerdocker exec <container> grep Seccomp /proc/1/status# Seccomp: 2 (filter mode)Container seccomp
Section titled “Container seccomp”Docker seccomp
Section titled “Docker seccomp”# Use default profile (automatic)docker run nginx
# Custom profiledocker run --security-opt seccomp=/path/to/profile.json nginx
# Disable seccomp (dangerous!)docker run --security-opt seccomp=unconfined nginxStop and think: Kubernetes provides a
RuntimeDefaultseccomp profile. In a highly locked-down environment, why might relying solely onRuntimeDefaultbe insufficient for a container that only runs a simple static Go web server?
Kubernetes seccomp
Section titled “Kubernetes seccomp”apiVersion: v1kind: Podmetadata: name: secure-podspec: securityContext: seccompProfile: type: RuntimeDefault # Use container runtime's default containers: - name: app image: nginx securityContext: seccompProfile: type: Localhost localhostProfile: profiles/my-profile.jsonProfile Types in Kubernetes
Section titled “Profile Types in Kubernetes”| Type | Description |
|---|---|
RuntimeDefault | Container runtime’s default profile |
Localhost | Custom profile on node |
Unconfined | No seccomp (dangerous) |
Profile Locations
Section titled “Profile Locations”# Kubernetes looks for profiles at:/var/lib/kubelet/seccomp/profiles/
# Example/var/lib/kubelet/seccomp/profiles/my-profile.json
# Reference in pod:seccompProfile: type: Localhost localhostProfile: profiles/my-profile.jsonCreating Custom Profiles
Section titled “Creating Custom Profiles”Audit Mode Profile
Section titled “Audit Mode Profile”Start with logging to see what syscalls your app needs:
{ "defaultAction": "SCMP_ACT_LOG", "syscalls": []}Generating from Audit
Section titled “Generating from Audit”# Run with audit profiledocker run --security-opt seccomp=/path/to/audit-profile.json myapp
# Check audit logsudo dmesg | grep seccomp# Orsudo ausearch -m SECCOMP
# Use tools like strace to see syscallsstrace -f -c myappRestrictive Profile Example
Section titled “Restrictive Profile Example”{ "defaultAction": "SCMP_ACT_ERRNO", "architectures": ["SCMP_ARCH_X86_64"], "syscalls": [ { "names": [ "accept", "accept4", "access", "bind", "brk", "chdir", "chmod", "chown", "close", "connect", "dup", "dup2", "dup3", "epoll_create", "epoll_ctl", "epoll_wait", "execve", "exit", "exit_group", "fcntl", "fstat", "futex", "getcwd", "getdents", "getegid", "geteuid", "getgid", "getpid", "getppid", "getuid", "ioctl", "listen", "lseek", "mmap", "mprotect", "munmap", "nanosleep", "open", "openat", "pipe", "poll", "read", "readlink", "recvfrom", "recvmsg", "rename", "rt_sigaction", "rt_sigprocmask", "sendto", "set_tid_address", "setsockopt", "socket", "stat", "statfs", "unlink", "write", "writev" ], "action": "SCMP_ACT_ALLOW" } ]}Argument Filtering
Section titled “Argument Filtering”{ "syscalls": [ { "names": ["socket"], "action": "SCMP_ACT_ALLOW", "args": [ { "index": 0, "value": 2, "valueTwo": 0, "op": "SCMP_CMP_EQ" } ], "comment": "Only allow AF_INET sockets" } ]}Debugging seccomp
Section titled “Debugging seccomp”Finding Blocked Syscalls
Section titled “Finding Blocked Syscalls”# Check dmesg for seccomp killssudo dmesg | grep -i seccomp
# Example output:# audit: type=1326 audit(...): avc: denied { sys_admin }# for pid=1234 comm="myapp" syscall=157 compat=0# syscall=157 → look up in syscall table
# Look up syscall numberausyscall 157# or check /usr/include/asm/unistd_64.h
# Use strace to see what syscalls app makesstrace -f -o /tmp/strace.log myappgrep -E "^[0-9]" /tmp/strace.log | awk '{print $2}' | sort | uniq -c | sort -rnCommon Issues
Section titled “Common Issues”# Issue: Process killed immediately# Cause: Critical syscall blocked (like mmap, brk)# Fix: Add to allow list
# Issue: "Operation not permitted" but not killed# Cause: SCMP_ACT_ERRNO returning EPERM# Fix: Add syscall or check args filter
# Issue: Works without seccomp, fails with# Debug: Use SCMP_ACT_LOG temporarilyTesting Profiles
Section titled “Testing Profiles”# Test with a simple profilecat > /tmp/test-seccomp.json << 'EOF'{ "defaultAction": "SCMP_ACT_ERRNO", "syscalls": [ { "names": ["read", "write", "exit", "exit_group", "rt_sigreturn"], "action": "SCMP_ACT_ALLOW" } ]}EOF
# Testdocker run --rm --security-opt seccomp=/tmp/test-seccomp.json alpine echo "hello"# Should fail (echo needs more syscalls)Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Disabling seccomp | Dangerous syscalls allowed | Use RuntimeDefault at minimum |
| Profile too restrictive | App crashes | Use audit mode first |
| Missing architecture | Fails on ARM | Include all target archs |
| Not testing profile | Production failures | Test thoroughly in staging |
| Using unconfined | No syscall filtering | Always use a profile |
| KILL vs ERRNO | Hard to debug | Use ERRNO for debugging |
Question 1
Section titled “Question 1”You are deploying a legacy third-party application in a container. During staging, the application unexpectedly crashes immediately upon startup without writing any error logs. You suspect a seccomp violation. If the current profile uses SCMP_ACT_KILL as the default action, why would changing it to SCMP_ACT_ERRNO help diagnose the issue, and what are the security trade-offs of leaving it as SCMP_ACT_ERRNO in production?
Show Answer
Changing the action to SCMP_ACT_ERRNO prevents the kernel from immediately terminating the process with a SIGSYS signal, which is what SCMP_ACT_KILL does. Instead, the kernel blocks the system call and returns an error code (like EPERM) back to the application. This allows the application to handle the error gracefully or log a descriptive error message indicating that a specific operation was not permitted, making debugging significantly easier. However, leaving SCMP_ACT_ERRNO in production reduces security because an attacker exploiting the application can probe the system to map out exactly which system calls are permitted and which are blocked. SCMP_ACT_KILL is strictly safer because it instantly neutralizes compromised threads the moment they attempt an unauthorized action, giving the attacker no feedback.
Question 2
Section titled “Question 2”You have perfectly configured AppArmor to ensure a web server container can only read files in /var/www/html and cannot write anywhere on the filesystem. A colleague argues that applying a seccomp profile is redundant since the file access is already locked down. Based on how system calls function compared to Mandatory Access Control (MAC) systems, why is your colleague incorrect?
Show Answer
Your colleague is incorrect because AppArmor and seccomp operate at different conceptual layers and mitigate different classes of vulnerabilities. AppArmor and SELinux are Mandatory Access Control systems that mediate access to specific resources, such as file paths or network sockets, after the application has already invoked a system call. In contrast, seccomp filters the invocation of the system calls themselves, completely removing kernel surface area. For example, an attacker who finds a remote code execution vulnerability might attempt to execute system calls like ptrace or unshare to exploit kernel bugs or manipulate memory, which do not involve file paths at all. By using seccomp to block system calls the web server never needs, you eliminate the kernel attack surface entirely, providing defense in depth even if the MAC system is bypassed.
Question 3
Section titled “Question 3”A security audit requires you to apply a custom, highly restrictive seccomp profile named api-strict.json to a deployment of API pods. You have placed the JSON file in the correct directory on all Kubernetes worker nodes. When configuring the Pod specification, what exact fields must you define in the securityContext to ensure the pod utilizes this specific file rather than the container runtime’s default?
Show Answer
To apply a custom profile located on the worker nodes, you must configure the seccompProfile field within the pod’s or container’s securityContext to use the Localhost type. Specifically, you need to set type: Localhost and provide the relative path to the profile using the localhostProfile field, such as localhostProfile: profiles/api-strict.json. This tells the Kubelet to look in its default seccomp profile directory (typically /var/lib/kubelet/seccomp/) for the specified file and instruct the container runtime to apply it. If you were to use type: RuntimeDefault, the container would simply inherit the generic, broad profile provided by Docker or containerd, which would fail to meet the strict requirements of your security audit.
Question 4
Section titled “Question 4”An automated security scanner flags your container image because it requires the CAP_SYS_ADMIN capability to perform its legitimate function. To mitigate the risk of a container escape, you decide to write a custom seccomp profile. Which specific system calls should you explicitly ensure are blocked to prevent an attacker from creating new namespaces or mounting the host filesystem, and why?
Show Answer
To prevent namespace manipulation and filesystem mounting—common techniques for container escapes—you must explicitly block the unshare, setns, and mount system calls. The unshare system call allows a process to disassociate parts of its execution context, effectively allowing an attacker to create new, isolated namespaces where they might gain elevated privileges. The setns system call allows a process to enter an existing namespace, which could be abused to break out of the container’s isolation and join the host’s namespaces. Finally, the mount system call must be blocked because an attacker with CAP_SYS_ADMIN could use it to mount the host’s underlying root filesystem into the container, granting them full read and write access to the host node’s files. Blocking these system calls neutralizes the most direct vectors for escaping the container boundary.
Question 5
Section titled “Question 5”You are migrating a complex Java application to a hardened Kubernetes cluster that enforces a strict custom seccomp profile. Upon deployment, the application pod repeatedly enters a CrashLoopBackOff state. You suspect a critical system call is being denied, but the application logs show no errors before terminating. Outline the exact steps you would take at the host kernel level to identify the specific blocked system call.
Show Answer
Because seccomp violations often result in immediate process termination via SIGSYS (if SCMP_ACT_KILL is used), the application will not have a chance to write to its standard error logs. The most reliable way to identify the blocked system call is to inspect the kernel audit logs on the host node where the pod was scheduled. You can do this by SSHing into the worker node and running sudo dmesg | grep -i seccomp or sudo ausearch -m SECCOMP to view the audit trails. The kernel logs will display an audit event indicating that a system call was denied, along with the PID and the numeric system call ID (e.g., syscall=157). Once you have the numeric ID, you can use a tool like ausyscall <number> or reference the kernel header files (like unistd_64.h) to translate the number into the human-readable system call name, which you can then evaluate for inclusion in your custom profile.
Hands-On Exercise
Section titled “Hands-On Exercise”Working with seccomp
Section titled “Working with seccomp”Objective: Understand seccomp profiles and testing.
Environment: Linux with Docker installed
Part 1: Check seccomp Status
Section titled “Part 1: Check seccomp Status”# 1. Kernel supportgrep SECCOMP /boot/config-$(uname -r)
# 2. Your processgrep Seccomp /proc/$$/status
# 3. Container processdocker run --rm alpine grep Seccomp /proc/1/statusPart 2: Test Default Profile
Section titled “Part 2: Test Default Profile”# 1. Run with default seccompdocker run --rm alpine sh -c 'grep Seccomp /proc/1/status'# Should show: Seccomp: 2
# 2. Try to run unconfineddocker run --rm --security-opt seccomp=unconfined alpine grep Seccomp /proc/1/status# Should show: Seccomp: 0
# 3. Test a blocked syscall# reboot should fail silentlydocker run --rm alpine reboot# No effect (blocked by default profile)Part 3: Create Custom Profile
Section titled “Part 3: Create Custom Profile”# 1. Create a restrictive profilecat > /tmp/restrictive.json << 'EOF'{ "defaultAction": "SCMP_ACT_ERRNO", "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"], "syscalls": [ { "names": [ "read", "write", "open", "openat", "close", "stat", "fstat", "lstat", "mmap", "mprotect", "munmap", "brk", "exit_group", "arch_prctl", "access", "getuid", "getgid", "geteuid", "getegid", "execve", "fcntl", "dup2", "pipe", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn", "clone", "wait4", "futex", "set_tid_address", "set_robust_list" ], "action": "SCMP_ACT_ALLOW" } ]}EOF
# 2. Test ls (should work)docker run --rm --security-opt seccomp=/tmp/restrictive.json alpine ls /
# 3. Test something more complex (might fail)docker run --rm --security-opt seccomp=/tmp/restrictive.json alpine ping -c 1 127.0.0.1# May fail due to missing syscallsPart 4: Audit Mode
Section titled “Part 4: Audit Mode”# 1. Create audit profilecat > /tmp/audit.json << 'EOF'{ "defaultAction": "SCMP_ACT_LOG", "syscalls": []}EOF
# 2. Run with auditdocker run --rm --security-opt seccomp=/tmp/audit.json alpine ls /
# 3. Check what was loggedsudo dmesg | grep seccomp | tail -20Part 5: Debug Blocked Syscalls
Section titled “Part 5: Debug Blocked Syscalls”# 1. Create overly restrictive profilecat > /tmp/broken.json << 'EOF'{ "defaultAction": "SCMP_ACT_ERRNO", "syscalls": [ { "names": ["exit_group"], "action": "SCMP_ACT_ALLOW" } ]}EOF
# 2. Try to run (will fail)docker run --rm --security-opt seccomp=/tmp/broken.json alpine echo hello 2>&1# Error or no output
# 3. Check what failedsudo dmesg | grep seccomp | tail -10Success Criteria
Section titled “Success Criteria”- Verified seccomp is enabled
- Compared default vs unconfined profiles
- Created and tested custom profile
- Used audit mode to see syscalls
- Debugged a blocked syscall
Key Takeaways
Section titled “Key Takeaways”-
Syscall-level filtering — More fundamental than file/network controls
-
Default profiles block 40+ syscalls — Docker/K8s have sane defaults
-
KILL vs ERRNO — Kill is secure, ERRNO is debuggable
-
Audit mode for development — Log all syscalls to build profile
-
Defense in depth — Use with AppArmor/SELinux, not instead of
What’s Next?
Section titled “What’s Next?”Congratulations! You’ve completed the Security/Hardening section. You now understand:
- Kernel hardening via sysctl
- AppArmor profiles and MAC
- SELinux contexts and type enforcement
- seccomp system call filtering
Next, continue to Operations/Performance to learn about system performance analysis.