Skip to content

Module 4.4: seccomp Profiles

Hands-On Lab Available
Ubuntu advanced 30 min
Launch Lab ↗

Opens in Killercoda in a new tab

Linux Security | Complexity: [MEDIUM] | Time: 25-30 min

Before starting this module:


After this module, you will be able to:

  • Write custom seccomp profiles that allow only the system calls a container needs
  • Apply seccomp profiles to Kubernetes pods using the security context
  • Debug seccomp violations by tracing blocked syscalls with strace and audit logs
  • Evaluate the default Docker/containerd seccomp profile and decide when a custom profile is needed

seccomp (Secure Computing Mode) filters system calls at the kernel level. Unlike AppArmor or SELinux which control resource access, seccomp controls which kernel functions a process can call at all.

Stop and think: If AppArmor is already restricting a container from accessing /etc/shadow, what additional security value does blocking the open system call via seccomp provide? Consider the difference between restricting targets versus restricting actions.

Understanding seccomp helps you:

  • Reduce attack surface — Block dangerous syscalls entirely
  • Harden containers — Default Docker/K8s profiles block ~44 syscalls
  • Pass CKS exam — seccomp profiles are directly tested
  • Debug “operation not permitted” — When even capabilities aren’t the issue

When a container gets “operation not permitted” for something root should be able to do, seccomp might be blocking the underlying syscall.


  • Linux has over 300 system calls — seccomp can filter any of them. Most applications use fewer than 100.

  • seccomp-bpf uses Berkeley Packet Filter — The same technology used for network packet filtering is used to filter syscalls. It’s incredibly efficient.

  • Chrome was an early seccomp adopter — Google implemented seccomp in Chrome’s sandbox in 2012 to protect against renderer exploits.

  • A seccomp violation is fatal by default — Unlike AppArmor (returns error), seccomp typically kills the process. This makes debugging harder but security tighter.


┌─────────────────────────────────────────────────────────────────┐
│ SECCOMP FILTERING │
│ │
│ Application │
│ │ │
│ │ write(fd, buf, size) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ glibc wrapper │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ │ syscall(1, ...) // SYS_write = 1 │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SECCOMP BPF FILTER │ │
│ │ │ │
│ │ if (syscall_nr == SYS_write) → ALLOW │ │
│ │ if (syscall_nr == SYS_ptrace) → KILL │ │
│ │ if (syscall_nr == SYS_reboot) → ERRNO(EPERM) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Kernel executes syscall (if allowed) │
└─────────────────────────────────────────────────────────────────┘

Pause and predict: If a seccomp filter evaluates a syscall and encounters multiple matching rules with different actions (e.g., one rule says SCMP_ACT_ALLOW and another says SCMP_ACT_KILL), how should the kernel resolve the conflict to prioritize security?

ActionEffectReturn
SCMP_ACT_ALLOWPermit syscallNormal execution
SCMP_ACT_ERRNODeny with errorError code (e.g., EPERM)
SCMP_ACT_KILLKill threadSIGSYS
SCMP_ACT_KILL_PROCESSKill processSIGSYS
SCMP_ACT_TRAPSend SIGSYSCan be handled
SCMP_ACT_LOGAllow but logNormal execution
SCMP_ACT_TRACENotify tracerFor debugging

Original seccomp—extremely restrictive:

  • Only read, write, exit, sigreturn allowed
  • Any other syscall kills the process
  • Rarely used directly

Flexible filtering with BPF programs:

  • Define custom allow/deny rules
  • Check syscall arguments
  • What Docker/Kubernetes use
Terminal window
# Check if seccomp is enabled
grep SECCOMP /boot/config-$(uname -r)
# CONFIG_SECCOMP=y
# CONFIG_SECCOMP_FILTER=y
# Check process seccomp status
grep Seccomp /proc/$$/status
# Seccomp: 0 (0=disabled, 1=strict, 2=filter)

{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_AARCH64"
],
"syscalls": [
{
"names": [
"read",
"write",
"open",
"close",
"stat",
"fstat",
"lseek",
"mmap",
"mprotect",
"munmap"
],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["ptrace"],
"action": "SCMP_ACT_ERRNO",
"args": [],
"errnoRet": 1
}
]
}
FieldPurpose
defaultActionWhat to do if no rule matches
architecturesCPU architectures to support
syscallsList of rules
syscalls[].namesSystem call names
syscalls[].actionWhat to do
syscalls[].argsArgument conditions

Docker blocks ~44 dangerous syscalls:

Terminal window
# Syscalls blocked by Docker default profile:
# - acct - kernel accounting
# - add_key - kernel keyring
# - bpf - BPF programs
# - clock_adjtime - adjust system clock
# - clock_settime - set system clock
# - clone - with dangerous flags
# - create_module - load kernel modules
# - delete_module - unload kernel modules
# - finit_module - load kernel modules
# - get_kernel_syms - kernel symbols
# - init_module - load kernel modules
# - ioperm - I/O port permissions
# - iopl - I/O privilege level
# - kcmp - compare kernel objects
# - kexec_load - load new kernel
# - keyctl - kernel keyring
# - lookup_dcookie - kernel profiling
# - mount - mount filesystems
# - move_pages - move memory pages
# - open_by_handle_at - bypass permissions
# - perf_event_open - kernel profiling
# - pivot_root - change root filesystem
# - process_vm_* - access other process memory
# - ptrace - debug other processes
# - query_module - query kernel modules
# - reboot - reboot system
# - request_key - kernel keyring
# - set_mempolicy - NUMA policy
# - setns - enter namespaces
# - settimeofday - set system time
# - swapoff/swapon - swap management
# - sysfs - deprecated
# - umount - unmount filesystems
# - unshare - create namespaces
# - userfaultfd - page fault handling
Terminal window
# Docker
docker inspect <container> | jq '.[0].HostConfig.SecurityOpt'
# See if seccomp is active in container
docker exec <container> grep Seccomp /proc/1/status
# Seccomp: 2 (filter mode)

Terminal window
# Use default profile (automatic)
docker run nginx
# Custom profile
docker run --security-opt seccomp=/path/to/profile.json nginx
# Disable seccomp (dangerous!)
docker run --security-opt seccomp=unconfined nginx

Stop and think: Kubernetes provides a RuntimeDefault seccomp profile. In a highly locked-down environment, why might relying solely on RuntimeDefault be insufficient for a container that only runs a simple static Go web server?

apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
seccompProfile:
type: RuntimeDefault # Use container runtime's default
containers:
- name: app
image: nginx
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/my-profile.json
TypeDescription
RuntimeDefaultContainer runtime’s default profile
LocalhostCustom profile on node
UnconfinedNo seccomp (dangerous)
Terminal window
# Kubernetes looks for profiles at:
/var/lib/kubelet/seccomp/profiles/
# Example
/var/lib/kubelet/seccomp/profiles/my-profile.json
# Reference in pod:
seccompProfile:
type: Localhost
localhostProfile: profiles/my-profile.json

Start with logging to see what syscalls your app needs:

{
"defaultAction": "SCMP_ACT_LOG",
"syscalls": []
}
Terminal window
# Run with audit profile
docker run --security-opt seccomp=/path/to/audit-profile.json myapp
# Check audit log
sudo dmesg | grep seccomp
# Or
sudo ausearch -m SECCOMP
# Use tools like strace to see syscalls
strace -f -c myapp
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64"],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "bind", "brk",
"chdir", "chmod", "chown", "close", "connect",
"dup", "dup2", "dup3", "epoll_create", "epoll_ctl",
"epoll_wait", "execve", "exit", "exit_group",
"fcntl", "fstat", "futex", "getcwd", "getdents",
"getegid", "geteuid", "getgid", "getpid", "getppid",
"getuid", "ioctl", "listen", "lseek", "mmap",
"mprotect", "munmap", "nanosleep", "open", "openat",
"pipe", "poll", "read", "readlink", "recvfrom",
"recvmsg", "rename", "rt_sigaction", "rt_sigprocmask",
"sendto", "set_tid_address", "setsockopt", "socket",
"stat", "statfs", "unlink", "write", "writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
{
"syscalls": [
{
"names": ["socket"],
"action": "SCMP_ACT_ALLOW",
"args": [
{
"index": 0,
"value": 2,
"valueTwo": 0,
"op": "SCMP_CMP_EQ"
}
],
"comment": "Only allow AF_INET sockets"
}
]
}

Terminal window
# Check dmesg for seccomp kills
sudo dmesg | grep -i seccomp
# Example output:
# audit: type=1326 audit(...): avc: denied { sys_admin }
# for pid=1234 comm="myapp" syscall=157 compat=0
# syscall=157 → look up in syscall table
# Look up syscall number
ausyscall 157
# or check /usr/include/asm/unistd_64.h
# Use strace to see what syscalls app makes
strace -f -o /tmp/strace.log myapp
grep -E "^[0-9]" /tmp/strace.log | awk '{print $2}' | sort | uniq -c | sort -rn
Terminal window
# Issue: Process killed immediately
# Cause: Critical syscall blocked (like mmap, brk)
# Fix: Add to allow list
# Issue: "Operation not permitted" but not killed
# Cause: SCMP_ACT_ERRNO returning EPERM
# Fix: Add syscall or check args filter
# Issue: Works without seccomp, fails with
# Debug: Use SCMP_ACT_LOG temporarily
Terminal window
# Test with a simple profile
cat > /tmp/test-seccomp.json << 'EOF'
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "exit", "exit_group", "rt_sigreturn"],
"action": "SCMP_ACT_ALLOW"
}
]
}
EOF
# Test
docker run --rm --security-opt seccomp=/tmp/test-seccomp.json alpine echo "hello"
# Should fail (echo needs more syscalls)

MistakeProblemSolution
Disabling seccompDangerous syscalls allowedUse RuntimeDefault at minimum
Profile too restrictiveApp crashesUse audit mode first
Missing architectureFails on ARMInclude all target archs
Not testing profileProduction failuresTest thoroughly in staging
Using unconfinedNo syscall filteringAlways use a profile
KILL vs ERRNOHard to debugUse ERRNO for debugging

You are deploying a legacy third-party application in a container. During staging, the application unexpectedly crashes immediately upon startup without writing any error logs. You suspect a seccomp violation. If the current profile uses SCMP_ACT_KILL as the default action, why would changing it to SCMP_ACT_ERRNO help diagnose the issue, and what are the security trade-offs of leaving it as SCMP_ACT_ERRNO in production?

Show Answer

Changing the action to SCMP_ACT_ERRNO prevents the kernel from immediately terminating the process with a SIGSYS signal, which is what SCMP_ACT_KILL does. Instead, the kernel blocks the system call and returns an error code (like EPERM) back to the application. This allows the application to handle the error gracefully or log a descriptive error message indicating that a specific operation was not permitted, making debugging significantly easier. However, leaving SCMP_ACT_ERRNO in production reduces security because an attacker exploiting the application can probe the system to map out exactly which system calls are permitted and which are blocked. SCMP_ACT_KILL is strictly safer because it instantly neutralizes compromised threads the moment they attempt an unauthorized action, giving the attacker no feedback.

You have perfectly configured AppArmor to ensure a web server container can only read files in /var/www/html and cannot write anywhere on the filesystem. A colleague argues that applying a seccomp profile is redundant since the file access is already locked down. Based on how system calls function compared to Mandatory Access Control (MAC) systems, why is your colleague incorrect?

Show Answer

Your colleague is incorrect because AppArmor and seccomp operate at different conceptual layers and mitigate different classes of vulnerabilities. AppArmor and SELinux are Mandatory Access Control systems that mediate access to specific resources, such as file paths or network sockets, after the application has already invoked a system call. In contrast, seccomp filters the invocation of the system calls themselves, completely removing kernel surface area. For example, an attacker who finds a remote code execution vulnerability might attempt to execute system calls like ptrace or unshare to exploit kernel bugs or manipulate memory, which do not involve file paths at all. By using seccomp to block system calls the web server never needs, you eliminate the kernel attack surface entirely, providing defense in depth even if the MAC system is bypassed.

A security audit requires you to apply a custom, highly restrictive seccomp profile named api-strict.json to a deployment of API pods. You have placed the JSON file in the correct directory on all Kubernetes worker nodes. When configuring the Pod specification, what exact fields must you define in the securityContext to ensure the pod utilizes this specific file rather than the container runtime’s default?

Show Answer

To apply a custom profile located on the worker nodes, you must configure the seccompProfile field within the pod’s or container’s securityContext to use the Localhost type. Specifically, you need to set type: Localhost and provide the relative path to the profile using the localhostProfile field, such as localhostProfile: profiles/api-strict.json. This tells the Kubelet to look in its default seccomp profile directory (typically /var/lib/kubelet/seccomp/) for the specified file and instruct the container runtime to apply it. If you were to use type: RuntimeDefault, the container would simply inherit the generic, broad profile provided by Docker or containerd, which would fail to meet the strict requirements of your security audit.

An automated security scanner flags your container image because it requires the CAP_SYS_ADMIN capability to perform its legitimate function. To mitigate the risk of a container escape, you decide to write a custom seccomp profile. Which specific system calls should you explicitly ensure are blocked to prevent an attacker from creating new namespaces or mounting the host filesystem, and why?

Show Answer

To prevent namespace manipulation and filesystem mounting—common techniques for container escapes—you must explicitly block the unshare, setns, and mount system calls. The unshare system call allows a process to disassociate parts of its execution context, effectively allowing an attacker to create new, isolated namespaces where they might gain elevated privileges. The setns system call allows a process to enter an existing namespace, which could be abused to break out of the container’s isolation and join the host’s namespaces. Finally, the mount system call must be blocked because an attacker with CAP_SYS_ADMIN could use it to mount the host’s underlying root filesystem into the container, granting them full read and write access to the host node’s files. Blocking these system calls neutralizes the most direct vectors for escaping the container boundary.

You are migrating a complex Java application to a hardened Kubernetes cluster that enforces a strict custom seccomp profile. Upon deployment, the application pod repeatedly enters a CrashLoopBackOff state. You suspect a critical system call is being denied, but the application logs show no errors before terminating. Outline the exact steps you would take at the host kernel level to identify the specific blocked system call.

Show Answer

Because seccomp violations often result in immediate process termination via SIGSYS (if SCMP_ACT_KILL is used), the application will not have a chance to write to its standard error logs. The most reliable way to identify the blocked system call is to inspect the kernel audit logs on the host node where the pod was scheduled. You can do this by SSHing into the worker node and running sudo dmesg | grep -i seccomp or sudo ausearch -m SECCOMP to view the audit trails. The kernel logs will display an audit event indicating that a system call was denied, along with the PID and the numeric system call ID (e.g., syscall=157). Once you have the numeric ID, you can use a tool like ausyscall <number> or reference the kernel header files (like unistd_64.h) to translate the number into the human-readable system call name, which you can then evaluate for inclusion in your custom profile.


Objective: Understand seccomp profiles and testing.

Environment: Linux with Docker installed

Terminal window
# 1. Kernel support
grep SECCOMP /boot/config-$(uname -r)
# 2. Your process
grep Seccomp /proc/$$/status
# 3. Container process
docker run --rm alpine grep Seccomp /proc/1/status
Terminal window
# 1. Run with default seccomp
docker run --rm alpine sh -c 'grep Seccomp /proc/1/status'
# Should show: Seccomp: 2
# 2. Try to run unconfined
docker run --rm --security-opt seccomp=unconfined alpine grep Seccomp /proc/1/status
# Should show: Seccomp: 0
# 3. Test a blocked syscall
# reboot should fail silently
docker run --rm alpine reboot
# No effect (blocked by default profile)
Terminal window
# 1. Create a restrictive profile
cat > /tmp/restrictive.json << 'EOF'
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
"syscalls": [
{
"names": [
"read", "write", "open", "openat", "close",
"stat", "fstat", "lstat", "mmap", "mprotect",
"munmap", "brk", "exit_group", "arch_prctl",
"access", "getuid", "getgid", "geteuid", "getegid",
"execve", "fcntl", "dup2", "pipe", "rt_sigaction",
"rt_sigprocmask", "rt_sigreturn", "clone", "wait4",
"futex", "set_tid_address", "set_robust_list"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
EOF
# 2. Test ls (should work)
docker run --rm --security-opt seccomp=/tmp/restrictive.json alpine ls /
# 3. Test something more complex (might fail)
docker run --rm --security-opt seccomp=/tmp/restrictive.json alpine ping -c 1 127.0.0.1
# May fail due to missing syscalls
Terminal window
# 1. Create audit profile
cat > /tmp/audit.json << 'EOF'
{
"defaultAction": "SCMP_ACT_LOG",
"syscalls": []
}
EOF
# 2. Run with audit
docker run --rm --security-opt seccomp=/tmp/audit.json alpine ls /
# 3. Check what was logged
sudo dmesg | grep seccomp | tail -20
Terminal window
# 1. Create overly restrictive profile
cat > /tmp/broken.json << 'EOF'
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["exit_group"],
"action": "SCMP_ACT_ALLOW"
}
]
}
EOF
# 2. Try to run (will fail)
docker run --rm --security-opt seccomp=/tmp/broken.json alpine echo hello 2>&1
# Error or no output
# 3. Check what failed
sudo dmesg | grep seccomp | tail -10
  • Verified seccomp is enabled
  • Compared default vs unconfined profiles
  • Created and tested custom profile
  • Used audit mode to see syscalls
  • Debugged a blocked syscall

  1. Syscall-level filtering — More fundamental than file/network controls

  2. Default profiles block 40+ syscalls — Docker/K8s have sane defaults

  3. KILL vs ERRNO — Kill is secure, ERRNO is debuggable

  4. Audit mode for development — Log all syscalls to build profile

  5. Defense in depth — Use with AppArmor/SELinux, not instead of


Congratulations! You’ve completed the Security/Hardening section. You now understand:

  • Kernel hardening via sysctl
  • AppArmor profiles and MAC
  • SELinux contexts and type enforcement
  • seccomp system call filtering

Next, continue to Operations/Performance to learn about system performance analysis.