Module 2.3: Capabilities & Linux Security Modules
Linux Foundations | Complexity: [MEDIUM] | Time: 25-30 min
Prerequisites
Before starting this module:
- Required: Module 1.4: Users & Permissions
- Required: Module 2.1: Linux Namespaces
- Helpful: Understanding of basic security concepts
What You’ll Be Able to Do
After this module, you will be able to:
- Explain Linux capabilities as fine-grained alternatives to running as root
- Audit a container’s capabilities and identify which are unnecessary
- Configure AppArmor and seccomp profiles to restrict a container's file access and system calls
- Evaluate the security trade-offs between dropping capabilities vs using LSM profiles
Why This Module Matters
Traditional Unix had a simple security model: root can do everything, everyone else is restricted. This all-or-nothing approach is dangerous: why give a process full root power when it only needs to bind to port 80?
Capabilities break root’s superpowers into granular pieces. Linux Security Modules (LSMs) add mandatory access controls beyond discretionary permissions.
Understanding these helps you:
- Secure containers: drop unnecessary capabilities
- Debug permission errors: why can't my container do X?
- Implement least privilege: give only the access needed
- Understand Kubernetes security: SecurityContext, Pod Security Admission (the successor to PodSecurityPolicy, which was removed in Kubernetes 1.25), seccomp
When your container fails with “operation not permitted” despite running as root, capabilities are usually the answer.
Did You Know?
- There are over 40 distinct capabilities in modern Linux. CAP_NET_ADMIN alone controls dozens of networking operations, from configuring interfaces to modifying routing tables.
- Docker drops many capabilities by default. A container running as "root" is missing CAP_SYS_ADMIN, CAP_NET_ADMIN, and others. This is why root in a container isn't the same as root on the host.
- The `ping` command used to require setuid root. Now it can run with just the CAP_NET_RAW capability instead. This is much safer because a compromised ping can only open raw sockets, not do everything root can.
- seccomp can filter any of the 300+ Linux system calls. Docker applies a default seccomp profile that blocks dangerous syscalls like `kexec_load` (which loads a new kernel) and `reboot`. Kubernetes applies the runtime's default profile only when a Pod requests `RuntimeDefault` or the kubelet is configured to use it by default.
Linux Capabilities
The Problem with Root

Traditional Unix:

```
┌─────────────────────────────────────────────────────────────────┐
│                     TRADITIONAL MODEL                           │
│                                                                 │
│  Root (UID 0)                   Normal User (UID 1000)          │
│  ┌──────────────────────┐       ┌──────────────────────┐        │
│  │ ALL PRIVILEGES       │       │ Limited privileges   │        │
│  │ - Bind any port      │       │ - Can't bind < 1024  │        │
│  │ - Read any file      │       │ - Own files only     │        │
│  │ - Kill any process   │       │ - Own processes only │        │
│  │ - Load kernel modules│       │ - No kernel access   │        │
│  │ - Reboot system      │       │ - Can't reboot       │        │
│  │ - Change any config  │       │ - Limited config     │        │
│  └──────────────────────┘       └──────────────────────┘        │
│                                                                 │
│  ALL or NOTHING - dangerous for services that need only ONE     │
│  capability from the "ALL" bucket                               │
└─────────────────────────────────────────────────────────────────┘
```

Stop and think: If a web server process is compromised, why might it be significantly worse if it was running as a traditional root user compared to running as a non-root user that has only been granted the `CAP_NET_BIND_SERVICE` capability? What specific actions could an attacker take in the first scenario that are blocked in the second?
Capabilities: Granular Privileges
```
┌─────────────────────────────────────────────────────────────────┐
│                    CAPABILITIES MODEL                           │
│                                                                 │
│  Root's powers split into ~40 capabilities:                     │
│                                                                 │
│  ┌───────────────┬───────────────┬───────────────┐              │
│  │ CAP_NET_BIND_ │ CAP_CHOWN     │ CAP_KILL      │              │
│  │ SERVICE       │               │               │              │
│  │ Bind < 1024   │ Change owner  │ Kill any proc │              │
│  └───────────────┴───────────────┴───────────────┘              │
│  ┌───────────────┬───────────────┬───────────────┐              │
│  │ CAP_SYS_ADMIN │ CAP_NET_ADMIN │ CAP_SYS_PTRACE│              │
│  │ Many things!  │ Network cfg   │ Trace procs   │              │
│  └───────────────┴───────────────┴───────────────┘              │
│  ┌───────────────┬───────────────┬───────────────┐              │
│  │ CAP_NET_RAW   │ CAP_SETUID    │ CAP_SETGID    │              │
│  │ Raw sockets   │ Set UID       │ Set GID       │              │
│  └───────────────┴───────────────┴───────────────┘              │
│  ...                                                            │
└─────────────────────────────────────────────────────────────────┘
```

Common Capabilities

| Capability | What It Allows | Used By |
|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Web servers |
| CAP_NET_RAW | Raw sockets, ping | ping, network tools |
| CAP_NET_ADMIN | Network configuration | Network management |
| CAP_SYS_ADMIN | Many admin operations | Container escapes! |
| CAP_CHOWN | Change file ownership | File management |
| CAP_SETUID/SETGID | Change process UID/GID | su, sudo |
| CAP_KILL | Send signals to any process | Process management |
| CAP_SYS_PTRACE | Trace/debug processes | Debuggers, strace |
| CAP_MKNOD | Create device nodes | Device setup |
| CAP_DAC_OVERRIDE | Bypass file permissions | Full file access |
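
Each capability in the table corresponds to one bit in the hexadecimal masks the kernel reports in `/proc/<pid>/status`. As a rough sketch of that mapping, a mask can be decoded in plain bash without `capsh`. The function name `decode_caps` is made up for illustration, and the name table assumes the bit numbering from `linux/capability.h` as of Linux 5.9 (41 capabilities).

```bash
#!/usr/bin/env bash
# Capability names indexed by bit number (assumed from linux/capability.h).
CAP_NAMES=(
  CAP_CHOWN CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_FOWNER CAP_FSETID
  CAP_KILL CAP_SETGID CAP_SETUID CAP_SETPCAP CAP_LINUX_IMMUTABLE
  CAP_NET_BIND_SERVICE CAP_NET_BROADCAST CAP_NET_ADMIN CAP_NET_RAW
  CAP_IPC_LOCK CAP_IPC_OWNER CAP_SYS_MODULE CAP_SYS_RAWIO CAP_SYS_CHROOT
  CAP_SYS_PTRACE CAP_SYS_PACCT CAP_SYS_ADMIN CAP_SYS_BOOT CAP_SYS_NICE
  CAP_SYS_RESOURCE CAP_SYS_TIME CAP_SYS_TTY_CONFIG CAP_MKNOD CAP_LEASE
  CAP_AUDIT_WRITE CAP_AUDIT_CONTROL CAP_SETFCAP CAP_MAC_OVERRIDE
  CAP_MAC_ADMIN CAP_SYSLOG CAP_WAKE_ALARM CAP_BLOCK_SUSPEND
  CAP_AUDIT_READ CAP_PERFMON CAP_BPF CAP_CHECKPOINT_RESTORE
)

# Print one capability name per line for every bit set in the hex mask $1.
decode_caps() {
  local mask=$(( 16#$1 )) bit
  for bit in "${!CAP_NAMES[@]}"; do
    if (( (mask >> bit) & 1 )); then
      echo "${CAP_NAMES[bit]}"
    fi
  done
}

# A mask with only bit 10 set decodes to a single capability.
decode_caps 0000000000000400
```

Calling `decode_caps 00000000a80425fb` lists fourteen names; that mask is the bitwise OR of the Docker default capabilities listed later in this module.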
Viewing Capabilities
```bash
# View capabilities of current process
cat /proc/$$/status | grep Cap

# Decode capability hex values
capsh --decode=0000003fffffffff

# View capabilities of a file
getcap /usr/bin/ping

# View capabilities of running process
getpcaps $$

# Print the capability state of the current shell
capsh --print
```

Capability Sets

Each process has multiple capability sets:
| Set | Purpose |
|---|---|
| Permitted | Maximum capabilities available |
| Effective | Currently active capabilities |
| Inheritable | Can be passed to children |
| Bounding | Limits what can be gained |
| Ambient | Preserved across execve() |
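
The relationship between the sets can be checked mechanically: the kernel keeps the effective set within the permitted set. The following is a minimal sketch of that check on sample data. The helper names `cap_field` and `is_subset` and the sample mask values are assumptions for illustration, not output captured from a real system.

```bash
#!/usr/bin/env bash
# Sample lines in /proc/<pid>/status format (assumed values for illustration).
sample_status='CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000'

# Print the hex mask of one set (e.g. CapEff) from status text on stdin.
cap_field() {
  awk -v key="$1:" '$1 == key { print $2 }'
}

# Succeed if every bit set in hex mask $1 is also set in hex mask $2.
is_subset() {
  local a=$(( 16#$1 )) b=$(( 16#$2 ))
  (( (a & ~b) == 0 ))
}

eff=$(printf '%s\n' "$sample_status" | cap_field CapEff)
prm=$(printf '%s\n' "$sample_status" | cap_field CapPrm)

if is_subset "$eff" "$prm"; then
  echo "effective ($eff) is within permitted ($prm)"
fi
```

To inspect a live process instead of the sample, feed `cap_field` the contents of `/proc/$$/status`.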
```bash
# View all sets
cat /proc/$$/status | grep Cap
# CapInh: Inheritable
# CapPrm: Permitted
# CapEff: Effective
# CapBnd: Bounding
# CapAmb: Ambient
```

Setting File Capabilities

```bash
# Give a program capability without setuid
sudo setcap 'cap_net_bind_service=+ep' /path/to/program

# Verify
getcap /path/to/program

# Remove capabilities
sudo setcap -r /path/to/program
```

Container Capabilities
Docker Default Capabilities

Docker containers run with a restricted set:

```
Default Docker capabilities:
- CAP_CHOWN
- CAP_DAC_OVERRIDE
- CAP_FSETID
- CAP_FOWNER
- CAP_MKNOD
- CAP_NET_RAW
- CAP_SETGID
- CAP_SETUID
- CAP_SETFCAP
- CAP_SETPCAP
- CAP_NET_BIND_SERVICE
- CAP_SYS_CHROOT
- CAP_KILL
- CAP_AUDIT_WRITE

NOT included (dangerous):
- CAP_SYS_ADMIN   ← Container escape risk!
- CAP_NET_ADMIN   ← Network manipulation
- CAP_SYS_PTRACE  ← Debug other processes
- CAP_SYS_MODULE  ← Load kernel modules
```

Modifying Container Capabilities
```bash
# Drop all capabilities
docker run --cap-drop=ALL nginx

# Add specific capability
# (the stock nginx image may need more, such as CHOWN, SETUID and SETGID, to start)
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx

# Run privileged (ALL capabilities - dangerous!)
docker run --privileged nginx
```

Pause and predict: If you run a container with `--cap-drop=ALL` but do not change the user, the processes inside will still technically be running as user ID 0 (root). If this containerized root user attempts to modify a file owned by another user, will the operation succeed? Why or why not?
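
One way to reason about the question above is to compute the mask that `--cap-drop=ALL` plus a few `--cap-add` flags would leave, and then test for the bit that lets UID 0 bypass file permissions (CAP_DAC_OVERRIDE). This is only a model of the bookkeeping. The helper names `cap_bit`, `mask_after_drop_all` and `has_cap` are hypothetical, and only a handful of capabilities are included.

```bash
#!/usr/bin/env bash
# Bit numbers (from linux/capability.h) for the capabilities used in this sketch.
cap_bit() {
  case "$1" in
    CHOWN)            echo 0 ;;
    DAC_OVERRIDE)     echo 1 ;;
    FOWNER)           echo 3 ;;
    KILL)             echo 5 ;;
    NET_BIND_SERVICE) echo 10 ;;
    NET_RAW)          echo 13 ;;
    *) echo "unknown capability: $1" >&2; return 1 ;;
  esac
}

# Model "--cap-drop=ALL" followed by one "--cap-add" per argument:
# start from an empty mask and set one bit per added capability.
mask_after_drop_all() {
  local mask=0 name bit
  for name in "$@"; do
    bit=$(cap_bit "$name") || return 1
    mask=$(( mask | (1 << bit) ))
  done
  printf '%016x\n' "$mask"
}

# Succeed if hex mask $1 has capability $2 set.
has_cap() {
  local bit
  bit=$(cap_bit "$2") || return 1
  (( (16#$1 >> bit) & 1 ))
}

mask=$(mask_after_drop_all NET_BIND_SERVICE)
echo "CapEff would be: $mask"
if has_cap "$mask" DAC_OVERRIDE; then
  echo "UID 0 can bypass file permission checks"
else
  echo "UID 0 is subject to normal file permission checks"
fi
```

Without CAP_DAC_OVERRIDE, UID 0 is checked like any other user: writing to a file owned by someone else succeeds only if the file's "other" permission bits allow it.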
Kubernetes SecurityContext
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    securityContext:
      capabilities:
        drop:
        - ALL               # Drop everything first
        add:
        - NET_BIND_SERVICE  # Add only what's needed
```

Linux Security Modules (LSMs)

Beyond DAC (discretionary access control) and capabilities, LSMs provide mandatory access control (MAC).
LSM Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                         SECURITY STACK                          │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                       Application                       │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│                          System Call                            │
│                               │                                 │
│  ┌────────────────────────────▼────────────────────────────┐    │
│  │           DAC (Discretionary Access Control)            │    │
│  │                 File permissions (rwx)                  │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│  ┌────────────────────────────▼────────────────────────────┐    │
│  │                      Capabilities                       │    │
│  │                Does process have CAP_*?                 │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│  ┌────────────────────────────▼────────────────────────────┐    │
│  │                           LSM                           │    │
│  │          AppArmor / SELinux / Seccomp / SMACK           │    │
│  │             Mandatory Access Control (MAC)              │    │
│  └────────────────────────────┬────────────────────────────┘    │
│                               │                                 │
│                             Kernel                              │
└─────────────────────────────────────────────────────────────────┘
```

Available LSMs

| LSM | Distribution | Approach |
|---|---|---|
| SELinux | RHEL, CentOS, Fedora | Label-based, complex |
| AppArmor | Ubuntu, Debian, SUSE | Path-based, simpler |
| Seccomp | All (a syscall filter, not strictly an LSM) | System call filtering |
| SMACK | Embedded systems | Simplified labels |
Check What’s Active
```bash
# Which LSM is active
cat /sys/kernel/security/lsm

# AppArmor status
sudo aa-status

# SELinux status
sestatus
getenforce
```

AppArmor

Path-based mandatory access control. Simpler than SELinux.
Profile Modes
| Mode | Behavior |
|---|---|
| Enforce | Blocks and logs violations |
| Complain | Logs but doesn’t block |
| Unconfined | No restrictions |
Viewing Profiles
```bash
# List profiles
sudo aa-status

# Sample output:
# apparmor module is loaded.
# 32 profiles are loaded.
# 30 profiles are in enforce mode.
#    /usr/bin/evince
#    /usr/sbin/cups-browsed
#    docker-default

# View a profile
cat /etc/apparmor.d/usr.bin.evince
```

AppArmor Profile Structure
```
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  # Allow network access
  network inet tcp,
  network inet udp,

  # Allow reading config
  /etc/nginx/** r,

  # Allow writing logs
  /var/log/nginx/** rw,

  # Allow web root
  /var/www/** r,

  # Deny everything else by default
}
```

Stop and think: If an attacker successfully gains code execution inside your Nginx process and attempts to overwrite an HTML file in `/var/www/`, what will AppArmor do based on the profile above? Will standard Linux file permissions (DAC) even be evaluated?
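
To reason about the question, the file rules of the profile can be modelled as a lookup from path to granted access. The sketch below is a deliberately simplified stand-in, not AppArmor's real matcher: the function name `profile_check` is hypothetical, bash `case` patterns only approximate the `**` glob, and rules pulled in by the `abstractions` includes are ignored.

```bash
#!/usr/bin/env bash
# Simplified model of the nginx profile's file rules.
# Prints "allow" or "deny" for a path ($1) and one requested access ($2: r or w).
# A path with no matching rule gets no access, mirroring "deny by default".
profile_check() {
  local path=$1 access=$2 granted=""
  case "$path" in
    /etc/nginx/*)     granted="r"  ;;
    /var/log/nginx/*) granted="rw" ;;
    /var/www/*)       granted="r"  ;;
  esac
  case "$granted" in
    *"$access"*) echo allow ;;
    *)           echo deny  ;;
  esac
}

profile_check /var/www/index.html r
profile_check /var/www/index.html w
profile_check /var/log/nginx/access.log w
profile_check /etc/shadow r
```

In this model the overwrite of a file under `/var/www/` is denied because the matching rule grants only `r`. In the kernel, LSM hooks run after the ordinary DAC permission check has passed, so both layers have to agree before an operation proceeds.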
Container AppArmor
```bash
# Docker default profile
docker run --security-opt apparmor=docker-default nginx

# Custom profile
docker run --security-opt apparmor=my-custom-profile nginx

# No AppArmor (dangerous!)
docker run --security-opt apparmor=unconfined nginx
```

Kubernetes AppArmor

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Annotation form; Kubernetes 1.30+ also offers the securityContext.appArmorProfile field
    container.apparmor.security.beta.kubernetes.io/app: localhost/my-profile
spec:
  containers:
  - name: app
    image: nginx
```

Seccomp

Secure Computing Mode: filters system calls at the kernel level.
How Seccomp Works
```
┌─────────────────────────────────────────────────────────────────┐
│                         SECCOMP FILTER                          │
│                                                                 │
│  Application wants to call: open("/etc/passwd", O_RDONLY)       │
│                               │                                 │
│                               ▼                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   Seccomp BPF Filter                    │    │
│  │                                                         │    │
│  │  Is syscall "open" (nr=2) allowed?                      │    │
│  │  → Check profile rules                                  │    │
│  │  → ALLOW / ERRNO / KILL / LOG                           │    │
│  └─────────────────────────────────────────────────────────┘    │
│                               │                                 │
│            ┌──────────────────┼──────────────────┐              │
│            ▼                  ▼                  ▼              │
│          ALLOW          ERRNO(EPERM)           KILL             │
│        (proceed)       (return error)     (kill process)        │
└─────────────────────────────────────────────────────────────────┘
```

Default Docker Seccomp Profile

Docker blocks ~44 syscalls by default. The profile is written as an allowlist: any syscall not named in an `SCMP_ACT_ALLOW` rule receives the default action.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["accept", "accept4", "access", "..."],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Syscalls that end up blocked include:

- `kexec_load`: load a new kernel
- `reboot`: reboot the system
- `mount`: mount filesystems (by default)
- `ptrace`: trace processes (often blocked)
- ...40+ others

Seccomp Actions

| Action | Effect |
|---|---|
| SCMP_ACT_ALLOW | Allow syscall |
| SCMP_ACT_ERRNO | Return error code |
| SCMP_ACT_KILL | Kill the calling thread (alias of SCMP_ACT_KILL_THREAD) |
| SCMP_ACT_KILL_PROCESS | Kill the entire process (all threads) |
| SCMP_ACT_LOG | Allow but log |
| SCMP_ACT_TRACE | Notify tracer |
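
The allowlist logic can be sketched in a few lines: look the syscall up in the allowed list and fall back to the default action otherwise. This is a toy model under stated assumptions. Real filters are BPF programs that match on syscall numbers and arguments, and the variable names, function name and the short allowed list here are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical profile: a default action plus a short list of allowed syscalls.
DEFAULT_ACTION="SCMP_ACT_ERRNO"
ALLOWED_SYSCALLS="accept accept4 access read write openat"

# Print the action the toy profile would apply to the syscall named in $1.
seccomp_action() {
  local name
  for name in $ALLOWED_SYSCALLS; do
    if [ "$name" = "$1" ]; then
      echo "SCMP_ACT_ALLOW"
      return
    fi
  done
  echo "$DEFAULT_ACTION"
}

seccomp_action read
seccomp_action kexec_load
```

Because the default action applies to everything not listed, a syscall added to the kernel after the profile was written is blocked until someone explicitly allows it.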
Container Seccomp
```bash
# Use default profile (recommended)
docker run nginx

# Custom profile
docker run --security-opt seccomp=/path/to/profile.json nginx

# No seccomp (dangerous!)
docker run --security-opt seccomp=unconfined nginx
```

Kubernetes Seccomp

```yaml
apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # Use container runtime's default
  containers:
  - name: app
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/my-profile.json
```

Common Mistakes

| Mistake | Problem | Solution |
|---|---|---|
| Running containers as privileged | Full capabilities, escape risk | Use specific capabilities instead |
| Not dropping capabilities | Unnecessary attack surface | Drop ALL, add only needed |
| Disabling seccomp | Allows dangerous syscalls | Use RuntimeDefault or custom profile |
| Ignoring AppArmor/SELinux | Missing MAC protection | Keep enabled, use container profiles |
| CAP_SYS_ADMIN “for convenience” | Major security risk | Find specific capability needed |
| No capabilities understanding | Can’t debug permission issues | Learn common capabilities |
Question 1
You are auditing a Kubernetes cluster and notice a Pod specification where the developer has requested the CAP_SYS_ADMIN capability “just in case” they need to debug network issues later. What is the actual security risk of allowing this capability in a container environment?
Show Answer
CAP_SYS_ADMIN is often referred to as “the new root” because it is a massive catch-all capability that grants a wide array of powerful administrative privileges. If a container has this capability, a compromised process within the container can potentially mount filesystems, manipulate namespaces, and perform many other privileged kernel operations. This drastically increases the likelihood of a container escape, allowing the attacker to gain full control over the underlying host node. It is also the wrong tool for the stated purpose: network debugging is covered by narrower capabilities such as CAP_NET_ADMIN and CAP_NET_RAW, and ptrace-based debugging by CAP_SYS_PTRACE. Granting CAP_SYS_ADMIN therefore violates the principle of least privilege and should never be done merely for convenience or debugging.
Question 2
Your team wants to secure a legacy application container. One engineer suggests using AppArmor to prevent the application from reading /etc/shadow, while another suggests using seccomp to block the execve system call so it can’t spawn a shell. How do these two security mechanisms fundamentally differ in their approach to restricting the container?
Show Answer
AppArmor and seccomp operate at different layers of the Linux security stack to provide complementary protections. AppArmor is a Mandatory Access Control (MAC) system that uses path-based rules to restrict which files, directories, and network resources an application can access, making it ideal for blocking reads to specific locations like /etc/shadow. In contrast, seccomp filters at a lower level by restricting the actual system calls (like execve, open, or kill) that a process is allowed to make to the kernel, regardless of the target file path. Using both together provides defense-in-depth: AppArmor controls what resources can be touched, while seccomp controls how the process can interact with the kernel.
Question 3
You are deploying an Nginx reverse proxy container that needs to listen on port 80. By default, the container runs as root, which your security team has flagged as a policy violation. You change the user to a non-root user in the Dockerfile, but now Nginx crashes on startup because it cannot bind to the port. How should you configure the container’s capabilities to solve this securely?
Show Answer
To securely allow the non-root container to bind to a privileged port (under 1024), you should explicitly drop all default capabilities and only add NET_BIND_SERVICE. In Docker, this is done using the flags --cap-drop=ALL --cap-add=NET_BIND_SERVICE, and in Kubernetes, it is configured within the securityContext.capabilities block of the manifest. One caveat: a capability added to the container is not automatically effective for a process that starts as a non-root user, so the Nginx binary typically also needs the matching file capability (for example, set with setcap 'cap_net_bind_service=+ep' when the image is built). This approach adheres to the principle of least privilege by stripping away the default set of capabilities (like CAP_CHOWN or CAP_KILL) that Nginx does not actually need to function as a web server. By doing this, even if the Nginx process is compromised, the attacker’s ability to pivot or escalate privileges is severely limited.
Question 4
A developer is struggling to get a containerized VPN client to work because it needs to create virtual network interfaces (tun/tap devices). Frustrated, they add the --privileged flag to their docker run command, which immediately solves the problem. Why must you reject this pull request and require a different approach?
Show Answer
The --privileged flag effectively disables almost all of the security isolation mechanisms that make containers safe to run on a shared host. It grants the container all available Linux capabilities, removes AppArmor and seccomp filtering, and exposes all host devices to the container. If the VPN client in this container is compromised, the attacker essentially has full root access to the underlying Docker host and can trivially escape the container boundary. Instead of using --privileged, the developer should identify the exact capability needed (like CAP_NET_ADMIN) and explicitly add only that capability, perhaps alongside exposing only the specific /dev/net/tun device.
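
As a starting point for the narrower alternative, the command might be assembled as below. The image name `example/vpn-client` is a placeholder, and a real client may turn out to need one or two further capabilities once everything else is dropped. The script only prints the command for review and does not run Docker.

```bash
#!/usr/bin/env bash
# Hypothetical image name; replace with the real VPN client image.
IMAGE="example/vpn-client"

# Drop everything, then add back only what a tun/tap client is assumed to need.
docker_args=(
  run --rm
  --cap-drop=ALL
  --cap-add=NET_ADMIN
  --device /dev/net/tun
  "$IMAGE"
)

# Print the command for review instead of executing it.
printf 'docker'
printf ' %q' "${docker_args[@]}"
printf '\n'
```

Keeping the arguments in an array makes it straightforward to add or remove a single capability during review without touching the rest of the command.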
Question 5
You have applied a strict seccomp profile to your application container that blocks the mkdir system call. The application has a bug and attempts to create a directory for caching on startup. What exactly happens to the application process when it makes this system call, assuming the seccomp rule is configured with the SCMP_ACT_ERRNO action?
Show Answer
When the application attempts to invoke the blocked mkdir system call, the kernel’s seccomp BPF filter intercepts the request and evaluates it against the loaded profile. Because the action is defined as SCMP_ACT_ERRNO, the kernel immediately blocks the system call from executing and returns the error number configured for the rule (EPERM in Docker’s default profile) back to the application. The application process itself is not automatically killed by the kernel; it merely receives a failure code from the system call. It is then up to the application’s internal error handling logic to decide whether to gracefully shut down, log the error and continue, or crash ungracefully.
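
What "log the error and continue" can look like is sketched below. The failure is simulated with a path that can never be created (`/dev/null/cache`), since applying a real seccomp profile needs a container runtime. The function name `setup_cache` is hypothetical.

```bash
#!/usr/bin/env bash
# Try to create a cache directory. If the call is refused (for example by a
# seccomp rule that returns an errno), report it and let the caller decide.
setup_cache() {
  local dir=$1
  if mkdir -p "$dir" 2>/dev/null; then
    echo "cache enabled at $dir"
  else
    echo "cache disabled: could not create $dir" >&2
    return 1
  fi
}

# /dev/null is not a directory, so this mkdir fails even for root.
if ! setup_cache /dev/null/cache; then
  echo "continuing without cache"
fi
```

An application written this way keeps serving requests under a strict profile, whereas one that ignores the return value may crash later when it tries to use the missing directory.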
Hands-On Exercise
Capabilities and Security Modules

Objective: Explore capabilities, AppArmor, and seccomp.
Environment: Linux system (Ubuntu/Debian for AppArmor examples)
Part 1: Viewing Capabilities
```bash
# 1. Your process capabilities
cat /proc/$$/status | grep Cap

# 2. Decode them
capsh --decode=$(grep CapEff /proc/$$/status | cut -f2)

# 3. Check a common program
getcap /usr/bin/ping 2>/dev/null || getcap /bin/ping

# 4. List all files with capabilities
getcap -r / 2>/dev/null | head -20
```

Part 2: File Capabilities (requires root)
```bash
# 1. Create a test program
cat > /tmp/test-bind.c << 'EOF'
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main() {
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(80),
        .sin_addr.s_addr = INADDR_ANY
    };
    if (bind(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        perror("bind failed");
        return 1;
    }
    printf("Successfully bound to port 80!\n");
    return 0;
}
EOF

# 2. Compile
gcc /tmp/test-bind.c -o /tmp/test-bind

# 3. Try as normal user (should fail)
/tmp/test-bind
# bind failed: Permission denied

# 4. Add capability
sudo setcap 'cap_net_bind_service=+ep' /tmp/test-bind

# 5. Verify
getcap /tmp/test-bind

# 6. Try again (should work)
/tmp/test-bind

# 7. Clean up
rm /tmp/test-bind /tmp/test-bind.c
```

Part 3: AppArmor (Ubuntu/Debian)
```bash
# 1. Check AppArmor status
sudo aa-status

# 2. List profiles
ls /etc/apparmor.d/

# 3. View a profile (first 50 lines of whichever file exists)
{ cat /etc/apparmor.d/usr.sbin.tcpdump 2>/dev/null || \
  cat /etc/apparmor.d/usr.bin.firefox 2>/dev/null; } | head -50
```

Part 4: Seccomp Information
```bash
# 1. Check if seccomp is enabled
grep SECCOMP /boot/config-$(uname -r)

# 2. View process seccomp status
grep Seccomp /proc/$$/status

# 3. If Docker installed, check the seccomp mode inside a container (2 = filter mode)
docker run --rm alpine cat /proc/1/status | grep Seccomp
```

Part 5: Container Capabilities (if Docker available)
```bash
# 1. Default capabilities
docker run --rm alpine sh -c 'cat /proc/1/status | grep Cap'

# 2. Drop all capabilities
docker run --rm --cap-drop=ALL alpine sh -c 'cat /proc/1/status | grep Cap'

# 3. Try privileged operations
docker run --rm alpine ping -c 1 8.8.8.8                                   # Works (CAP_NET_RAW)
docker run --rm --cap-drop=ALL alpine ping -c 1 8.8.8.8                    # Expected to fail (may still work where unprivileged ICMP sockets are enabled)
docker run --rm --cap-drop=ALL --cap-add=NET_RAW alpine ping -c 1 8.8.8.8  # Works
```

Success Criteria

- Viewed and decoded process capabilities
- Found programs with file capabilities
- (Optional) Set and used file capabilities
- Explored AppArmor status
- Understood seccomp status
- (Docker) Tested capability dropping
Key Takeaways
- Capabilities split root power: 40+ specific privileges instead of all-or-nothing
- Drop ALL, add specific: best practice for container security
- LSMs add MAC: beyond DAC, mandatory controls enforce policy
- AppArmor = paths, seccomp = syscalls: complementary security layers
- `--privileged` is dangerous: gives ALL capabilities, disables protections
What’s Next?
In Module 2.4: Union Filesystems, you’ll learn how container images use layered filesystems for efficient storage and sharing.