Blog: Finding suspicious syscalls with the seccomp notifier

Authors: Sascha Grunert

Debugging software in production is one of the biggest challenges we have to
face in our containerized environments. Being able to understand the impact of
the available security options, especially when it comes to configuring our
deployments, is one of the key aspects to make the default security in
Kubernetes stronger. We have all those logging, tracing and metrics data already
at hand, but how do we assemble the information they provide into something
human readable and actionable?

Seccomp is one of the standard mechanisms to protect a Linux based
Kubernetes application from malicious actions by interfering with its system
. This allows us to restrict the application to a defined set of
actionable items, like modifying files or responding to HTTP requests. Linking
the knowledge of which set of syscalls is required to, for example, modify a
local file, to the actual source code is in the same way non-trivial. Seccomp
profiles for Kubernetes have to be written in JSON and can be understood
as an architecture specific allow-list with superpowers, for example:

 "defaultAction": "SCMP_ACT_ERRNO",
 "defaultErrnoRet": 38,
 "defaultErrno": "ENOSYS",
 "syscalls": [
 "names": ["chmod", "chown", "open", "write"],
 "action": "SCMP_ACT_ALLOW"

The above profile errors by default specifying the defaultAction of
SCMP_ACT_ERRNO. This means we have to allow a set of syscalls via
SCMP_ACT_ALLOW, otherwise the application would not be able to do anything at
all. Okay cool, for being able to allow file operations, all we have to do is
adding a bunch of file specific syscalls like open or write, and probably
also being able to change the permissions via chmod and chown, right?
Basically yes, but there are issues with the simplicity of that approach:

Seccomp profiles need to include the minimum set of syscalls required to start
the application. This also includes some syscalls from the lower level
Open Container Initiative (OCI) container runtime, for example
runc or crun. Beside that, we can only guarantee the required
syscalls for a very specific version of the runtimes and our application,
because the code parts can change between releases. The same applies to the
termination of the application as well as the target architecture we’re
deploying on. Features like executing commands within containers also require
another subset of syscalls. Not to mention that there are multiple versions for
syscalls doing slightly different things and the seccomp profiles are able to
modify their arguments. It’s also not always clearly visible to the developers
which syscalls are used by their own written code parts, because they rely on
programming language abstractions or frameworks.

How can we know which syscalls are even required then? Who should create and
maintain those profiles during its development life-cycle?

Well, recording and distributing seccomp profiles is one of the problem domains
of the Security Profiles Operator, which is already solving that. The
operator is able to record seccomp, SELinux and even
AppArmor profiles into a Custom Resource Definition (CRD),
reconciles them to each node and makes them available for usage.

The biggest challenge about creating security profiles is to catch all code
paths which execute syscalls. We could achieve that by having 100% logical
coverage of the application when running an end-to-end test suite. You get the
problem with the previous statement: It’s too idealistic to be ever fulfilled,
even without taking all the moving parts during application development and
deployment into account.

Missing a syscall in the seccomp profiles’ allow list can have tremendously
negative impact on the application. It’s not only that we can encounter crashes,
which are trivially detectable. It can also happen that they slightly change
logical paths, change the business logic, make parts of the application
unusable, slow down performance or even expose security vulnerabilities. We’re
simply not able to see the whole impact of that, especially because blocked
syscalls via SCMP_ACT_ERRNO do not provide any additional audit
logging on the system.

Does that mean we’re lost? Is it just not realistic to dream about a Kubernetes
where everyone uses the default seccomp profile? Should we
stop striving towards maximum security in Kubernetes and accept that it’s not
meant to be secure by default?

Definitely not. Technology evolves over time and there are many folks
working behind the scenes of Kubernetes to indirectly deliver features to
address such problems. One of the mentioned features is the seccomp notifier,
which can be used to find suspicious syscalls in Kubernetes.

The seccomp notify feature consists of a set of changes introduced in Linux 5.9.
It makes the kernel capable of communicating seccomp related events to the user
space. That allows applications to act based on the syscalls and opens for a
wide range of possible use cases. We not only need the right kernel version,
but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier
work at all. The Kubernetes container runtime CRI-O gets support for
the seccomp notifier in v1.26.0
. The new feature allows us to
identify possibly malicious syscalls in our application, and therefore makes it
possible to verify profiles for consistency and completeness. Let’s give that a

First of all we need to run the latest main version of CRI-O, because v1.26.0
has not been released yet at time of writing. You can do that by either
compiling it from the source code or by using the pre-built binary
bundle via the get-script. The seccomp notifier feature of CRI-O is
guarded by an annotation, which has to be explicitly allowed, for example by
using a configuration drop-in like this:

> cat /etc/crio/crio.conf.d/02-runtimes.conf
default_runtime = "runc"

allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]

If CRI-O is up and running, then it should indicate that the seccomp notifier is
available as well:

> sudo ./bin/crio --enable-metrics
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP

We also enable the metrics, because they provide additional telemetry data about
the notifier. Now we need a running Kubernetes cluster for demonstration
purposes. For this demo, we mainly stick to the
hack/ approach to locally spawn a single node
Kubernetes cluster.

If everything is up and running, then we would have to define a seccomp profile
for testing purposes. But we do not have to create our own, we can just use the
RuntimeDefault profile which gets shipped with each container runtime. For
example the RuntimeDefault profile for CRI-O can be found in the
containers/common library.

Now we need a test container, which can be a simple nginx pod like

apiVersion: v1
kind: Pod
 name: nginx
 io.kubernetes.cri-o.seccompNotifierAction: "stop"
 restartPolicy: Never
 - name: nginx
 image: nginx:1.23.2
 type: RuntimeDefault

Please note the annotation io.kubernetes.cri-o.seccompNotifierAction, which
enables the seccomp notifier for this workload. The value of the annotation can
be either stop for stopping the workload or anything else for doing nothing
else than logging and throwing metrics. Because of the termination we also use
the restartPolicy: Never to not automatically recreate the container on

Let’s run the pod and check if it works:

> kubectl apply -f nginx.yaml
> kubectl get pods -o wide
nginx 1/1 Running 0 3m39s <none> <none>

We can also test if the web server itself works as intended:

> curl
<!DOCTYPE html>
<title>Welcome to nginx!</title>

While everything is now up and running, CRI-O also indicates that it has started
the seccomp notifier:

INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb

If we would now run a forbidden syscall inside of the container, then we can
expect that the workload gets terminated. Let’s give that a try by running
chroot in the containers namespaces:

> kubectl exec -it nginx -- bash
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137

The exec session got terminated, so it looks like the container is not running
any more:

> kubectl get pods
nginx 0/1 seccomp killed 0 96s

Alright, the container got killed by seccomp, do we get any more information
about what was going on?

> kubectl describe pod nginx
Name: nginx
 State: Terminated
 Reason: seccomp killed
 Message: Used forbidden syscalls: chroot (1x)
 Exit Code: 137
 Started: Mon, 14 Nov 2022 12:19:46 +0100
 Finished: Mon, 14 Nov 2022 12:20:26 +0100

The seccomp notifier feature of CRI-O correctly set the termination reason and
message, including which forbidden syscall has been used how often (1x). How
often? Yes, the notifier gives the application up to 5 seconds after the last
seen syscall until it starts the termination. This means that it’s possible to
catch multiple forbidden syscalls within one test by avoiding time-consuming
trial and errors.

> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl describe pod nginx | grep Message
 Message: Used forbidden syscalls: chroot (2x), swapoff (2x)

The CRI-O metrics will also reflect that:

> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1

How does it work in detail? CRI-O uses the chosen seccomp profile and injects
SCMP_ACT_KILL_PROCESS or SCMP_ACT_KILL_THREAD. It also sets a local listener
path which will be used by the lower level OCI runtime (runc or crun) to create
the seccomp notifier socket. If the connection between the socket and CRI-O has
been established, then CRI-O will receive notifications for each syscall being
interfered by seccomp. CRI-O stores the syscalls, allows a bit of timeout for
them to arrive and then terminates the container if the chosen
seccompNotifierAction=stop. Unfortunately, the seccomp notifier is not able to
notify on the defaultAction, which means that it’s required to have
a list of syscalls to test for custom profiles. CRI-O does also state that
limitation in the logs:

INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
which means that syscalls using that default action can't be traced by the notifier

As a conclusion, the seccomp notifier implementation in CRI-O can be used to
verify if your applications behave correctly when using RuntimeDefault or any
other custom profile. Alerts can be created based on the metrics to create long
running test scenarios around that feature. Making seccomp understandable and
easier to use will increase adoption as well as help us to move towards a more
secure Kubernetes by default!

Thank you for reading this blog post. If you’d like to read more about the
seccomp notifier, checkout the following resources:

Originally posted on Kubernetes – Production-Grade Container Orchestration

Deja una respuesta

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *