eBPF in OpenShift: finding the namespace

eBPF has recently become very popular for tracing, debugging, and performance measurement. Here I'll show how to use it in an OpenShift environment to find out which container or namespace is responsible for some behaviour.

Note: This is not an introductory text to eBPF (extended Berkeley Packet Filter) -- please read good books for that. This is just a story about using it in an OpenShift PaaS.

One of the major pain points when tracing an entire node is finding out which container or pod is responsible for some syscall. There's been some discussion about namespace support in eBPF; one recent addition is cgroup_path, but that is not (yet) available on Red Hat's current kernel versions. (Neither is override(), sadly; that would have lots of good uses, even if it can't be used directly in tracepoint:syscalls:*.)

The currently available bpftrace version for Red Hat Enterprise Linux 8 is v0.12.1; that's not quite fresh, but it suffices for our needs here.

After some digging through kernel headers we get this:

tracepoint:syscalls:sys_enter_exec* {
    printf("exec from %s\n", str(curtask->cgroups->subsys[2]->cgroup->kn->name));
}

only to be disappointed by this error message (line-wrapped for readability):

stdin:1:69-114: ERROR: Unknown struct/union: 'struct kernfs_node'
tracepoint:syscalls:sys_enter_execve* {
    printf("exec from %s\n",
      str(curtask->cgroups->subsys[2]->cgroup->kn->name)); }
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thorough checking shows that "struct kernfs_node" is available from /sys/kernel/btf/vmlinux, as expected...

After quite some time and lots of experiments this turns out to be a bug: adding another clause seems to make bpftrace parse more structures, and then the script succeeds:

tracepoint:syscalls:sys_enter_exec* {
    printf("exec from %s\n", str(curtask->cgroups->subsys[2]->cgroup->kn->name));
}

tracepoint:syscalls:sys_exit_clone {
    if (args->ret) {
        @void[1]=args->ret;
    }
}

So we arrive at a kind-of PaaS-compatible "execsnoop" version like this, with the namespace information moved out because of the already-too-long lines:

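tracepoint:syscalls:sys_enter_exec* {
    // kernfs node of the task's cgroup; its address is used as the map key
    $ns = curtask->cgroups->subsys[2]->cgroup->kn;
    if (!@namespace[$ns]) {
        // first time this cgroup shows up: remember its parent's name and its own
        @namespace[$ns] = 1;
        @namespace_str[$ns] =
            (str($ns->parent->name,200),
             str($ns->name,200));
    }

    // timestamp in microseconds, command, PID, cgroup pointer, executable
    printf("%15lld: %15s:%8d in %p: execve %s; ",
        nsecs/1000, comm, pid, $ns,
        str(args->filename,200));
    join(args->argv);
}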

tracepoint:syscalls:sys_exit_clone {
    if (args->ret) { @void[0] = args->ret; }
}

END {
    clear(@namespace);
    clear(@void);
}

This gives (more or less nicely formatted) output like

  2583423567531:   runc:[2:INIT]: 2205910 in 0xffff98425d580e58: execve /usr/bin/grpc_health_probe; grpc_health_probe -addr=:50051

and a summary of namespaces seen:

@namespace_str[0xffff98425d580e58]: (kubepods-burstable-podda0fc44d_097e_4b6f_9acb_ccef37f65eab.slic, crio-1b0d078fe69bdf8ffe47c06f9f0dc119cd1d7bf49358bee34774cf99b5)

The second entry ("crio-...") allows us to look up detailed information via "crictl inspect 1b0d078fe" (yes, the ID can be abbreviated; depending on your debug container's contents you might need a "chroot /host/" first, though).

{
  "status": {
    "id": "1b0d078fe69bdf8ffe47c06f9f0dc119cd1d7bf49358bee34774cf99b52aba99",
    "metadata": {
      "attempt": 0,
      "name": "registry-server"
    },
    "state": "CONTAINER_RUNNING",
    ...
    "labels": {
      "io.kubernetes.container.name": "registry-server",
      "io.kubernetes.pod.name": "redhat-operators-kpsnv",
      "io.kubernetes.pod.namespace": "openshift-marketplace",
      "io.kubernetes.pod.uid": "da0fc44d-097e-4b6f-9acb-ccef37f65eab"
    },
...
}
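
The manual lookup can also be scripted. What follows is only a sketch, under a few assumptions: the map dump has exactly the shape of the @namespace_str line shown earlier, the pod cgroup name carries the pod UID with "_" in place of "-" (as it does in that sample), and the labels sit under "status" as in the excerpt. The function names parse_namespace_line, pod_identity and crictl_inspect are made up for this example. Note that both names in the sample are cut off after 63 characters, so the container ID is only a prefix; that is fine, since crictl accepts abbreviated IDs.

```python
import json
import re
import subprocess

# One line of bpftrace's map dump, as in the sample output:
# @namespace_str[0x...]: (parent cgroup name, cgroup name)
MAP_LINE = re.compile(r"^@namespace_str\[(0x[0-9a-f]+)\]: \((.*), (.*)\)$")
# Assumption: the pod cgroup name contains "pod<uid>" with "_" instead of "-".
POD_UID = re.compile(r"pod([0-9a-f]{8}(?:_[0-9a-f]{4}){3}_[0-9a-f]{12})")
# Assumption: the container cgroup name is "crio-<container id>".
CRIO_ID = re.compile(r"crio-([0-9a-f]+)")


def parse_namespace_line(line):
    """Split one @namespace_str line into (pointer, pod UID, container ID prefix)."""
    m = MAP_LINE.match(line.strip())
    if m is None:
        return None
    pointer, parent, name = m.groups()
    pod = POD_UID.search(parent)
    cid = CRIO_ID.match(name)
    return (
        pointer,
        pod.group(1).replace("_", "-") if pod else None,
        cid.group(1) if cid else None,
    )


def pod_identity(inspect_json):
    """Pick (namespace, pod, container) out of "crictl inspect" JSON output."""
    labels = json.loads(inspect_json)["status"]["labels"]
    return (
        labels["io.kubernetes.pod.namespace"],
        labels["io.kubernetes.pod.name"],
        labels["io.kubernetes.container.name"],
    )


def crictl_inspect(container_id):
    """Run crictl on the node; needs crictl in PATH (e.g. after chroot /host/)."""
    result = subprocess.run(
        ["crictl", "inspect", container_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Feeding each @namespace_str line through parse_namespace_line() and then calling pod_identity(crictl_inspect(container_id)) would turn the raw pointer keys into namespace/pod/container triples. The pod UID from the cgroup name can serve as a cross-check against the "io.kubernetes.pod.uid" label.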

Based on that template we can now identify lots of interesting behaviour - and associate it with the container concerned...

PS: Yes, it's possible to record the process IDs (PIDs), all fork()/clone() calls and an initial snapshot of the process tree, and later on associate PIDs backwards until we know which process something originated from.

That gets much more involved, though - at least the OpenShift services themselves (like "crio") start things in other namespaces, so we'd need to record (and reverse-lookup) a lot more syscalls - "setns", "unshare", and last but not least writing to special filenames like "/host/sys/fs/cgroup/systemd/kubepods.slice/**/tasks"...

Much easier to traverse and report the namespaces seen during tracing.
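
For comparison, here is a rough sketch of just the core of the PID bookkeeping mentioned in the PS. It assumes an initial snapshot as a {pid: ppid} mapping plus recorded (parent, child) pairs from fork()/clone(); build_parent_map and find_origin are hypothetical names, and the sketch ignores PID reuse as well as the setns/unshare/cgroup-file writes that make the real thing so much more involved.

```python
def build_parent_map(snapshot, clone_events):
    """Merge an initial {pid: ppid} snapshot with (parent, child) clone records."""
    parent_of = dict(snapshot)
    for parent, child in clone_events:
        parent_of[child] = parent
    return parent_of


def find_origin(pid, parent_of, roots):
    """Walk parent links upwards until a PID listed in roots is reached.

    Returns the matching root PID, or None if the chain ends (or loops)
    before any known root is found.
    """
    seen = set()
    while pid not in roots:
        if pid in seen or pid not in parent_of:
            return None
        seen.add(pid)
        pid = parent_of[pid]
    return pid
```

Here roots would be the set of PIDs whose container is already known, for example the container init processes from the snapshot.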

author

Philipp Marek
