Troubleshooting with strace

Obscure command line options, lots of data, output readable only for the initiated - what's not to like?

"strace" is a Linux tool that intercepts system calls from one or more processes to log their inputs and outputs. (There are Windows equivalents, eg. "ProcMon" and the ETW suite.)

This helps to cut the search space for unexplained behaviour in half - carefully looking at what goes into the OS kernel and what comes out again often offers enough clues to decide whether the application or the kernel side (which includes all the hardware, most notably storage and networking!) is at "fault".

Advantages of "strace" are:

  • it's everywhere -- if not already installed on the machine being analysed, it's likely available from the distribution;
  • its output is more or less consistent across architectures (amd64, arm64, ...);
  • with a few options it can report syscall times (down to micro- or nanoseconds) as well, which can be a tremendous help.

Disadvantages are:

  • If you're not careful, you can take down either your interactive (ssh) session or even the whole machine.
    The first happens if you create a feedback loop, eg. by tracing the sshd process you're connected to and writing the output to your console - each line of output causes quite a few syscalls, which produce more output, which causes more syscalls, ... ad infinitum.
    The latter is likely if you (try to) trace "kubelet" in an OpenShift cluster - it not only relays your "oc debug node" or "oc exec" data, but manages everything on the worker; even when logging to a file, the machine had to be rebooted to get it working properly again.
  • A small one is that older versions (RHEL6, RHEL7) have broken output -- in the sense that eg. the parens don't match up.
  • And the big disadvantage is: it incurs overhead, so it's not usable as a performance measurement tool as such.

My favourite call goes like this (see the strace man page for details):

strace -fttT -o <filename> -s <bytes> -p <pid>

and I vary the number of bytes depending on my needs -- from zero (when the data itself doesn't matter, and/or to reduce overhead) to lots-of-KB (to capture whole network communications, eg. if it's not easily possible to use "tcpdump").
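
For instance (the PID and the output file name are just placeholders), a timing-only trace and a data-capturing trace of the same process might look like this:

strace -fttT -o /tmp/app.trace -s 0 -p 1234
strace -fttT -o /tmp/app.trace -s 65535 -p 1234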

The following is a heavily redacted example (timestamp and PID removed, lines broken and shortened) for looking at a DNS request done via "host brz.gv.at":

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 9 <0.000058>
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("…")}, 16) = 0 <0.000045>
sendmmsg(9, [{msg_hdr={msg_name=NULL, 
      msg_namelen=0, 
      msg_iov=[{iov_base="Gf\1\0\0\1\0\0\0\0\0\0\3brz\2gv\2at\0\0\17\0\1", iov_len=27}],
      msg_iovlen=1, msg_controllen=0, msg_flags=0},
    msg_len=27}], 1, 0) = 1 <0.000087>
recvmsg(9, {msg_name={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("…")},
      msg_namelen=128 => 16,
      msg_iov=[{iov_base="Gf\201\200\0\1\0\0\0\1\0\0\3v\2"..., iov_len=1310700}],
      msg_iovlen=1, msg_controllen=0, msg_flags=0},
    0) = 97 <0.000065>

The last line shows the result being received; if the "strace" command had specified "-s 97" or higher, the complete DNS answer would be visible.
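
To reproduce such a trace yourself, an invocation along these lines should do (the output file name is only an example; adding "-e trace=%network" would further restrict the log to the network-related syscalls):

strace -fttT -s 128 -o host.trace host brz.gv.at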

Here are a few examples of its usefulness.

Compare (hardware) configurations

A few VMs are notably slower than others - the load balancer logfiles show a consistently higher response time for them.

By comparing the timing of a null-overhead syscall, eg. via

strace -e getuid --syscall-times=ns -- id

the same performance difference could be seen -- but as this syscall has no external dependencies (ie. it uses neither the network nor storage), only the raw machine itself could be the culprit. Further investigation showed that the BIOS settings were wrong after an update cycle - CPU and RAM speed had been reset.
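
A minimal sketch for such a comparison (run on each machine; the file names are placeholders, and the awk expression simply averages the durations reported between "<" and ">"):

for i in $(seq 1 100); do strace -e getuid --syscall-times=ns -o /tmp/getuid.$i -- id > /dev/null; done
awk -F'[<>]' '/getuid/ { sum += $2; n++ } END { printf "%d calls, %.0f ns average\n", n, sum / n * 1e9 }' /tmp/getuid.*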

Network (and storage) performance diagnosis

When looking at network operations (like "send", "recv", "sendmsg", "recvmsg", "write" or "read" to/from a socket, etc.), associating requests with their answers makes the latencies of remote services visible. If individual "write" calls take too long, network problems (TCP buffers, packet loss, ICMP troubles, ...) might even be discerned.
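
A rough sketch (assuming a blocking socket, the file descriptor 9 from the DNS example above, and a hypothetical trace file app.trace; interrupted calls are simply skipped): the durations of the blocking "recvmsg" calls approximate the remote service's response latency.

grep 'recvmsg(9,' app.trace | grep -v unfinished | sed 's/.*<//; s/>$//' | awk '{ sum += $1; n++; if ($1 > max) max = $1 } END { printf "%d calls, avg %.6f s, max %.6f s\n", n, sum / n, max }'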

For storage operations it's not quite that easy: a filesystem, with all its associated complexity (syncs, datasyncs, journaling, ...), typically causes write amplification; the backend storage (rotating disks, SSDs, Fibre Channel, iSCSI, NVMe, NFS, ...) might have its own caching (and acknowledging) strategy, possibly depending on the battery-backup charge (a broken or nearly empty battery might mean direct writes instead of write buffering, and therefore higher latency); and so on.
Still, comparing two similar systems might help to diagnose the issue.
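
The same kind of extraction works for explicit flushes ("fsync"/"fdatasync"), which with buffered writes is often where the actual waiting happens (again a hypothetical trace file name):

grep -E 'fsync\(|fdatasync\(' app.trace | grep -v unfinished | sed 's/.*<//; s/>$//' | awk '{ sum += $1; n++; if ($1 > max) max = $1 } END { printf "%d flushes, avg %.6f s, max %.6f s\n", n, sum / n, max }'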

When the data is encrypted (eg. with TLS connections), you won't see the actual requests (SQL statements, HTTP queries, etc.). In that case,

  • the IP address and port are enough information to identify the service;
  • or the (microsecond) timestamp, along with the local IP address, can be associated with a log entry (HTTP or database log) on the remote side;
  • as a last resort, the "SSLKEYLOGFILE" environment variable might be able to help.
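
For the last option, a sketch could look like this (assuming a client that supports the variable - curl does, for example, as do Firefox and Chrome; the file name is a placeholder):

SSLKEYLOGFILE=/tmp/tls-keys.log curl https://brz.gv.at/

The resulting key log can then be loaded into Wireshark ("(Pre)-Master-Secret log filename" in the TLS protocol preferences) to decrypt a parallel "tcpdump" capture.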

Quick architecture check

Tracing an application might turn up a few easy things to fix (a few crude checks are sketched after this list), like

  • Starting and closing HTTPS connections all the time, ie. Not Using a Connection Pool.
    This hurts because of the TCP and TLS handshake - in the worst case it means a secondary HTTP request (for the CRL) every time...
  • Redoing the same operation over and over again.
    Tomcat likes to reload its configuration, the JVM reads /proc/stat every so often, etc.
  • Unexpected delays - a nice one was a DNS misconfiguration so that every lookup had to wait for a timeout before using the correct (available) DNS server.
    Things like these also help to enhance resilience -- in a datacenter the DNS timeouts can be set to a second instead of the default 3, for example!
  • Applications sleeping instead of working.
    This can show up in multiple ways -- race conditions between a polling loop and the event, actual "clock_nanosleep" (and similar) calls, or the famous "Speed-Up Loop" (visible via delays between syscalls).
  • Another non-optimal behaviour would be lots of small writes (or reads).
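
A few crude checks can quantify such patterns in a trace (app.trace and server.xml are placeholders; the counts stand for new HTTPS connections, explicit sleeps, and configuration re-reads respectively):

grep -c 'connect(.*sin_port=htons(443)' app.trace
grep -c 'clock_nanosleep(' app.trace
grep -c 'openat(.*server.xml' app.trace

For the DNS timeouts mentioned above, "options timeout:1" in /etc/resolv.conf is the usual knob.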

Summary and conclusion

While the strace output looks daunting (from the amount of data, the many details, and the hard-to-get big-picture overview), programmers should be in a good position to interpret it -- their programs are causing it, and it's "just" (ha!) a more detailed view of the individual program statements.

BTW, I'm well aware of "perf trace" -- but that one doesn't show contents (no file names, no actual bytes being sent, received, written, or read; and in an OpenShift container it didn't even give syscall names!), so while I'm all for decreased overhead, it doesn't give me the data I need (yet).

Last, but not least -- if you already have some idea about the problem and need more specific tracing/debugging, eBPF is your friend!
