Strace Command: The Ultimate Tool for Tracing Linux System Calls!

1. What is strace

In simple terms, strace acts like a “listener” between your program and the operating system.

Does your program want to read a file, send a network request, or make any other system call? strace sits in the middle and records all of these conversations.

The most crucial part is: No need to modify the code, no need to restart the program, no need to add logs, just start tracing.

When it comes to troubleshooting, this tool is simply amazing!

First, take a look at the example below:

# Trace a running process to see what it is doing
# -T shows the time spent in each system call (the <...> at the end of the line)
$ strace -T -p 12345

connect(3, {sa_family=AF_INET, sin_port=htons(6379), 
        sin_addr=inet_addr("192.168.1.100")}, 16) = -1 ETIMEDOUT 
        (Connection timed out) <30.002345>

# The result is clear: the program cannot connect to Redis (port 6379)
# One command is all it takes. That is the charm of strace: no need to read the source code or add logs, just watch the dialogue between the program and the operating system.

2. Basic Usage, Get Started in 3 Minutes

# Trace all system calls of a command
# See what the ls command actually does
# (much of the output is omitted)
$ strace ls

execve("/usr/bin/ls", ["ls"], 0x7ffc...) = 0
brk(NULL)                               = 0x55555576d000
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 3
getdents64(3, /* 10 entries */, 32768)  = 320
write(1, "file1.txt\nfile2.txt\n", 20)  = 20
close(3)                                = 0
exit_group(0)                           = ?

Once you understand these few lines, you know what ls does:
# execve - Load the ls program
# openat - Open a file or directory (here the linker cache, then the current directory)
# getdents64 - Read directory entries
# write - Output to terminal
# close - Close the file descriptor
# exit_group - Program exit
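
To get this kind of "skeleton" view for a longer trace, the call names can be pulled out and repeated calls collapsed. This is only a minimal sketch: syscall_sequence is a hypothetical helper name, and it assumes raw strace output without -f or -tt prefixes.

```shell
# syscall_sequence: hypothetical helper. Reads raw strace output on stdin
# (no PID or timestamp prefixes) and prints the syscall names in call order,
# collapsing consecutive repeats of the same call.
syscall_sequence() {
    sed -n 's/^\([a-z_][a-z0-9_]*\)(.*/\1/p' | uniq
}

# Usage (assumes strace is installed; ls output itself is discarded):
#   strace ls 2>&1 >/dev/null | syscall_sequence
```

Lines that do not start with a call name (signals, exit messages) are simply skipped.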

3. Practical Examples

3.1. Third-party API Call Failure

# -e trace=network only see network-related calls
# -s 1000 set the string display length to 1000 (default is only 32)
# (part of the output is omitted)
$ strace -e trace=network -s 1000 curl https://api.opsnot.com/v1/user

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3                       # TCP socket created successfully
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(443), 
        sin_addr=inet_addr("1.2.3.4")}, 16) = 0                     # TCP connection established successfully (connect returns 0)
sendto(3, "GET /v1/user HTTP/1.1\r\nHost: api.opsnot.com\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\n\r\n", 87, MSG_NOSIGNAL, NULL, 0) = 87  # Request sent successfully
recvfrom(3, "HTTP/1.1 401 Unauthorized\r\nContent-Type: application/json\r\nContent-Length: 52\r\n\r\n{\"error\":\"invalid token\",\"code\":401}", 16384, 0, NULL, NULL) = 145  # Received server response

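The trace shows that the request was sent successfully and the server returned 401, most likely because of an expired or invalid token. (The payload is shown in plaintext for illustration. With real HTTPS traffic, strace only sees the TLS-encrypted bytes on the socket, so reading payloads this way works for unencrypted connections.)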
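
When the trace is long, the status lines can be extracted instead of read by eye. A minimal sketch, assuming the traffic is plain HTTP (TLS payloads are encrypted and will not match) and that the status line sits at the start of a received buffer; http_status_lines is a hypothetical helper name.

```shell
# http_status_lines: hypothetical helper. Reads strace output on stdin and
# prints every HTTP status line that starts a quoted payload string.
http_status_lines() {
    # Match from the opening quote up to the first escape sequence (\r) or quote
    grep -oE '"HTTP/[0-9.]+ [0-9]{3}[^\\"]*' | sed 's/^"//'
}

# Usage (assumes strace and curl are installed, URL is a placeholder):
#   strace -e trace=network -s 1000 curl http://example.com/ 2>&1 | http_status_lines
```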

3.2. Who is the Performance Killer

A certain data processing script runs very slowly, but CPU and memory usage are low.

# -c statistics mode: shows the number of calls, the errors, and the time spent per system call
# -w counts wall-clock time per call (by default -c counts only system CPU time, which hides time spent waiting on I/O)
# A summary table is printed when the program exits
$ strace -c -w python process_data.py

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.87   45.234567      452345       100           fsync
  0.06    0.028901          28      1000           write
  0.03    0.012345          12      1000           read
  0.02    0.009876           9      1100           openat
  0.01    0.004567           4      1000           close
  0.01    0.003210           3      1000           fstat
------ ----------- ----------- --------- --------- ----------------
100.00   45.293466                  5200           total

# The table shows that 99.87% of the traced time is spent in fsync: only 100 calls, but each one takes about 450 milliseconds. Checking the code, the developer found that the script called fsync after every piece of data (forcing a synchronous flush to disk each time).
# Once the cause is known, the fix is easy: batch the flushes, for example one fsync per 1000 or 5000 pieces of data, tuned to the actual scenario, such as disk performance.
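
The batching idea can be sketched in shell. This is not the script from the case above, just an illustration: write_batched, its arguments, and the batch size are hypothetical, and it assumes a coreutils sync that accepts a file argument (8.24 or newer), falling back to a global sync otherwise.

```shell
# write_batched OUTFILE BATCH_SIZE   (hypothetical helper, records on stdin)
# Appends one record per line and flushes to disk once per full batch,
# plus once at the end for a trailing partial batch.
# Prints the number of flushes issued.
write_batched() (
    out=$1; batch=$2; n=0; flushes=0
    : > "$out"
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$out"
        n=$((n + 1))
        if [ $((n % batch)) -eq 0 ]; then
            sync "$out" 2>/dev/null || sync   # flush this batch
            flushes=$((flushes + 1))
        fi
    done
    if [ $((n % batch)) -ne 0 ]; then
        sync "$out" 2>/dev/null || sync       # flush the final partial batch
        flushes=$((flushes + 1))
    fi
    echo "$flushes"
)

# Usage: ./produce_records | write_batched /data/out.txt 1000
```

With a batch size of 1000, a run that previously issued one flush per record issues roughly one per thousand, which is the change the strace table pointed to.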

3.3. Java Service Starts Slowly

# -tt shows precise timestamps (microsecond level)
# -T shows the time spent on each system call
$ strace -tt -T java -jar app.jar

10:30:15.123456 execve("/usr/bin/java", ["java", "-jar", "app.jar"], ...) = 0
10:30:15.234567 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3 <0.000234>
10:30:15.234567 close(3) = 0 <0.000012>

# JVM initialization
10:30:15.345678 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 5 <0.000045>
10:30:15.345678 fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000023>
10:30:15.345678 connect(5, {sa_family=AF_INET, sin_port=htons(3306), 
                 sin_addr=inet_addr("10.0.0.100")}, 16) = -1 EINPROGRESS 
                 (Operation now in progress) <0.000123>

# Non-blocking connection waiting
10:30:15.345789 poll([{fd=5, events=POLLOUT}], 1, 30000) = 0 (Timeout) <30.000456>

# Check connection result
10:30:45.346245 getsockopt(5, SOL_SOCKET, SO_ERROR, [ETIMEDOUT], [4]) = 0 <0.000034>
10:30:45.346279 close(5) = 0 <0.000015>

# Application layer retry (new socket)
10:30:45.346294 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 5 <0.000041>
10:30:45.346335 connect(5, {sa_family=AF_INET, sin_port=htons(3306), 
                 sin_addr=inet_addr("10.0.0.100")}, 16) = -1 ETIMEDOUT 
                 (Connection timed out) <30.001234>
10:31:15.347569 close(5) = 0 <0.000013>

# The key line is this one:
# 10:30:45.346245 getsockopt(5, SOL_SOCKET, SO_ERROR, [ETIMEDOUT], [4]) = 0 <0.000034>
# getsockopt(...): reads a socket option
# SO_ERROR: queries the socket's pending error status
# [ETIMEDOUT]: the kernel reports a connection timeout
# This is the step that confirms the connection has really failed: port 3306 on 10.0.0.100 is unreachable, and each attempt costs 30 seconds of startup time.
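
To see at a glance which endpoints a process fails to reach, the failed connect calls can be summarized. A minimal sketch under a few assumptions: failed_connects is a hypothetical helper name, it handles IPv4 addresses only, it expects each call on a single line (as strace actually prints it), and it skips EINPROGRESS because that is the normal result of a non-blocking connect.

```shell
# failed_connects: hypothetical helper. Reads strace output on stdin and prints
# "COUNT ADDRESS:PORT ERRNO" for every failed IPv4 connect(), most frequent first.
failed_connects() {
    sed -n -e '/EINPROGRESS/d' \
        -e 's/.*connect(.*htons(\([0-9]*\)).*inet_addr("\([^"]*\)").* = -1 \([A-Z0-9]*\).*/\2:\1 \3/p' |
        sort | uniq -c | sort -rn
}

# Usage (assumes a log saved earlier with strace -tt -T -o trace.log ...):
#   failed_connects < trace.log
```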

Next, tcpdump can be used to continue troubleshooting

# For example, use tcpdump to verify the network behavior
$ tcpdump -i any host 10.0.0.100 and port 3306

For detailed tcpdump usage, please refer to my previous article “Operational Eye of Fire – tcpdump Packet Capture Practical Manual”

4. Common Parameter Quick Reference

# Basic tracing
$ strace ./program              # Trace a newly started program
$ strace -p 1234                # Trace a running process (PID=1234)
$ sudo strace -p 1234           # Tracing other users' processes requires root

# Filter system calls
$ strace -e trace=file          # Only see calls that take a file name (open, openat, stat, unlink...)
$ strace -e trace=desc          # Only see file-descriptor calls (read, write, close...)
$ strace -e trace=network       # Only see network operations (socket, connect, send, recv...)
$ strace -e trace=process       # Only see process operations (fork, exec, wait...)
$ strace -e trace=signal        # Only see signal operations (kill, signal...)
$ strace -e trace=openat,read   # Only see the listed system calls

# Time and performance analysis
$ strace -tt                    # Show timestamps (microsecond precision)
$ strace -T                     # Show the time spent on each call
$ strace -r                     # Show relative time (time of each call from the previous call)
$ strace -c                     # Statistics mode (summarize all calls)

# Output control
$ strace -s 1000                # String display length (default 32)
$ strace -o output.txt          # Output to file
$ strace -f                     # Trace child processes (follow forks)
$ strace -ff -o trace           # Each process outputs to a separate file (trace.PID)

# Combined usage (most common)
$ strace -tt -T -e trace=file -s 200 ./app
# Show timestamps + time spent + only see file operations + string length 200

5. Advanced Techniques

5.1. Trace Multi-process Programs

Many services fork multiple child processes (such as Nginx, Gunicorn); use -f to trace all of them:

# -f trace all child processes
# -o output to file (the terminal will be messy)
$ strace -f -o trace.log nginx

# View output, each line will have a PID in front
$ head trace.log

1234  execve("/usr/sbin/nginx", ["nginx"], ...) = 0
1234  socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
1234  bind(3, {sa_family=AF_INET, sin_port=htons(80), ...}, 16) = 0
1234  listen(3, 511) = 0
1234  clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|...) = 1235
1235  accept4(3, {sa_family=AF_INET, ...}, [16], ...) = 4

If there are many child processes, you can let each process output to a separate file:

# -ff each process one file
# Will generate trace.1234, trace.1235, trace.1236...
$ strace -ff -o trace gunicorn app:app
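
Before opening a large combined log, it helps to know which process produced most of it. A minimal sketch, assuming a log written with strace -f -o (each line starts with the PID); lines_per_pid is a hypothetical helper name.

```shell
# lines_per_pid: hypothetical helper. Counts trace lines per PID in a combined
# "strace -f -o FILE" log and prints "COUNT PID", busiest process first.
# Reads the files given as arguments, or stdin if none are given.
lines_per_pid() {
    awk '$1 ~ /^[0-9]+$/ { n[$1]++ } END { for (p in n) print n[p], p }' "$@" |
        sort -rn
}

# Usage: lines_per_pid trace.log
```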

5.2. Combine grep to Filter Output

strace output is usually large; use grep to quickly find the information you need:

# Only see failed system calls (return value is -1)
$ strace ./app 2>&1 | grep "= -1"

# Only see slow system calls (1 second or longer)
$ strace -T ./app 2>&1 | grep -E "<[1-9][0-9]*\.[0-9]+>$"

# See all operations on a certain file
$ strace -e trace=file ./app 2>&1 | grep "config.json"

# See all network connections
$ strace -e trace=network ./app 2>&1 | grep "connect"

# Count the occurrences of each error type (ENOENT, EACCES, ...)
$ strace ./app 2>&1 | grep -oE "= -1 E[A-Z0-9]+" | sort | uniq -c
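
A grep pattern can only approximate "slower than N seconds". For an arbitrary threshold such as 0.5 seconds, a numeric comparison in awk is easier. A minimal sketch, assuming output produced with -T (the duration is the last thing on the line); slower_than is a hypothetical helper name.

```shell
# slower_than MIN_SECONDS: hypothetical helper. Reads "strace -T" output on
# stdin and prints only the calls whose duration is at least MIN_SECONDS.
slower_than() {
    awk -v min="$1" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            # Strip the surrounding < > and force a numeric value
            d = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (d >= min) print
        }'
}

# Usage: strace -T ./app 2>&1 | slower_than 0.5
```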

5.3. Performance Impact and Precautions

strace slows the program down because every system call has to be intercepted and recorded. When using it on production systems:

# ❌ Bad: Trace all calls, the program will become very slow
$ strace -p 1234

# ✅ Good: Only trace the necessary calls to reduce overhead
$ strace -e trace=network -p 1234

# ✅ Good: Short-term tracing, stop after a few seconds
$ timeout 5 strace -p 1234

# ✅ Good: Use statistics mode, less overhead
$ strace -c -p 1234


Production environment usage recommendations:
# Keep tracing time within a few seconds to tens of seconds
# Use -e to only trace the necessary system calls
# Avoid tracing processes with high-frequency operations (like excessive logging)
# Prefer using -c statistics mode
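
These recommendations can be wrapped into one small helper so that nobody attaches an unbounded trace by accident. This is only a sketch: bounded_trace, its defaults (10 seconds, network calls), the DRY_RUN switch, and the /tmp output path are all assumptions for illustration.

```shell
# bounded_trace PID [SECONDS] [FILTER]   (hypothetical helper)
# Attaches strace to PID for a limited time with a syscall filter and writes
# the result to /tmp/trace.PID.log. With DRY_RUN=1 the command is only printed.
bounded_trace() (
    pid=$1
    secs=${2:-10}
    filter=${3:-network}
    set -- timeout "$secs" strace -e "trace=$filter" -o "/tmp/trace.$pid.log" -p "$pid"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
)

# Usage: check the command first, then run it (sudo may be needed)
#   DRY_RUN=1 bounded_trace 1234 5 network
#   bounded_trace 1234 5 network
```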

5.4. Trace Processes Inside Docker Containers

# Method 1: Find the real PID of the process inside the container on the host
$ docker top mycontainer

UID     PID     PPID    CMD
root    12345   12344   python opsnot.py

$ sudo strace -p 12345

# Method 2: Enter the PID namespace of the container
# .State.Pid is the host PID of the container's main process
$ docker inspect --format '{{.State.Pid}}' mycontainer
12345

# -m also enters the container's mount namespace, so strace must be installed inside the container
$ sudo nsenter -t 12345 -p -m strace -p 1
# Trace process 1 from the perspective inside the container
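
For Method 1, the host PIDs can be extracted from the docker top output instead of copied by hand. A minimal sketch, assuming the default table layout with a header row that contains a PID column; docker_top_pids is a hypothetical helper name.

```shell
# docker_top_pids: hypothetical helper. Reads `docker top` output on stdin,
# locates the PID column by its header name, and prints one host PID per line.
docker_top_pids() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "PID") col = i; next }
         col     { print $col }'
}

# Usage (assumes docker is installed and the container is running):
#   docker top mycontainer | docker_top_pids
```

Each printed PID can then be passed to sudo strace -p on the host.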

For detailed docker inspect usage, please refer to my previous article “Docker Inspect, A Command Worthy of Its Own Page”

5.5. Save and Analyze Trace Results

# Save complete trace records
$ strace -tt -T -f -s 500 -o trace.log ./app

# Analyze file operations: which files were accessed
$ grep -oE 'openat\([^,]+, "[^"]+"' trace.log | cut -d'"' -f2 | sort -u

# Analyze time spent: find the 10 slowest system calls (duration first, then the full line)
$ awk -F'<' '/<[0-9.]+>$/ {d = $NF; sub(/>/, "", d); print d, $0}' trace.log | sort -rn | head -10

# Count system call occurrences (with -f and -tt, the call name is the 3rd field)
$ awk '$3 ~ /^[a-z_0-9]+\(/ {sub(/\(.*/, "", $3); print $3}' trace.log | sort | uniq -c | sort -rn

# Analyze errors: count each error type
$ grep -oE "= -1 E[A-Z0-9]+" trace.log | sort | uniq -c
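
The saved log can also be turned into a per-call time summary after the fact, similar in spirit to the -c table. A minimal sketch, assuming the log was written with -f -tt -T so that every line has the shape "PID TIMESTAMP call(args) = ret <seconds>"; time_per_syscall is a hypothetical helper name, and unfinished or resumed lines are skipped.

```shell
# time_per_syscall: hypothetical helper. Sums the -T durations per syscall in a
# log saved with "strace -f -tt -T -o FILE" and prints
# "TOTAL_SECONDS CALLS NAME", slowest first.
time_per_syscall() {
    LC_ALL=C awk '
        $3 ~ /^[a-z_][a-z0-9_]*\(/ && match($0, /<[0-9]+\.[0-9]+>$/) {
            name = $3
            sub(/\(.*/, "", name)
            total[name] += substr($0, RSTART + 1, RLENGTH - 2)
            calls[name]++
        }
        END { for (s in total) printf "%.6f %d %s\n", total[s], calls[s], s }
    ' "$@" | sort -rn
}

# Usage: time_per_syscall trace.log | head -10
```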

6. Practical Tips

Program Starts Slowly?

# See which system call is causing the delay
$ strace -tt -T -o startup.log ./app
$ grep -E "<[0-9]+\.[0-9]+>$" startup.log | sort -t'<' -k2 -rn | head -10

File Not Found?

# See where the program is looking for files
$ strace -e trace=openat ./app 2>&1 | grep "ENOENT"
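
Programs often probe many search paths, so the ENOENT list can be long and repetitive. A minimal sketch that reduces it to distinct paths with counts; missing_paths is a hypothetical helper name, and it assumes the path is the first quoted string on the line.

```shell
# missing_paths: hypothetical helper. Reads strace output on stdin and prints
# "COUNT PATH" for every path that failed with ENOENT, most frequent first.
missing_paths() {
    grep 'ENOENT' | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' | sort | uniq -c | sort -rn
}

# Usage: strace -e trace=openat ./app 2>&1 | missing_paths
```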

Network Connection Failed?

# See DNS resolution, TCP connection, data sending and receiving
$ strace -e trace=network -s 500 ./app 2>&1 | grep -E "connect|send|recv"

Program Hangs?

# See what the running process is waiting for
# pgrep -o picks the oldest matching process, in case several processes match
$ sudo strace -p "$(pgrep -o myapp)"

Performance Issues?

# Count which system call takes the most time
$ strace -c ./app

7. When to Use strace & What Pitfalls to Avoid

Suitable Scenarios:

  • Program behavior is abnormal, and you don’t know what it is doing
  • Slow startup, hangs, or freezes
  • Configuration files not found, read failures
  • Network connection issues
  • Disk I/O problems
  • Quickly locate issues when there is no time to add logs

Pitfalls to Watch Out For:

7.1. Output is in stderr

# ❌ Wrong: this captures only the program's own output, not the strace output
$ strace ./app > output.txt

# ✅ Correct: Redirect stderr
$ strace ./app 2> trace.txt

# ✅ Correct: Or use the -o parameter
$ strace -o trace.txt ./app

7.2. Tracing Others’ Processes Requires Permissions

# ❌ Error: Insufficient permissions
$ strace -p 1234
strace: attach: ptrace(PTRACE_SEIZE, 1234): Operation not permitted

# ✅ Correct: Use sudo
$ sudo strace -p 1234

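7.3. System Calls May Differ Across Architectures, For Example:

# On x86_64, a program may still use the legacy open system call
$ strace ./app
open("file.txt", O_RDONLY) = 3

# arm64 has no open system call, so the same operation appears as openat
openat(AT_FDCWD, "file.txt", O_RDONLY) = 3

# When filtering, a class such as -e trace=file covers both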

7.4. Add -f to Trace Child Processes

# Trace all child processes
$ strace -f nginx

7.5. Why is the Output So Messy?

strace writes to stderr (the error output), where it mixes with the program’s own output:

# Method 1: Only save strace output
$ strace -o trace.log ./app

# Method 2: Separate program output and strace output
$ strace -o trace.log ./app > app.log 2> app.err

8. Conclusion

strace is a powerful tool for troubleshooting issues in Linux systems, capable of pinpointing many strange problems.

The next time you encounter an issue, don’t rush to look at the source code, add logs, or restart the service; first, give strace a try.

As a reminder, strace execution will have a performance impact, but short-term use is usually acceptable.

Recommendation: Validate the impact in a test environment before using it in production.

Additionally, strace is also widely used in containers, and I will write a dedicated article on strace usage in containers when I have the opportunity.

This article was compiled by opsnot.com; please credit the source when reprinting.
