Debugging a Lagging Mission-Critical Server

When a mission-critical server is lagging severely and a reboot is not an option, a systematic approach is essential to identify and resolve the issue without causing further disruption.

Initial Assessment and Resource Monitoring

First, I would gain an overview of the system's current state by examining key resource utilization metrics. This provides insight into potential bottlenecks.

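uptime: Check how long the server has been running and its 1-, 5-, and 15-minute load averages. The uptime can indicate whether the issue is related to a recent restart or deployment or to gradual resource degradation over time, and load averages well above the number of CPU cores suggest the system is overloaded.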
top or htop: These commands offer a real-time view of CPU usage, memory consumption, and running processes. I'd look for processes consuming a disproportionate amount of CPU or memory. htop is preferred for its better interactivity and visualization.
vmstat 1: Reports virtual memory statistics, which can indicate memory pressure or excessive swapping (sustained non-zero values in the si and so columns). The 1 argument specifies that the report should be updated every second.
iostat -xz 1: This command provides detailed disk I/O statistics. A %util value that stays close to 100% indicates that the disk is busy almost all the time and may be saturated, potentially causing the lag. The -x flag provides extended statistics, and -z omits devices that had no activity during the sample period.
df -h: Check disk space utilization. A full disk can severely impact performance.
free -m: Display the amount of free and used memory in the system, in mebibytes. The available column is a better estimate of usable memory than the free column, because it accounts for reclaimable page cache.
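As a rough illustration of reading the load averages, the sketch below compares the 1-minute load with the CPU count. The function name load_status is hypothetical, and treating a load above the CPU count as overloaded is only a rule of thumb. It assumes a Linux system that provides /proc/loadavg and nproc.

```shell
# Hypothetical helper: compare a load average with the number of CPUs.
# $1 = load average (e.g. 3.75), $2 = number of CPUs (e.g. 4)
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        # Adding 0 forces a numeric comparison.
        if (load + 0 > cpus + 0) print "overloaded"; else print "ok"
    }'
}

# Live usage on Linux (assumes /proc/loadavg and nproc are available):
#   load_status "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A result of overloaded only says that more tasks are runnable or blocked than there are CPUs; the other commands in this list are still needed to tell CPU contention from I/O waits.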

Identifying Resource-Intensive Processes

If top or htop reveals processes with high CPU or memory usage, I would investigate them further.

ps aux --sort=-%cpu or ps aux --sort=-%mem: List all processes sorted by CPU or memory usage, respectively. This provides a comprehensive view of resource consumption.
strace -p <pid>: Attach to a process using its PID and trace its system calls. This reveals what the process is doing, such as reading or writing files or making network connections, and can help pinpoint slow or problematic operations. Note that strace slows the traced process considerably, so on a production server I would attach only briefly.
lsof -p <pid>: List open files for a specific process. This can expose files being excessively accessed or locked.
pmap <pid>: Show the memory map of a process. Useful for understanding how memory is being allocated and identifying potential memory leaks.
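To narrow a long process list down to the likely culprits, something like the following filter could be used. It is a minimal sketch: hot_processes is a hypothetical name, the 50% default threshold is arbitrary, and it assumes the column order PID, %CPU, %MEM, COMMAND produced by the ps invocation shown in the comment.

```shell
# Hypothetical helper: print processes at or above a CPU threshold.
# Reads rows of "PID %CPU %MEM COMMAND" on stdin; the header row is skipped.
# $1 = minimum %CPU (defaults to 50, an arbitrary choice)
hot_processes() {
    awk -v min="${1:-50}" 'NR > 1 && $2 + 0 >= min + 0 { print $1, $2, $4 }'
}

# Live usage (procps ps on Linux):
#   ps -eo pid,pcpu,pmem,comm --sort=-pcpu | hot_processes 50
```

The PIDs it prints are the ones I would then pass to strace, lsof, or pmap. Only the first word of the command name is printed, which is enough to identify most processes.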

Network Bottleneck Analysis

If the application involves network communication, a network bottleneck could be the cause of the lag.

netstat -an | grep ESTABLISHED: Show established network connections. Look for a high number of connections to a specific IP address or port, which could indicate a problem with a remote server or a denial-of-service attack. On newer systems where netstat is not installed, ss -tn state established gives the same information.
tcpdump -i <interface> -n -s 0 -w capture.pcap: Capture network traffic on a specific interface. The -i option specifies the interface (e.g., eth0), -n prevents reverse DNS lookups, -s 0 captures the entire packet, and -w saves the capture to a file. The capture file can be analyzed using Wireshark or tshark to identify slow connections or packet loss.
iftop -i <interface>: Display a real-time bandwidth usage table. This shows which connections are consuming the most bandwidth.
mtr <hostname>: Combines traceroute and ping to display the network path to a host and the latency at each hop. This can help identify network segments with high latency or packet loss.
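Counting established connections per remote address makes a skewed distribution easier to spot than scrolling through raw netstat output. The sketch below assumes the Linux netstat column layout (protocol, two queue columns, local address, foreign address, state); conns_by_remote is a hypothetical name.

```shell
# Hypothetical helper: count ESTABLISHED connections per remote address.
# Reads netstat-style rows on stdin:
#   Proto Recv-Q Send-Q Local-Address Foreign-Address State
conns_by_remote() {
    awk '$6 == "ESTABLISHED" {
             remote = $5
             sub(/:[0-9]+$/, "", remote)   # strip the trailing :port
             count[remote]++
         }
         END { for (ip in count) print count[ip], ip }' | sort -rn
}

# Live usage:
#   netstat -ant | conns_by_remote | head -10
```

A single address holding a large share of the connections is a candidate for closer inspection with tcpdump or for rate limiting.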

File System Issues

Slow file system operations can also contribute to lag.

iotop: Similar to top, but displays real-time disk I/O usage by process. This helps identify processes that are heavily reading from or writing to disk.
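find / -type f -size +100M -print: Find large files on the system. Large log files or temporary files can fill up disk space and slow down the system. Scanning from / is itself I/O-heavy, so consider adding -xdev to stay on one file system, or start from a suspect directory such as /var/log.
du -hsx * | sort -rh | head -10: Summarize disk usage for each file and directory in the current directory and list the ten largest. Run from a suspect directory, this quickly identifies what is consuming the most disk space.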
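The du pipeline can be wrapped so that it takes the directory as an argument instead of depending on the current directory. This is a sketch only; largest_entries is a hypothetical name, and sizes are reported in KiB so that a plain numeric sort works on any du implementation.

```shell
# Hypothetical helper: list the N largest entries directly under a directory.
# $1 = directory (default: current directory), $2 = how many to show (default: 10)
largest_entries() {
    dir="${1:-.}"
    n="${2:-10}"
    # -s summarizes each argument, -x stays on one file system, -k reports KiB.
    du -sxk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}

# Live usage:
#   largest_entries /var/log 10
```

Hidden files are not matched by the * glob, so a directory that looks small here may still hold large dotfiles.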

Application-Specific Debugging

Depending on the application running on the server, application-specific debugging tools may be helpful.

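Java: Use tools like jstack, jmap, and VisualVM to analyze thread dumps, heap dumps, and memory usage. jstack and jmap run in the terminal; VisualVM is graphical and would have to connect from a workstation.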
Databases: Use database-specific profiling tools (e.g., slow query log in MySQL, pg_stat_statements in PostgreSQL) to identify slow queries.
Web Servers: Examine web server logs (e.g., Apache access.log and error.log, Nginx access.log and error.log) for errors or slow requests.
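For web server logs, a quick count of server-side errors is often a useful first signal. The sketch below assumes the common or combined access log format, where the HTTP status code is the ninth whitespace-separated field; a custom log format would need a different field number. The name count_5xx is hypothetical.

```shell
# Hypothetical helper: count responses with a 5xx status code.
# Assumes common/combined log format (status code in field 9).
# Reads the files given as arguments, or stdin if none are given.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }' "$@"
}

# Live usage (the path is an example; it varies by distribution):
#   count_5xx /var/log/nginx/access.log
#   tail -n 10000 /var/log/nginx/access.log | count_5xx
```

Running it on the last few thousand lines rather than the whole file keeps the check cheap on a server that is already struggling.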

Diagnosing the Root Cause

After gathering the above data, correlate the information to identify the root cause of the lag. For example:

High CPU usage + specific process: The process is likely the source of the problem. Analyze its behavior using strace or application-specific debugging tools.
High disk I/O + specific process: The process is likely reading or writing a large amount of data to disk. Investigate the process's disk access patterns.
High network usage + specific connection: The connection may be experiencing high latency or packet loss. Use tcpdump or mtr to analyze the network traffic.
Memory exhaustion: The server is running out of memory. Identify memory leaks or processes that are consuming excessive amounts of memory.
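Part of this correlation can be roughed out mechanically. The sketch below reads vmstat output and reports whether the samples show swapping or high I/O wait. It assumes the default procps vmstat layout, with si, so, and wa in columns 7, 8, and 16; the 20% I/O wait threshold and the name classify_vmstat are arbitrary choices for illustration.

```shell
# Hypothetical helper: classify vmstat samples read from stdin.
# Assumes default procps column order: si = 7, so = 8, wa = 16.
classify_vmstat() {
    awk '
        $1 !~ /^[0-9]+$/ { next }   # skip header lines
        !seen++ { next }            # first sample is the average since boot
        { swap += $7 + $8; wa += $16; n++ }
        END {
            if (n == 0) { print "no-data"; exit }
            if (swap > 0)          print "swapping"
            else if (wa / n > 20)  print "io-wait"   # 20% is an arbitrary threshold
            else                   print "no-memory-or-io-pressure"
        }'
}

# Live usage:
#   vmstat 1 6 | classify_vmstat
```

This is a coarse first pass; it does not replace reading the numbers, but it makes it easy to record a one-word summary alongside each snapshot.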

Potential Solutions (Without Rebooting)

Once the root cause is identified, attempt to mitigate the issue without rebooting the server.

Kill runaway processes: If a process is consuming excessive resources and is not critical, kill it using kill <pid>, which sends SIGTERM by default and allows the process to shut down gracefully. Use kill -9 <pid> (SIGKILL) only if it doesn't respond.
Restart problematic services: If a specific service is causing the problem, attempt to restart it using systemctl restart <service>. This may resolve temporary issues without requiring a full reboot.
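Increase memory limits: If a process is running out of memory, raise the limit that constrains it (carefully, to avoid impacting other processes). Note that ulimit only affects the current shell and processes started from it, and ulimit -m is not enforced on modern Linux kernels. To change the limits of a process that is already running, use prlimit --pid <pid>; if the limit comes from a systemd unit, systemctl set-property <service> MemoryMax=<new_limit> adjusts it at runtime on cgroup v2 systems.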
Optimize database queries: If slow database queries are the problem, optimize the queries by adding indexes or rewriting them.
Rate limiting: If there are too many requests coming in and overwhelming the server, implement rate limiting using tools like iptables or application-specific mechanisms.
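Clear disk space: If the disk is full, delete unnecessary files, such as old log files or temporary files. Be extremely careful when deleting files, and note that deleting a file a process still has open does not free its space until the process closes it; truncating the file (for example, truncate -s 0 <file>) frees the space immediately. lsof +L1 lists files that have been deleted but are still open.
Adjust kernel parameters: In some cases, adjusting kernel parameters (e.g., net.core.somaxconn, net.ipv4.tcp_tw_reuse) can improve network performance. Use sysctl -w <parameter>=<value> to adjust parameters. Note that these changes will be lost on reboot unless you add them to /etc/sysctl.conf or a file under /etc/sysctl.d/.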
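The SIGTERM-then-SIGKILL sequence for a runaway process can be scripted so that the grace period is applied consistently. This is a minimal sketch: stop_process is a hypothetical name and the 10-second default is an arbitrary choice.

```shell
# Hypothetical helper: ask a process to exit, and force it only if it ignores the request.
# $1 = PID, $2 = grace period in seconds (default: 10, an arbitrary choice)
stop_process() {
    pid="$1"
    grace="${2:-10}"
    # Fails if the process does not exist or we lack permission to signal it.
    kill -TERM "$pid" 2>/dev/null || return 1
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only checks that the PID can still be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still alive after the grace period: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
}

# Live usage:
#   stop_process 4242 15
```

A process stuck in uninterruptible I/O (state D in ps) will not exit even after SIGKILL until the I/O completes, so a process that survives this function points back at the disk or network storage.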

Conclusion

By systematically monitoring resources, identifying problematic processes, analyzing network traffic, and investigating file system issues, it's possible to diagnose and resolve the lag without resorting to a reboot. Each step provides a piece of the puzzle, ultimately leading to the root cause and appropriate solution. Remember to document all actions taken and their effects, as this information will be valuable for future troubleshooting and preventing recurrence of the issue.
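For the documentation step, even a one-line helper that timestamps each action is enough to reconstruct the timeline afterwards. The name log_action and the log file location in the usage comment are hypothetical.

```shell
# Hypothetical helper: append a UTC-timestamped note to an incident log.
# $1 = log file, remaining arguments = free-form note
log_action() {
    file="$1"
    shift
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
}

# Live usage (the path is an example):
#   log_action ~/incident.log "restarted nginx; load average dropped within a minute"
```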
