Blogged Answers: Debugging Tips

This is a post in the Blogged Answers series.


A few tips on debugging from experience, plus some stories

Last week's #devdiscuss discussion was on the topic of debugging. This is something I have a lot of opinions and advice on, so I replied with a quick list of my suggestions for better debugging. It's worth recording those in blog post form as well.

Note: Ben McCormick pointed me to a very similar post he wrote a few years back. It's excellent, and you should read it too: The Debugging Toolbox

Mark's Debugging Advice

1) Reproduce the Issue

Reproducing the problem is key. If you can't reproduce the issue, it's harder to figure out what's going on and iterate on fixes. More importantly, if you can't repro, you probably can't prove that a "fix" actually works.

2) Errors Provide Useful Information

Error messages and stack traces are important. @swyx quoted me in a Dev post last year on this topic: How to Google Your Errors . They usually give you a good starting point for where the problem is and why it's happening. Read the messages. Google them if needed.

3) Learn to Use Debuggers

Know your tools. Most debuggers work the same no matter what IDE / DevTools you're using. Understand breakpoints, watches, stepping. Use more advanced features like conditional breakpoints, or "watchpoints" to just log a message instead of stopping.

4) Print Logging is Important

Sometimes an IDE debugger won't cut it, especially if it's a remote system or multi-threaded environment. Good ol' console/print debugging is still important. It helps to have lots of existing logging with different levels that you can enable via config files.

5) Debug With a Plan

Go in with a hypothesis. Understand how the code should behave first. Look at the actual behavior. You can probably make educated guesses for where things are diverging. Focus on those areas. Don't randomly tweak. Change one thing at a time and compare results. Narrow down possible causes based on that.

6) Don't Be Afraid

Experience helps, but don't be afraid to dive in if you're new, because that's how you get experience :) Even if an issue looks like it's really complex, take some time to break it down into smaller pieces. Dave Ceddia and Gosha Arinich both wrote good advice on the topic of breaking down problems from the perspective of building a new project and learning new topics, but the same concept applies to debugging.

7) Know When To Keep Going, and When To Stop

Persistence is good, and it takes time to build up mental context on an issue, but sometimes you gotta take a break. Plenty of times I've left work, gone home, relaxed, and figured out what the issue was overnight or in the morning.

On that note, I'll recap a couple of my best debugging stories.

Tales of Debugging

The Case of the Wrong Case

A few years ago. I had a Python service that was doing a bunch of number crunching, so I used Cython to speed up the math-intensive portions of the code. Cython is a tool that can compile plain Python code to C so that it runs faster. If you add extra type declarations, it can generate more efficient C code, because it knows what each variable type is.

Cython has a few ways you can declare types. You can rename your .py files to .pyx and add types directly in the code; you can use decorators to declare the types; or you can add separate .pxd files that declare the types. I set it up using the side-by-side file approach, which lets you leave your original Python files alone. For this to work, the .pxd file with the types has to match the exact name of the original .py file. So, if I had FileA.py, I'd also have FileA.pxd.

I added type definitions for the Python functions and variables and referenced several C math functions from <math.h>. This service primarily runs on Linux, but I do most of my development on Windows. I did the initial changes on my local Windows machine, tested the code, and it seemed to be working great.

Unfortunately, when I went to run this on Linux, it compiled fine, but started throwing some weird exceptions buried deep down in the code. I hadn't actually written the math portion of this service myself, so I wasn't familiar with the actual logic. I traced the problem pretty far down into the code, and finally concluded that a particular function was trying to call the cos() trigonometry function, but the call was failing because cos didn't seem to exist. This didn't make any sense, because I was explicitly importing the C version of cos from <math.h> if using Cython, and importing the normal Python function from the math module otherwise.

Took me about 6 hours of tracing, investigating, and steadily growing confusion before I finally figured out what was going on: there was a single letter casing mismatch between a .py file and its corresponding .pxd file. One had a capital 'O', the other had a lower-case 'o'. This worked fine on Windows, because Windows file systems are case-insensitive (so filea.txt is considered the same as FileA.txt). But, Linux file systems are case-sensitive, so files can have the same name with different letter cases. Because of this, the Cython compiler never actually found the .pxdfile to match the original .py file, didn't get the right imports, and thus stuff eventually exploded. The fix was to change the capital 'O' to a lower-case 'o', and it worked.

Definitely a pretty high hours-of-work to number-of-characters-changed ratio :)

A Twisted Tale

Recently, we got reports that one of our Python services appeared to be locking up entirely. This particular service acts as a reverse proxy, forwarding HTTP requests to the rest of our system, and also handling authentication requests.

The main symptom was that none of our app would load in the browser. Oddly, there were no obvious errors in the logs, but that was because the logs for the service stopped entirely at about the time the symptoms occurred.

This Python service is built on the Twisted network framework. Twisted uses an "event loop", similar to how Javascript and Node.js work. It has a queue of events, and for each event, runs the associated code to completion. This is great if you have lots of IO that happens in the background. But, it also means that if a particular script takes a long time to run, it can block the entire event loop and keep any other code from running.

I was stumped by the issue for a while, but I had a couple guesses. My hypothesis was that one of the external calls this service was making was never getting a response, and thus blocking the event loop from continuing.

We cranked up the log levels for this service, restarted things, and waited for it to happen again. When it did, I took a look at the logs. Sure enough, the last thing the service was trying to do was make an external request to an LDAP server to query some user data.

I did some testing and confirmed that all of the external requests were happening on the main thread, same as the event loop (which I did by modifying the logging config to print out the name of the thread where the log statement occurred). I moved all those external requests to happen in background threads instead, and also fixed the timeout settings on the calls so that they don't block indefinitely.

The changes resolved the issue. We later found out that the issues did actually correspond to times when the LDAP server had been experiencing 100% CPU usage, thus confirming the external cause.

The moral of the story is that knowing how the underlying technologies behave allowed me to make an educated guess for the likely cause of the problem.

A Twisted Tale, part 2

Shortly after I originally published this post, a similar issue popped up. The same service was freezing in the test environment. When I looked, there were no errors in the logs, and we couldn't reproduce the problem. We scratched our heads, shrugged, and agreed to keep an eye on the problem.

A couple months went by, and suddenly the service locked up again. As before, there were no errors in the log. I guessed that it was a similar event loop blocking issue, but I couldn't figure out where and why that might be happening. After all, I'd already moved all external calls into background threads.

Fortunately, this time the error was happening consistently on one test server. I was able to crank up the debug log levels to max and deploy multiple test builds to that system. Each time, I added more logging statements to try to narrow down where in the code this might be happening.

Finally, after several days of editing -> building -> deploying -> waiting -> reading logs, I finally was able to pinpoint exactly where in the code the service was stopping. It was a chunk of code I'd written a year and a half earlier to handle multiple simultaneous login sessions from one user. If the user logged in one too many times, the oldest session would get kicked.

I'd tried to be "safe" in my code, and written code along the lines of while(len(userSessions) >= MAX_LOGINS) removeOldestSession(userSessions). But, under certain rare conditions, that session might not actually get removed. Yup - I'd written an infinite loop, which was indeed blocking the main thread. I wound up rewriting it as a for loop instead to only remove the necessary sessions.

Lessons from this experience:

  • having configurable logging is good
  • once again, reproducing the issue is key
  • sometimes you just gotta debug via print statements
  • don't ever write while loops like that again :)

Final Thoughts

Debugging is a combination of gathering information, problem solving, and knowing how to use the tools you have available. Articles and courses can teach you a lot of things, but debugging is something that I feel really requires practice and experience.

So, the next time you see your application spewing errors into the logs, or something just "doesn't work", DON'T PANIC! You can figure this out :)


This is a post in the Blogged Answers series. Other posts in this series:


Author Avatar

Mark Erikson

Collector of interesting links, answerer of questions