Blogged Answers: Debugging Tips

This is a post in the Blogged Answers series.


A few tips on debugging from experience, plus some stories

Last week's #devdiscuss discussion was on the topic of debugging. This is something I have a lot of opinions and advice on, so I replied with a quick list of my suggestions for better debugging. It's worth recording those in blog post form as well.

Note: Ben McCormick pointed me to a very similar post he wrote a few years back. It's excellent, and you should read it too: The Debugging Toolbox

Mark's Debugging Advice 🔗︎

1) Reproduce the Issue 🔗︎

Reproducing the problem is key. If you can't reproduce the issue, it's harder to figure out what's going on and iterate on fixes. More importantly, if you can't repro, you probably can't prove that a "fix" actually works.

2) Errors Provide Useful Information 🔗︎

Error messages and stack traces are important. @swyx quoted me in a Dev post last year on this topic: How to Google Your Errors. They usually give you a good starting point for where the problem is and why it's happening. Read the messages. Google them if needed.

3) Learn to Use Debuggers 🔗︎

Know your tools. Most debuggers work the same no matter what IDE / DevTools you're using. Understand breakpoints, watches, and stepping. Use more advanced features like conditional breakpoints that only pause when an expression is true, or "logpoints" that just log a message instead of stopping.
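
For instance, here's a minimal sketch of the conditional-breakpoint idea using Python's built-in pdb (the function and the "bad value" condition are made up for illustration):

```python
import pdb

def handle(order_total):
    print("processing order:", order_total)

def process(order_totals):
    for total in order_totals:
        # Poor man's conditional breakpoint: only drop into the debugger
        # when the suspicious case actually occurs, instead of pausing
        # on every single iteration.
        if total < 0:  # hypothetical "this should never happen" condition
            pdb.set_trace()
        handle(total)

process([10, 25, -3, 40])
```

Most IDE debuggers let you attach that same kind of condition to a breakpoint directly in the UI, with no code changes at all.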

4) Print Logging is Important 🔗︎

Sometimes an IDE debugger won't cut it, especially if it's a remote system or multi-threaded environment. Good ol' console/print debugging is still important. It helps to have lots of existing logging with different levels that you can enable via config files.
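
As a minimal sketch of that idea in Python (the service name and LOG_LEVEL variable are hypothetical), the standard logging module lets the level come from outside the code:

```python
import logging
import os

# Read the level from the environment (or a config file), so verbose
# logging can be turned on in a deployed system without a code change.
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

log = logging.getLogger("myservice")
log.debug("only visible when LOG_LEVEL=DEBUG")
log.info("normal operational message")
```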

5) Debug With a Plan 🔗︎

Go in with a hypothesis. Understand how the code should behave first. Look at the actual behavior. You can probably make educated guesses for where things are diverging. Focus on those areas. Don't randomly tweak. Change one thing at a time and compare results. Narrow down possible causes based on that.

6) Don't Be Afraid 🔗︎

Experience helps, but don't be afraid to dive in if you're new, because that's how you get experience :) Even if an issue looks really complex, take some time to break it down into smaller pieces. Dave Ceddia and Gosha Arinich have both written good advice on breaking down problems, from the perspective of building a new project and learning something new, but the same concept applies to debugging.

7) Know When To Keep Going, and When To Stop 🔗︎

Persistence is good, and it takes time to build up mental context on an issue, but sometimes you gotta take a break. Plenty of times I've left work, gone home, relaxed, and figured out what the issue was overnight or in the morning.

On that note, I'll recap a couple of my best debugging stories.

Tales of Debugging 🔗︎

The Case of the Wrong Case 🔗︎

A few years ago, I had a Python service that was doing a bunch of number crunching, so I used Cython to speed up the math-intensive portions of the code. Cython is a tool that can compile plain Python code to C so that it runs faster. If you add extra type declarations, it can generate more efficient C code, because it knows what each variable's type is.

Cython has a few ways you can declare types. You can rename your .py files to .pyx and add types directly in the code; you can use decorators to declare the types; or you can add separate .pxd files that declare the types. I set it up using the side-by-side file approach, which lets you leave your original Python files alone. For this to work, the .pxd file with the types has to match the exact name of the original .py file. So, if I had FileA.py, I'd also have FileA.pxd.
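
Here's a rough sketch of what that side-by-side pairing looks like, with made-up file and function names:

```python
# geometry.py -- plain Python; still runs unmodified without Cython
from math import sqrt

def hypotenuse(a, b):
    return sqrt(a * a + b * b)

# geometry.pxd -- the matching declaration file. The filename has to
# match geometry.py exactly. Its contents would look something like:
#
#   from libc.math cimport sqrt       # swap in C's sqrt() when compiled
#   cpdef double hypotenuse(double a, double b)
```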

I added type definitions for the Python functions and variables and referenced several C math functions from <math.h>. This service primarily runs on Linux, but I do most of my development on Windows. I did the initial changes on my local Windows machine, tested the code, and it seemed to be working great.

Unfortunately, when I went to run this on Linux, it compiled fine, but started throwing some weird exceptions buried deep down in the code. I hadn't actually written the math portion of this service myself, so I wasn't familiar with the actual logic. I traced the problem pretty far down into the code, and finally concluded that a particular function was trying to call the cos() trigonometry function, but the call was failing because cos didn't seem to exist. This didn't make any sense, because I was explicitly importing the C version of cos from <math.h> if using Cython, and importing the normal Python function from the math module otherwise.

Took me about 6 hours of tracing, investigating, and steadily growing confusion before I finally figured out what was going on: there was a single-letter casing mismatch between a .py file and its corresponding .pxd file. One had a capital 'O', the other had a lower-case 'o'. This worked fine on Windows, because Windows file systems are case-insensitive (filea.txt is considered the same as FileA.txt). But Linux file systems are case-sensitive, so those would be two different files. Because of this, the Cython compiler never matched the .pxd file to the original .py file, didn't get the right imports, and thus stuff eventually exploded. The fix was to change the capital 'O' to a lower-case 'o', and everything worked.

Definitely a pretty high hours-of-work to number-of-characters-changed ratio :)

A Twisted Tale 🔗︎

Recently, we got reports that one of our Python services appeared to be locking up entirely. This particular service acts as a reverse proxy, forwarding HTTP requests to the rest of our system, and also handling authentication requests.

The main symptom was that our app wouldn't load in the browser at all. Oddly, there were no obvious errors in the logs, but that was because the logs for the service stopped entirely at about the time the symptoms occurred.

This Python service is built on the Twisted network framework. Twisted uses an "event loop", similar to how JavaScript and Node.js work. It has a queue of events, and for each event, it runs the associated code to completion. This is great if you have lots of IO that happens in the background. But, it also means that if a particular piece of code takes a long time to run, it can block the entire event loop and keep any other code from running.
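
Here's a tiny made-up example (not our actual service) of how one slow synchronous call starves everything else in Twisted's event loop:

```python
import time
from twisted.internet import reactor, task

def tick():
    print("event loop is alive")

def block_everything():
    # A synchronous 10-second call on the reactor thread...
    time.sleep(10)

# ...keeps tick() -- and every other queued event -- from firing
# until it returns.
task.LoopingCall(tick).start(1.0)
reactor.callLater(3, block_everything)
reactor.callLater(20, reactor.stop)
reactor.run()
```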

I was stumped by the issue for a while, but I had a couple guesses. My hypothesis was that one of the external calls this service was making was never getting a response, and thus blocking the event loop from continuing.

We cranked up the log levels for this service, restarted things, and waited for it to happen again. When it did, I took a look at the logs. Sure enough, the last thing the service was trying to do was make an external request to an LDAP server to query some user data.

I did some testing and confirmed that all of the external requests were happening on the main thread, same as the event loop (which I verified by modifying the logging config to print out the name of the thread where each log statement occurred). I moved all those external requests into background threads instead, and also fixed the timeout settings on the calls so that they wouldn't block indefinitely.
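
Here's a rough sketch of both pieces, with a hypothetical stand-in for the real LDAP call. The %(threadName)s format field shows which thread emitted each log line, and Twisted's deferToThread pushes blocking work onto a thread pool so the reactor stays free:

```python
import logging
from twisted.internet import threads

# Logging the thread name is what confirmed the blocking calls were
# running on the main (reactor) thread.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
)

def query_ldap_blocking(username):
    # Stand-in for the real synchronous LDAP lookup (with a timeout!)
    logging.debug("querying LDAP for %s", username)
    return {"username": username}

def handle_auth_request(username):
    # deferToThread runs the blocking call in Twisted's thread pool and
    # returns a Deferred, so the event loop keeps processing other work.
    d = threads.deferToThread(query_ldap_blocking, username)
    d.addCallback(lambda user: logging.info("authenticated %s", user))
    return d
```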

The changes resolved the issue. We later found out that the issues did actually correspond to times when the LDAP server had been experiencing 100% CPU usage, thus confirming the external cause.

The moral of the story is that knowing how the underlying technologies behave allowed me to make an educated guess for the likely cause of the problem.

A Twisted Tale, part 2 🔗︎

Shortly after I originally published this post, a similar issue popped up. The same service was freezing in the test environment. When I looked, there were no errors in the logs, and we couldn't reproduce the problem. We scratched our heads, shrugged, and agreed to keep an eye on the problem.

A couple months went by, and suddenly the service locked up again. As before, there were no errors in the log. I guessed that it was a similar event loop blocking issue, but I couldn't figure out where and why that might be happening. After all, I'd already moved all external calls into background threads.

Fortunately, this time the error was happening consistently on one test server. I was able to crank up the debug log levels to max and deploy multiple test builds to that system. Each time, I added more logging statements to try to narrow down where in the code this might be happening.

Finally, after several days of editing -> building -> deploying -> waiting -> reading logs, I was able to pinpoint exactly where in the code the service was stopping. It was a chunk of code I'd written a year and a half earlier to handle multiple simultaneous login sessions from one user. If the user logged in one too many times, the oldest session would get kicked.

I'd tried to be "safe" in my code, and written code along the lines of while(len(userSessions) >= MAX_LOGINS) removeOldestSession(userSessions). But, under certain rare conditions, that session might not actually get removed. Yup - I'd written an infinite loop, which was indeed blocking the main thread. I wound up rewriting it as a for loop instead to only remove the necessary sessions.
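
Here's the shape of the bug and the fix, with hypothetical names:

```python
MAX_LOGINS = 3

def remove_oldest_session(sessions):
    # Hypothetical helper -- under rare conditions, the real one could
    # fail to remove anything, which is what made the while loop spin.
    if sessions:
        sessions.pop(0)

def enforce_session_limit(sessions):
    # Buggy version:
    #   while len(sessions) >= MAX_LOGINS:
    #       remove_oldest_session(sessions)
    # If a removal ever fails, the condition never changes, and the
    # loop spins forever on the main thread. A bounded for loop removes
    # at most the number of excess sessions and can't run forever:
    excess = len(sessions) - MAX_LOGINS + 1
    for _ in range(max(0, excess)):
        remove_oldest_session(sessions)
```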

Lessons from this experience:

  • having configurable logging is good
  • once again, reproducing the issue is key
  • sometimes you just gotta debug via print statements
  • don't ever write while loops like that again :)

Node What You're Doing 🔗︎

I recently moved to a different project, which is unfortunately based on the MEAN stack and AngularJS. When I joined, it was still being deployed using Node 6.x.

I updated our build tooling and deploy scripts to support running the app on Node 10.x instead, and deployed a new build with Node 10 to the test server for the first time. The app loaded and looked okay, but then we saw that users couldn't log in. The signin API request was returning an HTTP 422 error with no real message attached.

I'd recently updated the password for a service account related to our LDAP server, so I tried changing that again to see if it would fix things. No luck.

The previous build deployed with Node 6 ran fine and let users log in, so it clearly had to be due to something I'd done. So, it was time to dig in.

I SSHed into the test server and began hand-editing app source code files on disk, stopping and restarting the app server every time to look at the logs. (VS Code's "Remote SSH" extension was a big help here, as I could easily edit files remotely without having to open up Vim every time.)

I knew that our app server was definitely receiving the request, so the HTTP error had to be happening somewhere in the processing chain. To help figure out where it was breaking, I turned on Express's debug logging via a DEBUG=express:* environment variable. That logging prints out the sequence of middleware that get hit every time a request comes in, including the actual JS variable names of the middleware functions. (Note: please give your Express middleware functions meaningful names instead of just calling them middleware!)

After poking through several areas of the still mostly unfamiliar codebase, I finally traced it down to a "users auth controller" file in the server logic, which was calling Passport to authenticate via LDAP. I saw a spot in the code where it was receiving err as a callback, and added a console.log(err) line. That finally told me the error was related to SSL certificates somehow. Searching for the error turned up issues explaining that Node 8+ no longer accepts self-signed certificates by default, and that adding tlsOptions: {rejectUnauthorized: false} somewhere in the LDAP configuration would work around it.

Lessons learned:

  • Once again, print logging was critical
  • Finding the actual error message was the key to solving the problem
  • It would have been faster if the actual error was getting logged
  • Not being familiar with the codebase slowed me down
  • Being able to hand-edit code and restart things was useful, as was remote editing via VS Code
  • Knowing that it was almost definitely due to something I'd changed in this deploy helped me narrow things down

Expressly Forbidden 🔗︎

Also on this project, I found myself wanting to proxy requests from the existing app server to a new app server I was setting up side-by-side as a migration path, where a request to localhost:8080/some/path gets forwarded to localhost:3000/some/path.

I added a proxying middleware to Express in the existing app server, and this worked fine when hitting the app server on 8080 directly. But, if I then fired up the Create-React-App dev server and browsed to localhost:4000/some/path, it would get redirected to an index page that also failed to load right, instead of correctly passing the request on to the app server on 8080. (This was particularly bad because my goal was to add <iframe>s pointing to the proxied URLs to expose new content in an existing page.)

I knew the requests had to be failing inside of CRA's dev server somewhere. I'd actually dug through the guts of CRA and Webpack Dev Server's Express setup a year earlier, so I was unfortunately familiar with where this was going to end up.

Sure enough, I soon found myself with a dozen files from node_modules open in my editor, adding console.log() statements everywhere that an Express middleware was being added, and using DEBUG=express:* to see how the middleware were processing the request.

After several hours of debugging, I finally traced it down to the "history API fallback" middleware. Its purpose is to detect what look like hard page refreshes for routes that should be handled by client-side JS routing, and return an index page so the main JS bundle gets loaded and the router can kick in. If a request matches its criteria, it rewrites the request to /index.html instead. Unfortunately, my request met those criteria, even though it shouldn't have been rewritten.

I was already overriding the Create-React-App config, so I added further overrides to tell the middleware not to rewrite those URLs, which let the proxying work correctly.

Lessons learned:

  • Hand-editing files in node_modules to add logging statements is a completely valid debugging tactic
  • Isolating where the problem is likely occurring helps you focus your attention on where to add logging

Final Thoughts 🔗︎

Debugging is a combination of gathering information, problem solving, and knowing how to use the tools you have available. Articles and courses can teach you a lot of things, but debugging is something that I feel really requires practice and experience.

So, the next time you see your application spewing errors into the logs, or something just "doesn't work", DON'T PANIC! You can figure this out :)


This is a post in the Blogged Answers series.