How to Debug in Production Securely and Efficiently
Over the past decade, software complexity has evolved to a remarkable degree. Today we have distributed systems in the cloud; they run on VMs, containers, or serverless functions. And while the aggregation of third-party libraries has made it easier and faster to build production-ready systems in a minimal amount of time, we still can’t predict all possible system states in advance.
As a result, testing and debugging in production have become increasingly popular in recent years. Since it’s impossible to know all potential system states before a service goes live, observing the states that actually occur in production is a more practical approach. One technique for doing so is debugging in production.
What Is Debugging in Production?
Debugging in production, or live debugging, is the practice of gathering debug information, like the content of variables at a specific execution point, from a production system. Traditionally, debugging is carried out in a dedicated development or staging environment that’s inaccessible to end users. What makes live debugging feasible are so-called tracepoints. Tracepoints are the production equivalent of breakpoints in development. While a breakpoint pauses the running process when a specific line of code is reached, a tracepoint simply takes a snapshot of the system state at that line, and the process keeps running. You can use tracepoints after instrumenting an application with the Sidekick agent and connecting it to your code base.
This technique can accelerate the development process by allowing your team to look for bugs with real-world data, without having to set up a dedicated debug system first.
The downside is performance degradation when the debugging is running. To prevent this, you can configure conditional tracepoints that only take snapshots when a variable has a specific value.
You can set up a dedicated debugging variable that must be true to enable the snapshots, so others using the system don’t encounter performance issues. As a bonus, such conditions keep irrelevant data out of your snapshots. Alternatively, you can check for a specific user ID before triggering the snapshot, and ask that user to reproduce the problem they’ve encountered. Afterward, you can read the relevant snapshots.
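To make the idea concrete, here is a minimal sketch of a conditional tracepoint in plain Python. It is not Sidekick’s API: the `tracepoint` helper, the `SNAPSHOTS` list, the `apply_discount` function, and the `user-42` ID are all hypothetical names made up for illustration. The sketch only shows the principle that a snapshot is a copy of the variables, taken when a condition holds, while the request keeps running.

```python
import copy
from typing import Any, Callable, Dict, List

# Hypothetical in-memory store; a real agent would ship snapshots elsewhere.
SNAPSHOTS: List[Dict[str, Any]] = []


def tracepoint(
    name: str,
    variables: Dict[str, Any],
    condition: Callable[[Dict[str, Any]], bool],
) -> None:
    """Copy `variables` into SNAPSHOTS if `condition` holds; never stop the caller."""
    try:
        if condition(variables):
            SNAPSHOTS.append(
                {"tracepoint": name, "variables": copy.deepcopy(variables)}
            )
    except Exception:
        # A broken condition must not break the production request.
        pass


def apply_discount(user_id: str, total: float, debug: bool = False) -> float:
    discount = 0.1 if total > 100 else 0.0
    # Snapshot only for flagged requests or for one specific (hypothetical) user.
    tracepoint(
        "apply_discount:after_rate",
        {"user_id": user_id, "total": total, "discount": discount, "debug": debug},
        condition=lambda v: v["debug"] or v["user_id"] == "user-42",
    )
    return total * (1 - discount)


apply_discount("user-1", 50.0)               # condition false: no snapshot
apply_discount("user-42", 200.0)             # matches the user ID
apply_discount("user-7", 120.0, debug=True)  # matches the debug flag
print(len(SNAPSHOTS), "snapshots captured")
```

Because the condition is evaluated before anything is copied, requests that don’t match pay only for one cheap check.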
Debugging in 7 Steps
Now that you’re familiar with debugging in production, we’ll show you how to do it securely and efficiently. There are seven debugging steps to anchor you throughout the process.
1. Reproduce the Bug
Before you take any steps towards fixing a bug, make sure you can reliably reproduce it—otherwise you won’t have proof that you have indeed fixed the bug (and it didn’t just randomly stop appearing). The more you can automate reproduction, the better. Complex bugs can quickly become a headache if you have to jump through multiple hoops each time to see if they’re fixed or not.
Sidekick is a useful tool for extracting the system state at a specific tracepoint from production, so you know exactly how to set each variable later.
Sidekick also saves you from having to replicate the system locally. You can now debug what’s “out in the wild”: the actual system with real data. The production system with all its microservices is often too complex to rebuild locally, so certain bugs might be impossible to reproduce on your machine anyway.
2. Understand Stack and Other Traces
Stack traces—do they go from top to bottom or the other way around? Make sure you know how to read them to learn what steps were taken before the system crashed. They provide invaluable information about the last function calls.
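The reading direction depends on the runtime. Python, for example, prints tracebacks with the most recent call last, while the JVM prints the innermost frame first. The short sketch below uses Python’s standard `traceback` module to extract the call chain from a caught exception; `load_port` and `start_server` are hypothetical functions invented for the example.

```python
import traceback


def load_port(config: dict) -> int:
    return config["port"]  # raises KeyError when "port" is missing


def start_server(config: dict) -> int:
    return load_port(config)


call_chain = []
try:
    start_server({"host": "localhost"})
except KeyError as exc:
    # extract_tb returns frames ordered from the outermost call to the
    # innermost one, the same order Python prints them in.
    frames = traceback.extract_tb(exc.__traceback__)
    call_chain = [frame.name for frame in frames]
    print(" -> ".join(call_chain))
    print("exception raised in:", frames[-1].name, "at line", frames[-1].lineno)
```

The last entry in the chain is the function where the exception was raised; the entries before it show how execution got there.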
You should also understand Sidekick traces. While they’re generally more user-friendly than a CLI output, you should take some time to become familiar with them to make sure you know what you’re seeing.
Sidekick’s tracepoints make debugging your superpower. Tracepoints reveal not only what functions were called before the bug occurred but also the services that were part of that request. With tracepoints, you can see all the steps a request has taken through a distributed transaction.
3. Treat Your Bug to Unit Testing
Once you know how to reproduce a bug, you should consider the correct behavior—it might not be obvious. Think carefully about the result you want to achieve and write your unit tests accordingly.
This way you’ll know when you’ve reached your goal, and the tests will alert you if the bug ever comes back.
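As an illustration, suppose a snapshot showed that an empty cart crashed a price calculation. The sketch below pins the intended behavior down with Python’s standard `unittest` module. The function `average_item_price`, the bug, and the decision that an empty cart should yield 0.0 are all assumptions made up for this example.

```python
import unittest
from typing import List


def average_item_price(prices: List[float]) -> float:
    """Return the mean price, or 0.0 for an empty cart.

    Hypothetical bug: an earlier version divided by len(prices)
    unconditionally and raised ZeroDivisionError for empty carts.
    """
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


class AverageItemPriceRegressionTest(unittest.TestCase):
    def test_empty_cart_returns_zero(self):
        # Input mirrors the (hypothetical) production snapshot.
        self.assertEqual(average_item_price([]), 0.0)

    def test_non_empty_cart_is_unchanged(self):
        # Guard against the fix altering the normal case.
        self.assertAlmostEqual(average_item_price([10.0, 20.0]), 15.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    AverageItemPriceRegressionTest
)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Writing the failing test before the fix confirms that the test really captures the bug, and keeping it in the suite afterward turns it into a regression check.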
4. Know Your Error Codes
With HTTP status codes, it’s 400–499 for errors on the client’s side and 500–599 on the server’s side, right? Well, there’s a bit more to it. There can be errors on the client’s side caused by server problems, or the other way around.
Make sure you know the deeper implications of different errors. Lately, framework and programming language creators have been putting more effort into creating more descriptive error messages. However, with legacy software, you’re often stuck with just a number.
Take the time to learn more about the error codes of your development stack.
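If your stack speaks HTTP, a small helper can turn bare numbers into something readable. The sketch below relies on Python’s standard `http.HTTPStatus` enum; the helper name `describe_status` is made up for this example. Keep in mind that the range only says which side the code blames. A 429, for instance, is a client error class that often reflects a server-side rate limit, and a 502 may point to a service further upstream rather than the server you called.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard phrase and its error class."""
    try:
        label = f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        label = f"{code} (unregistered)"

    if 400 <= code <= 499:
        side = "client error"
    elif 500 <= code <= 599:
        side = "server error"
    else:
        side = "not an error class"
    return f"{label}: {side}"


for code in (404, 429, 502):
    print(describe_status(code))
```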
5. Use Search Engines
Google, DuckDuckGo, and Bing might be the first websites that come to mind, and rightly so: They’re all excellent resources for debugging, especially if you consult more than one search engine and get as many different results as possible.
But you should also use the integrated search engines of Stack Overflow, Wikipedia, and GitHub. Open-source software in particular can make it easy to find the line of code responsible for the error code you’ve just come across while debugging.
6. Ask Coworkers
If you happen to work in a team, always ask your fellow developers to help—they might know something you don’t. At the very least, it can help to have them sit with you for a bit while you’re debugging.
Looking too long at the same code can blind you to otherwise apparent issues, which is why bringing in a fresh pair of eyes is always a good idea.
The same holds true for snapshots from Sidekick, which often contain dozens of variables. A second opinion helps you spot the value that looks wrong.
7. Honor Your Success
Once you’ve finally fixed the bug, make sure to honor your work. Debugging can be stressful, especially if your bug hunt goes on for several days, so when it’s done, treat yourself. Take a break; go for a nice walk. Most importantly, talk about your experience: Write a blog post or just have a chat with a colleague.
There is no end to bugs, and it’s crucial to have the ability to process and learn from past successes and failures.
With Sidekick open-source, debugging in production is now easier than ever. Sidekick saves you time since you don’t have to replicate the system to debug. Plus, because you work with real data, you don’t have to contend with incomplete local replicas.
Tracepoints let you extract the system state without blocking the production system, while logpoints help you dynamically collect new logs, define conditions, change log levels in real time, and decide where you send your logs on the go. Get started with Sidekick open-source to keep the bugs at bay.