Smiling Bald Guy

Solving the world's problems using code (that honestly could use a few more unit tests).

Archive for February 2012

Instrumentation, your Application, and You, Part 1: Tracing

leave a comment »

I wish I could find a way to make instrumentation sexy; every time I try to explain to someone about why they should invest in instrumentation, I see them get the “deer in the headlights” look and try to find a way to get away from this crazy bald man. It’s just so BORING, so much WORK … and so invaluable.

Let’s talk about debugging for a moment. What are you going to do when your code leaves your development machine and goes out into the cruel real world full of users that want to hurt it? Mark my words: eventually, your code WILL fail. Software systems are rapidly approaching the complexity of biological systems, and the number of failure scenarios you didn’t consider vastly outnumber the ones that you did when you wrote your code, and if your product becomes as popular as you’re hoping it will, it’ll eventually wind up in some dark alley with some mean user. The question then becomes how gracefully your code will fail, and how much work you’ll have to do in order to fix it if it fails doing something important.

My philosophy is that attaching a debugger is well and good, but there are only a few scenarios where that’s going to be possible or acceptable. If you’re lucky enough to develop in-house Web or LOB applications, you have a lot of leeway – they can afford to have you (or one of their technical resources) go to a machine and attach WinDBG or CDB. However, commercial Enterprise applications, Internet-facing sites and client-side code is held to a much higher bar. With a live Internet-facing site, you can’t just hit a breakpoint and dig around as it brings the app to a screeching halt. Even using ADPlus to script a breakpoint that snaps a minidump and resumes is no good as it still blocks the app for an indeterminate (although admittedly pretty short) amount of time. With an Enterprise or client-side application, you generally cannot ask a customer to install a debugger and get that minidump for you.

So, if you can’t attach a debugger, how can you get the information you need to diagnose a bug in the field? Well, there are three primary ways and one interesting way. You can use trace logging, event logging, performance counters, and finally, WMI. Today, we’ll talk about tracing.

There are plenty of times where that hard-won minidump simply wasn’t enough to figure out why the application died; after all, not only is it rocket science to most developers, but all you’ll see stack traces and current values for members – not HOW you got to this point. I can’t tell you how many times I’ve looked at an exception and muttered “but that’s impossible!” A lot of the art debugging is understanding what path the app took to get to this crashing stack, and a good trace log can let your code do a lot of that detective work for you. Even minimal stack tracing (entry/exit) gives you a great deal of behavioral information, and this can be invaluable when you’re trying to figure out WHY you’re waiting on that Mutex again.

Tracing undeniably has a non-zero overhead associated with it, and this overhead takes two forms. First, there’s the behavioral/code overhead that goes with getting people in the mindset of leaving post-it notes in their code for the unfortunate soul who has to troubleshoot their misbehaving app. Second, there’s the performance overhead that goes along with adding more calls to the application. The first is manageable, and the second is fixable.

On the behavioral front, it’s very much an evangelism issue and it’s like selling insurance for your career. Once you have an app out in the field with zero instrumentation and some high-profile customer is screaming for your head because you STILL don’t know why your app is hitting this exception, you gain a deep appreciation for how much you would have loved to know what the response from that SOAP call was before you started processing it. As for the amount of work it requires from the developer, it all depends upon the class of application; there’s a dial you’ll want to turn for the amount of tracing code you write. When I’m wearing my “code monkey” hat, I want tracing at every decision block, every exception handler to dump the exception and logging the parameters to methods at “verbose” levels. When I’m wearing my “field command” hat for simple LOB applications, I want tracing before and after major events in the code like network/IO calls. But in practice, it’s going to be up to the developer to understand how their code might misbehave and to leave a trail of tracing breadcrumbs to help some poor soul noodle out why some button didn’t get enabled. I should also take this opportunity to point out that the “some poor soul” role may be you; I don’t know many developers who can recall exactly what the intent of their code was six months after they checked the code in.

Now, let’s talk about the performance overhead. Most developers don’t stop to think about the cost associated with constructing the call stack for a trace call, and when they see that their application’s runtime performance suffers because they have a bazillion trace calls that call ToString or box up a bunch of integers they’re quick to blame tracing in general, and yank it altogether (seen it happen first-hand, that). There’s no reason it has to be this expensive, though. If tracing isn’t enabled, it should just get out of the way – it’s only when we’re debugging that we want that overhead, and then we won’t care as much if we take a hit.

The .NET Framework can provide some assistance on this front. Since .NET 2.0 the tracing infrastructure has included a pair of concepts known as trace sources and source switches as a more flexible replacement for the original Trace class. With a little bit of work, you can use these to write code that only logs trace messages when you’re asking for them, and only on the sections of code you’re interested in. Rather than re-hash the fine work of others, I’ll include some links here:

How to: Use TraceSource and Filters with Trace Listeners: http://msdn.microsoft.com/en-us/library/ms228993.aspx

Unhandled Exceptions and Tracing in the .NET Framework 2.0: http://msdn.microsoft.com/en-us/magazine/cc163767.aspx

It’s important to note that, in order to minimize the impact of your tracing with TraceSource, you’ll still need to include some code of your own. For example, if you call ToString() on an exception when you log it, or you dump out the values of all of the class members in the log, you’re taking on a lot of overhead to generate a message that never gets delivered. In order to avoid this, make sure you wrap your trace calls with a conditional check against the source switch’s ShouldTrace method:

if (mySource.Switch.ShouldTrace(TraceEventType.Information))
{
mySource.TraceEvent(TraceEventType.Information, "Some expensive string");
}

In the world of unmanaged C++ code, this is made easier using a tool named TraceWPP. TraceWPP is a C preprocessor that wraps trace calls with conditional checks on trace level and component flags. It’s an unfortunate reality that all of the wonderful work the Windows Driver guys did on TraceWPP was lost in the managed world; Microsoft does not provide a tool for preprocessing managed code and wrapping trace calls with runtime-evaluated conditionals to shortcut the construction of the call stack if tracing is not enabled.
So, if we can’t use TraceWPP, what can we do, then? Well, one could certainly hand-generate those conditional blocks to detect whether tracing is enabled for a particular trace level, but those are quite annoying, and it’s easy to engage in copy-paste behavior that accidentally mismatches a check on a source or switch with the intended source/switch. However, I just happen to have a way to solve this problem with tools rather than labor. I have released a development tool named NTrace; this installs a preprocessor for your C# code that automagically builds these conditional blocks and, even better, writes your traces to Windows Event Tracing for Windows infrastructure. I won’t go into the details of NTrace in this post, but if you see the CodePlex site for NTrace here, we explain the motivation for using ETW instead of the default trace listener infrastructure.

Regardless of whether you use NTrace or the trace classes delivered by the .NET Framework, I hope that this post has given you more insight into why you should include tracing in your code and how you can improve the return on your investment by only logging when you ask for it. If you adhere to these guidelines, you should become the envy of your peers as you effortlessly diagnose crashes (and hopefully discover that it’s not your bug).

Written by Andy Hopper

February 6, 2012 at 9:43 am

Posted in .NET, Debugging