January 08, 2005

Automatically detect all memory leaks at compile time?

Yesterday at work I spent a couple of hours tracking down a troublesome memory leak. Memory leaks are my third least favorite kind of bug (race conditions and dangling pointers are worse). They are also annoying because it is next to impossible to write unit tests that verify memory does not leak. If I refactor some code, how am I to know if I have introduced a memory leak? This leads me to my question...

Is it possible to automatically detect all possible memory leaks at compile time? In general, and certainly for C-style programs, I think the answer is no. But if a program is structured in a certain way, the answer is definitely yes. This post was originally titled, "Using reference counting safely", and that is mostly what it is about.

But first, I think it is important to point out that this technique is, in a sense, inferior to garbage collection. Garbage collection makes memory leaks impossible (by memory leak I mean unreachable allocation). If memory leaks are impossible then there is no need to attempt to detect them! Of course, there are lots of trade-offs here. For example, garbage collection might be infeasible for performance reasons.

Anyway, I am going to assume that you all understand reference counting, its Achilles' heel (cyclic data structures) and how weak references can solve the problem (by breaking cycles). In summary, reference counting can be used safely if the data structure is acyclic or if cycles are broken using weak references (of one form or another).
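Since weak references carry the whole scheme, here is a minimal sketch of cycle-breaking in Python (the `Node` class is my own illustration; CPython happens to free acyclic garbage by reference counting, so disabling its cycle collector shows the pure refcounting behaviour):

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.parent = None    # weak back-pointer: does not keep parent alive
        self.children = []    # strong references: these keep children alive

    def add_child(self, child):
        child.parent = weakref.ref(self)   # break the parent<->child cycle
        self.children.append(child)

gc.disable()   # rely on reference counting alone, like the proposed manager

root = Node()
leaf = Node()
root.add_child(leaf)

probe = weakref.ref(leaf)
del root, leaf           # refcounts hit zero immediately, top-down
assert probe() is None   # leaf was freed with no cycle collector involved
gc.enable()
```

Had `parent` been a strong reference, `del root, leaf` would have left both objects alive forever under pure reference counting — exactly the shape the static analysis would reject.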

My idea is to have the compiler or a compile-time tool analyze the program's data structures and determine whether there is any possibility of a reference counting memory manager leaking. It would do this by looking for any cycles that are not broken by weak references in the program's reference graph. If the language supports reflection, this is quite straightforward. It is simply a case of constructing a reference graph, where the nodes are classes and the arcs are potential references between objects of those classes.

The algorithm would be something like this:

for each class in the program
{
    add a node for that class
}

for each class A in the program
{
    add a directed arc from A to each of its base classes

    for each non-static non-weak uninherited reference field in A
    {
        for each class B that inherits from the referenced class
        (including the referenced class itself)
        {
            add a directed arc from A to B
        }
    }
}

I think that's right, but I have to admit I have had to repost three times with bug fixes! Then it is a matter of searching for cycles in the generated graph. If there are no cycles, a reference counting memory manager is guaranteed not to leak.
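The pseudocode above can be made concrete. Below is a hypothetical Python model of it: real reflection is replaced by hand-written class metadata (the `bases`/`strong_refs` dict schema and the function names are my own inventions; weak fields are modelled simply by leaving them out of `strong_refs`):

```python
def leak_cycles(classes):
    """classes: name -> {"bases": [...], "strong_refs": [...]}.
    Every class mentioned must appear as a key. Returns True if a
    reference counting manager could leak (i.e. a cycle exists)."""
    # Direct subclass edges, then expand to all descendants.
    children = {c: set() for c in classes}
    for name, info in classes.items():
        for base in info["bases"]:
            children[base].add(name)

    def descendants(c):
        seen, stack = {c}, [c]    # includes c itself
        while stack:
            for sub in children[stack.pop()]:
                if sub not in seen:
                    seen.add(sub)
                    stack.append(sub)
        return seen

    # Reference graph: arcs to base classes, and to every possible
    # dynamic type of each strong reference field.
    graph = {c: set(info["bases"]) for c, info in classes.items()}
    for name, info in classes.items():
        for ref in info["strong_refs"]:
            graph[name].update(descendants(ref))

    # Depth-first search; reaching a grey (in-progress) node means a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in classes}

    def dfs(c):
        colour[c] = GREY
        for nxt in graph[c]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and dfs(nxt)):
                return True
        colour[c] = BLACK
        return False

    return any(colour[c] == WHITE and dfs(c) for c in classes)

# Acyclic ownership: safe for pure reference counting.
tree = {"Tree": {"bases": [], "strong_refs": ["Leaf"]},
        "Leaf": {"bases": [], "strong_refs": []}}
assert not leak_cycles(tree)

# A field typed as Shape may hold a Group, and Group points back at
# Drawing, so the cycle only appears once inheritance is expanded.
shapes = {"Drawing": {"bases": [], "strong_refs": ["Shape"]},
          "Shape":   {"bases": [], "strong_refs": []},
          "Group":   {"bases": ["Shape"], "strong_refs": ["Drawing"]}}
assert leak_cycles(shapes)
```

Note how the second example is rejected even though no single field's declared type closes the loop: the arcs to every subclass of a referenced class are what make the analysis sound in the presence of inheritance.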

I think this is an ideal replacement for garbage collection for my C# to C++ translator. Reference counting is more feasible on a console platform than garbage collection. It burns relatively little CPU time, accesses memory coherently and is naturally incremental.

Many existing C# programs will work unchanged, specifically those that do not use cyclic data structures. If a C# program does use cyclic data structures, I won't need to wait until it crashes to find it has a memory leak. The compiler will not even allow me to build the program and it will tell me which classes and fields are involved in the cycles I need to fix.

Conversely, the changes I need to make to C# programs will not prevent them from working correctly when run using a garbage collector, for example if they are also run on a .NET virtual machine.

Most importantly, I will have a 100% safe fully automatic memory manager, which will absolve me of the need to think about how to test for memory leaks and how to manually free memory.

Update:

I've been thinking about this some more. What the compile-time analysis does is ensure there is no way a program can represent a cyclic data structure (unless broken with weak references). Just because a program could represent such a structure does not mean it will actually create one. Unfortunately, because the halting problem is undecidable, it is not in general possible for a static analysis to determine what a program will actually do.

My concern is that the static analysis might be too constraining in some cases. My intuition says that in most cases it will be okay. What the constraints basically mean is a program's data must have a tree structure (actually a DAG). When it doesn't, use weak references.

But there might be a few cases where the static analysis is over-constraining, although I can't think of any examples. Just in case, I thought of an alternative way of making a reference counting manager safer, without any compile-time analysis.

One of my original problems was that I could not think of a way of testing whether my code had memory leaks using unit tests. I realize now that if I use a reference counting memory manager, it is easy to determine whether a memory leak has occurred. Just run a garbage collector! It would be possible to garbage collect after every unit test, but depending on the complexity and volume of unit tests, this might slow things down quite a lot. Maybe I could garbage collect once after running all the unit tests. If there is a memory leak then I can run them again, this time with a garbage collect after every test. Or maybe I can use some kind of binary search to narrow in on the unit test with the leak.
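The "garbage collect after every unit test" idea can be sketched in CPython, where acyclic garbage dies by reference counting and `gc.collect()` returns the number of cycle-trapped objects it had to free — so a nonzero count after a test is precisely a refcounting leak (the two test functions and `find_leaky` are my own illustrations):

```python
import gc

def clean_test():
    data = [1, 2, 3]       # acyclic: reference counting frees it on return

def leaky_test():
    node = []
    node.append(node)      # self-cycle: pure reference counting leaks this

def find_leaky(tests):
    gc.collect()           # flush any pre-existing garbage first
    leaky = []
    for test in tests:
        test()
        if gc.collect() > 0:          # cycle collector found work: a leak
            leaky.append(test.__name__)
    return leaky

assert find_leaky([clean_test, leaky_test]) == ["leaky_test"]
```

Collecting once per test is the slow-but-precise mode; collecting once per run and then re-running suspect halves, per the binary-search idea above, trades precision for speed.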

I'm on the fence as to which technique is better. The static analysis means I just don't need to think about memory leaks but I might be frustrated by the constraints. The unit tests combined with garbage collection approach means I don't have any additional constraints but I am able to write code that leaks memory. But at least I have an automated way of finding leaking code.


Comments:
I was always intrigued by the idea that a tool could prove or disprove correct memory handling on all execution paths. When limiting yourself to certain object allocation schemes in C++, this could be achievable. Still, I haven't seen any RAII (Resource Acquisition Is Initialization) conformance checker tools.
 
I worked on a project that did static analysis of C++ programs a while back. It extracted metadata and exposed it through a reflection API. One problem is that C++ is notoriously difficult to parse. But that is not the biggest issue, IMO.

The problem is that C++'s semantics are loose compared to other OO languages. If you want, you can use C++ like an untyped assembly language. An important part of the meaning of the program is lost in the C++ source code and is thus not available to a static analysis tool.

To use static analysis of C++ effectively, I think you need the analysis tool to ensure that C++ is used in a constrained way. This is what I had to do for my C++ reflection project. The funny thing is though, the constraints that I ended up applying made C++ look like Java or C# with an ugly syntax! So why not just use Java or C#?
 