Does Not Compute

  By Shamus   Apr 18, 2007   31 comments

A bit of code-blogging.

I spent some of yesterday afternoon debugging a very mundane bit of code. Nothing special. A bunch of memory space was getting trashed, as poorly managed memory is wont to do. I picked through the code and found some appaling bugs. (Which I had written.) The full code was a couple of pages long, but the thing boiled down to this:

void DoSomething (void *thing_in)
{

    void       *thing1;
    void       *thing2;

    thing1 = malloc (sizeof (*thing_in));
    //Do some stuff with thing_in and thing1.
    thing2 = malloc (thing1, some_new_size);
    free (thing1);
    //Do some other, unrelated stuff. Then...
    memcpy (thing2, thing1, sizeof (*thing1));
    //now that i just got done copying a free'd block of memory to thing2,
    //I'll do some more stuff with thing2.
    free (thing1);
    free (thing2);

}

Now, the actual code had all sorts of crazy stuff going on between these steps that obfuscated the mistake, but at the heart of it this is what I was doing: I was free()ing blocks of memory more than once, and copying stuff from recently unallocated blocks of memory. Now, this in and of itself is not very remarkable. If you lose your head you can easily make a blunder like this. The thing is, it should result in a crash. What I’m doing above is programatic sepuku.

But in this case the program did not crash. At least, not every time and certainly not right away. Not only did it not instantly die when it executed this code, but it more or less did what it was supposed to, or failed in innocuous ways that weren’t readily apparent. It called this function several dozen times a day, and yet the program ran on like this for a span of nine months without anyone noticing something was seriously amiss.

There was another program that would restart this one in the event that it crashed. I don’t know how often that happened. I’m afraid to look. I’m terrified that I’ll discover my program ran like this for all that time, like a man going on and living his life without noticing that a couple of his major internal organs have been removed. This is not supposed to be possible.

The whole thing gave me the willies, really.

LATER: Edited the example to better show what the problem was.


201131 comments. Hurry up and add yours before it becomes passé.


  1. karrde says:

    Weird….

    That would give me the creeps, if I discovered it in my own code.

  2. Skjalm says:

    It might have (sort of) worked because the original memory allocated to thing1 was never claimed by other parts of your program?

    Or maybe you just got lucky in which case congratulations are in order :-)

  3. mockware says:

    I had an interesting situation a long time ago where I made some changes to some code. The code base was ~100 Mb of text with 100s of source files. Anyway I included another header file and made a simple change but when the program ran, it crashed in another module. I would debug compile that module and the app would crash in a different module. After debug compiling a couple more, I had to resort to just looking at the assembly code where it was crashing. The assembly where it was crashing was a jumbled mess but luckily the debugger would also show an ASCII window of the section in memory and the ASCII was a normal sentence. Somehow I was writing data into the code segment of the program. It turns out some “genious” in the company had redefined malloc in the header file I included to return an indirect pointer and I was using the returned value as a real address.

  4. Pixy Misa says:

    It’s you! IT’S YOU!!!!

    There’s exactly this bug somewhere in one of the libraries I’m using in Minx, and it’s making the application drop dead every few days. I have three instances running behind a load-balancing proxy, and they all auto-restart, so this doesn’t cause much inconvenience, but it’s proving a bugger to track down.

  5. Dave says:

    And this is why I’m glad I’m a web guy, and so can write code in languages where memory management is the compiler’s job…

  6. Randolpho says:

    Ug… C? I thought I had left that ugly mess behind me.

    Your bug was realloc, but more importantly, I think your biggest mistake in general was deciding to use realloc in the first place. Better programmers than you and I have shoot off their big toe *many* many times with realloc. Most people justify the use of realloc as “saving” memory, being “efficient”. “Bah”, says I.

    realloc has one place, and one place only: dynamically resizing arrays a la vector (C++/STL) or ArrayList (Java/C#/.NET). When you use it, the first thing you do is check to see if the returned pointer is the same as the original pointer. If not, copy.

    Ok, C lesson over. Make with the DM of the Rings! :D

  7. Randolpho says:

    Hmm… on re-examination of the code, I see that your use of realloc was probably correct; you were trying to grow thing1 to be bigger, but created a separate pointer rather than re-using thing1. I blame a lack of morning caffeine and not having snickered at DM of the Rings.

    Also, ignore the last two sentences in my second-to-last paragraph. realloc copies the memory for you. Again, I blame uncaffeination.

    Wow, I’m really bitchy in the morning, aren’t I? :D

  8. Zack says:

    Actually the was a professor at MIT who was working on a project like this. He basically was making hardware (disks, memory) and libraries (sqrt, round) that could return garbage data. Usually they would work but sometimes they would just return a random value and the purpose of the study was to see how much garbage could be returned before crashes were expected.

    Much to his dismay the programs and hardware was able to handle an appalling ratio of garbage while just merrily chugging along. I will try to ask his student later if he can point me at any links or numbers. It was really funny results though.

  9. Shamus says:

    The realloc wasn’t as explicit as you see there. I was actually calling a function which I thought was just making a copy, but instead it was making a copy AND free()ing the original. (An effective realloc, really.) This explains both blunders (copying from free memory and freeing already free memory) but I still can’t fathom how the program got away with this repeatedly without blowing up.

  10. Zack says:

    The garbage data thing at MIT I mentioned before. I think it was a set of database drivers that would return garbage results instead of actually reading the disk. It was a project that was dealing with error recovery in distributed computing and generating large amounts of garbage data was one way to do that. But interestingly enough the application ran happily despite all the garbage inputs. The particular application was just fine spewing random results. They were expecting something a bit more catastrophic, but it ran fine for days.

  11. Yahzi says:

    9 months? You wuss… I think my record is ten years. A bug that should have resulted in a crash going unnoticed for TEN YEARS.

    Sheesh, kids these days…

    :D

  12. Yahzi says:

    “I think your biggest mistake in general was deciding to use realloc in the first place.”

    I’ve been coding for 20 years. I’ve written self-modifying assembly code, recursive algorithms, interpreters, machine control, FFTs, and user interfaces… but I’ve never found a use for “realloc.”

    Remember, kids, just because you can, doesn’t mean you should.

    :D :D :D

  13. Tim Keating says:

    It’s not all that surprising, really. Most memory managers don’t immediately trash a block when you free it — the pointer to that block just gets removed from the heap’s hash table. So unless you did some other allocations & writes, in a tight space like that the contents of your memory are likely to remain the same . . . which is, of course, what makes these kinds of bugs so insidious.

  14. Shinjin says:

    Been awhile since I’ve been this close to C, but if memory serves, realloc() (which it sounds like you were using insider your free/copy function) will only allocate a new block if there is not room at the existing location. So if NEW_SIZE happened to be smaller, thing2 == thing1. If NEW_SIZE happened to be bigger, thing2 is not necessarily going to be different from thing1.

    Also, your memcpy call as written above will only copy 4 bytes (sizeof a pointer). But that’s no biggie, since if realloc() did not have to allocate a new chunk of memory, thing2 already points to the same location as thing1 and the memcpy() is essentially doing nothing. (This could just have been an artifact of you compressing the code down to a readable example, but since it’s in your example I figured I’d comment.)

    As to freeing the memory twice, I’ve seen C compilers sometimes allow this without any ill effects. In our in-house code we got in the habit of using a custom ‘safe free’ function that would set the pointer to null after freeing it. That way if you tried to free() it again, it *would* crash.

    Bottom line is that it looks like the code likely only failed critically when the call to realloc() truly had to grab a new chunk and only 4 bytes was copied into the new location.

  15. bruce says:

    I have a secret suspicion that know one really knows how computers work, they just do (or don’t). It’s like magic. Sure you know the words of a few spells that can make things happen and some people know lots of complicated spells that can make lots of things happen, but no-one really understands why or how it all works. They might even understand some of the rules and logic behind it, but no-one knows exactly what force makes it happen.

    I mean why can a computer do the same thing every day and then suddenly not do it? Lack of magical energy, that’s why. Maybe somewhere, someone just stopped believing in computers…

  16. Kris says:

    Brilliant. I love it when computers do exactly what you tell them, and nothing of what you expect.

  17. wumpus says:

    Howdy,

    void *thing1;
    void *thing2;

    thing1 = malloc (sizeof (thing_in)); //Do some stuff with thing1.
    thing2 = realloc (thing1, NEW_SIZE);

    // thing2 == thing1, in all probability, and the contents of memory are almost certainly untouched
    // if NEW_SIZE

  18. Cadamar says:

    Managed code all the way, baby.
    C# forever!

    Self-modifying assemply code?!?! Good lord!
    That’s hard core, Yahzi.

  19. wumpus says:

    Odd. Apparently some of my formatting caused my post to be stomped upon. Something about using less than or greater than signs?

    Alex

  20. Steve says:

    He he. Back before gravity had peeled away from the weak nuclear force I worked in a place that had a program (assembler code) which only went wrong on alternate tuesdays if the operator wore tweed trousers but not if the other one had a t-shirt in any primary colour on. It didn’t so much error as produce mad results (sometimes).

    The programatical equivalent of the noise in your car that won’t happen if a mechanic is within earshot.

    Needless to say it took years to find the error, which turned out to be a “load and jump” that was loading a return address pointing to the DATA bank rather than the instruction bank. Since the data item pointed to was often a legally attainable target for a jump instruction the program just did the occasional mad thing most of the time. After some years the damn thing finally tried to jump outside its own boundaries, dumped at the error point and we caught it.

    Stupid computers.

    Steve.

  21. Shamus says:

    I really shouldn’t have used the loaded word “realloc” in my example code. It was allocating thing2 and copying the contents over. (Which I expected.) But then it unexpectedly (to me) free’d thing1. So, thing2 was always != thing1. Other variables would come and go during this process, so it is likely that thing1 space was getting trashed, at least a bit.

  22. Telas says:

    Bruce: “I have a secret suspicion that know one really knows how computers work, they just do (or don’t). It’s like magic.”

    And I thought I was the only heretic!

    (And I fix the damned things for a living.)

    Telas

  23. vince says:

    Finding bugs like this is what valgrind was invented for, though I guess that doesn’t help if you aren’t writing the code on Linux.

  24. Dave says:

    It kept running because the Cat in the Hat told thing one and thing two to jump back into his hat.

  25. David says:

    Most likely the reason youre program didnt always fail is that memory protection is at the block level, not byte. If that memory you freed is in the same block as memory your program still owns, the processor isn’t going to throw any fits when you try and access it. Of course, there is no guarantee that the contents are still valid, but since its all fairly close to each other (at least as shown) the chances that anything has tried to re-use that particular memory are pretty low.

    Lots of the debuggers and memory utilities out there put explicit marker bytes before/after their allocated regions just to help you find stuff like this, because otherwise its entirely possible for it to “just work”, at least most of the time, as you’ve experienced.

    And don’t ask about how I know all this, the experience is too painful to relate…

  26. Dirk says:

    You realise sizeof(void*) returns 4 (32-bit systems) or 8 (64-bit systems), and so on? You can’t sizeof a pointer and expect to get the size it was malloc’d with, and there’s also no standard way to retrieve it with library calls either. You have to keep that size around somehow. Make sure you’re keeping track of whether it refers to the number of bytes or number of items.

    Anyway, write in Python or Ruby. For the very little code which actually needs to be fast (and tends to be simple too), you can write THAT in C (or C++, or D, or whatever) and call it as an extension module or with ctypes or Ruby/DL. I’ve started doing things like this, and it even beats e.g. C# hands down because high level dynamic languages are a lot more expressive. And you get the major bonus of a completely open platform available on many operating systems, unlike C# where you’re totally stuck in Microsoft land (don’t anybody mention Mono until you actually try it – it’s far behind and quite incomplete).

    I write about a lot of stuff like that on my own new (and almost completely reader-less) blog. Available at http://spetskod.blogspot.com.

  27. Dirk says:

    Ah, thankyou, period of doom. Actually clickable URL is http://spetskod.blogspot.com

  28. Shamus says:

    dirk: Yeah – I just slapped that example code to show what I was doing. Should have had the * in there.

  29. Dirk says:

    That doesn’t help. Now it reduces to sizeof(void) which is undefined and shouldn’t even compile[1]. I repeat that there is no standard library way to do it, let alone a keyword for it. You have to carry around the size if you intend to use it again, or use a (non-standard) libc which lets you retrieve that information from hidden internal allocation table.

    [1]: GCC with the right pedantry flags will tell you the following:

    warning: invalid application of ‘sizeof’ to a void type

    It will evaluate to 1, which is not what you want.

    Really, write in Python. My C extensions never need to call malloc at all, I just let ctypes do it for me and pass the pointers to the C code. Free automatic memory management with the combined power of C and Python.

  30. Anonymous Coward says:

    No, you can’t say “it should result in a crash”. This code invokes what the C standards call “undefined behaviour” and exactly that it is: undefined. You can’t predict what it will do. It can crash but doesn’t have to, depending on your system’s memory management…

  31. Sam says:

    Hatred of debugging eventually drove me to Haskell. Not only does automatic memory management prevent the kind of bug described above, but the type-checker catches most of the bugs that used to trip me up all the time in Python. The pure-functional style (as weird as it can be) is another great way to avoid bugs.

    The only bugs I’ve managed to get past the compiler are semantic bugs (by which I mean the error can be described in terms of the problem domain). Even there, the pure-functional style is often a big help—I can take malfunctioning expressions apart until I’ve isolated the error.

Leave a Reply

Comments are moderated and may not be posted immediately. Required fields are marked *

*
*

Thanks for joining the discussion. Be nice, don't post angry, and enjoy yourself. This is supposed to be fun.

You can enclose spoilers in <strike> tags like so:
<strike>Darth Vader is Luke's father!</strike>

You can make things italics like this:
Can you imagine having Darth Vader as your <i>father</i>?

You can make things bold like this:
I'm <b>very</b> glad Darth Vader isn't my father.

You can make links like this:
I'm reading about <a href="http://en.wikipedia.org/wiki/Darth_Vader">Darth Vader</a> on Wikipedia!