Coding a Parser

By Shamus Posted Wednesday Dec 26, 2012

I spent Christmas day coding. That was fun. As part of my efforts to move to Linux, I decided to port some of my code. One of the first things I’ll want in the world of Linux is the ability to read .ini files.

I really like .ini files. You can put any program settings in them in any order. You can edit them with a text editor. You can read and write to them from within your program. This is much better than (say) storing all your settings in binary files. Some people are moving to XML these days, but XML files are massive overkill for a job like this, and end up being incredibly verbose and annoying for humans to read. For context, here is the .ini file for Project Frontier:

[Settings]
Treepos=3611.557861 2473.824219 8.986277
TreeSeed=654
 
[Animations]
Idle=idle
Running=run
Sprinting=run
Falling=fall
Jumping=fall
Swimming=fall
Floating=idle
Flying=idle
 
[Avatar]
CameraDistance=11.00
Angle=76.000000 0.000000 73.199890
Position=7806.417969 4053.380615 -0.217506
Flying=0
MouseSensitivity=1.00
InvertY=1
 
[Shaders]
ShaderNormal=standard.cg
ShaderTrees=trees.cg

The drawback of .ini files is that they’re basically a Windows thing. If you’re writing code targeted at Windows, then you can change one of the above settings like so:

//Specify section, then entry, then the new value, then the file being changed.
//I have no idea why the inputs are in this order. Wouldn't it make more sense
//to list the FILE first, so they're in ascending levels of specificity?
//But whatever....
WritePrivateProfileString ("Shaders", "ShaderNormal", "cellshade.cg", "frontier.ini");

Perhaps in your code you don’t put paragraph-sized editorials in the comments? I dunno. That’s your business.

The problem is that WritePrivateProfileString () is not available on other platforms. If you want to use ini files elsewhere, then you need to write your own version of these. This means writing a text parser. Text parsers can sometimes be kind of fiddly.

The big problem is that C++ is not the best language for juggling text. In fact, it might be one of the worst. Yes, you can use std::string. That’s not so bad. But sooner or later you’ll want to pass around char* strings. (If this never happens to you, then you’re probably a student or working in a really cutting-edge environment where you never have to interface with code that’s more than a few years old. Most of us, sooner or later, need to use a char*.) When that happens you’ll need to make a char*, allocate some memory, copy the std::string, and then forget to free () the memory later because screw this language, man. And even when you’re free to use std::string, you still wind up with situations where you can’t do simple things like this:

void HeaderName (const char* section_name)
{
  string   section_header;
 
  //This is not allowed:
  section_header = "[" + section_name + "]" + EOL;
 
  //Presumably we would do something more here. Maybe return a value or something?
  //Don't judge. YOU try writing plausible example code!
}

You either need to clutter up the code with a bunch of casting, or you have to break the operation up onto multiple lines. That’s fine. You’ll still get the job done and it’ll still work, but it can be done cleaner and faster in other languages, and with less memory-management pitfalls.
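For what it's worth, the failing line adds a string literal to a char pointer, and neither of those is a std::string, so there is no operator+ for the compiler to use. A minimal sketch of one workaround is to start the chain with a std::string. Two things here are assumptions, not part of the original snippet: EOL is taken to be a plain "\n", and the function returns the header instead of being void.

```cpp
#include <string>

// Assumption: a stand-in for the EOL constant used in the snippet above.
static const char* const EOL = "\n";

// Builds "[section_name]" followed by a line ending.
std::string HeaderName (const char* section_name)
{
  // Once the leftmost operand is a std::string, every later + resolves to
  // std::string's operator+, which accepts a const char* on its right side.
  return std::string ("[") + section_name + "]" + EOL;
}
```

Calling HeaderName ("Shaders") should hand back the text [Shaders] followed by a newline. It is still clunkier than the one-liner most scripting languages allow, which is rather the point.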

But the real horror show begins when you have to maintain a parser written in vanilla C. No std::string to help you. No new and delete to make allocating memory easier and less dangerous. When you build a parser in C, you are cutting down a tree with a utility knife.

In my Activeworlds days, I had to maintain such a parser. It had been written in 1994 or so using nothing but base C. Also, the material being parsed was particularly troublesome. It was a scripting language that…

…was often written by end users. The parser needed to be VERY forgiving of errors or it would drive people crazy and be too large of a challenge for the average person to learn. If possible, one bad command shouldn’t prevent subsequent commands from executing.
…was used in bulk. The users went around the world tagging objects with scripts. For example, the picture frame on the wall might change images when clicked, or bumping into an object would teleport the user to a new place. Scenes were frequently made of thousands of objects, and all of this was taking place in the late 90’s, before the days of ubiquitous graphics acceleration. Every CPU cycle cut directly into the framerate, so the parser needed to be as fast as possible.
…was designed to be as terse as possible. All of this data was flowing down the user’s 28.8k modem connection. Users, being users, will naturally expand to use ALL available space. Without limits end users will never stop packing in data. If you let them write a megabyte-long script, they will. And then they will copy & paste that script onto every object in the vicinity. So each object was limited to 256 bytes. With limits like this in place, every single byte matters. Certain fields need to be optional. There need to be abbreviations.

For example, one thing I added was the rotate command.

rotate [x] y [z] [sync OR nosync] [time=time] [loop OR noloop] [reset OR noreset] [wait=wait] [name=name] [smooth]

You could use this to make an object spin on all axes, like so:

rotate 10 20 30.

If I remember correctly, those numbers were expressed in RPM. That command would spin the object 10 times a minute on the X axis, 20 times a minute on the Y, and 30 times a minute on the Z. However, the vast majority of cases where the command is used are just to make an object spin on the Y axis. (Like a merry-go-round.) So instead of wasting FOUR WHOLE BYTES specifying zeroes for X and Z, you’re optionally allowed to specify only Y.

After the number(s), you have a list of directives that may or may not be used. Using the rotate command, you could make a door that swung open and closed. You could put hands on an analog clock that would always show the correct time. You could make a blade that swings back and forth, Skyrim dungeon-trap style. You could make a spinning helicopter blade.

That’s a lot of power in a very small command, with the downside being that it’s a pain in the ass to parse.
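To give a feel for where the pain comes from, here is a rough sketch of just the optional-number part of the command. This is not the Activeworlds code. RotateSpeeds and ParseRotateSpeeds are made-up names, and the rule for two numbers is a guess: the sketch simply rejects that case, because how the real parser handled it isn't described here.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: spin speeds per axis, plus a success flag.
struct RotateSpeeds { double x, y, z; bool ok; };

// Reads the leading numbers from the text that follows the word "rotate".
// One number means Y only; three numbers mean X Y Z.
// Assumption: any other count is rejected.
RotateSpeeds ParseRotateSpeeds (const std::string& args)
{
  std::istringstream in (args);
  std::vector<double> numbers;
  std::string word;
  while (numbers.size () < 3 && in >> word) {
    char* end = NULL;
    double value = std::strtod (word.c_str (), &end);
    // Stop at the first word that is not entirely a number (e.g. "sync").
    if (end == word.c_str () || *end != '\0')
      break;
    numbers.push_back (value);
  }
  RotateSpeeds result = { 0.0, 0.0, 0.0, false };
  if (numbers.size () == 1) {
    result.y = numbers[0];
    result.ok = true;
  } else if (numbers.size () == 3) {
    result.x = numbers[0];
    result.y = numbers[1];
    result.z = numbers[2];
    result.ok = true;
  }
  return result;
}
```

Even this small piece has to decide what a lone number means and where the numbers stop and the directives begin, and it hasn't touched sync, time=, loop, or any of the rest yet.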

Parsers usually work by taking a block of text and breaking it up by whitespace. (Whitespace is any non-printing character. Spaces, tabs, line feeds.) Extra whitespace is ignored, so rotate 5 is identical to rotate      5. This is how pages are parsed by your web browser. It’s how my C++ compiler reads my code. It’s how ini files, css files, and MS-DOS batch files are read.
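As a small illustration of the "break it up by whitespace" idea, here is a sketch using the standard library's stream extraction, which already skips any run of spaces, tabs, and line breaks between words. SplitWords is a made-up helper name, not something from any of the parsers discussed here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text into words, treating any run of whitespace as one separator.
std::vector<std::string> SplitWords (const std::string& text)
{
  std::istringstream in (text);
  std::vector<std::string> words;
  std::string word;
  // operator>> skips leading whitespace, then reads up to the next whitespace.
  while (in >> word)
    words.push_back (word);
  return words;
}
```

With this, a tidy "rotate 5" and a messy version padded with extra spaces and tabs both come out as the same two words. The catch is that it throws away whitespace you might have wanted to keep, such as the space in the middle of a value.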

Parsing code remains the only place I have ever seen the forbidden goto used in production code. In many parsing situations, once you’re past a word you can discard it, but in this parser there were situations where you needed to save values for later. So the parser would allocate a whole bunch of stuff, saving values while reading a command that could end in error at any time. I no longer have access to the codebase, and I’m pretty sure the whole thing was re-written at some point in the last few years, but back in the day I remember seeing something like:

void parse_things ()
{
  char *thing1, *thing2, *thing3;
 
  thing1 = thing2 = thing3 = NULL;
  //allocate thing 1
  thing1 = get_next_string ();//makes a copy
  if (whole bunch of complex tests prove that thing1 makes no sense)
    goto done;
  thing2 = get_next_string ();//makes a copy
  if (more tests to see if thing1 and thing 2 make no sense together)
    goto done;
  thing3 = get_next_string ();//makes a copy
  if (more tests to see if all three things fail to add up to a proper command)
    goto done;
  //Yay! The user managed to enter something coherent!
  do_thing (thing1, thing2, thing3);
  done:
  if (thing1)
    free (thing1);
  if (thing2)
    free (thing2);
  if (thing3)
    free (thing3);
}

Keep in mind that those if () statements are underselling the complexity at work here. For example, if thing2 ends in .jpg or .png then it is a texture name for sure. But if not, then it still MIGHT be a texture name, pending the contents of the other things. But those other tests shouldn’t be performed unless thing2 doesn’t have an extension. Later, if thing2 doesn’t have an extension and we figure out it must be a texture name, then we append .jpg. And so on. You get the idea. We’re talking about complex branching logic, done in stages, each of which allocates memory that needs to be released before we move on.

Now, the C/C++ orthodoxy is that goto is forbidden. Never shall ye use it, lest ye be subject to ridicule and possibly stoning. That’s mostly true, and even if you do happen to encounter a situation where goto looks like a good solution, most programmers will avoid using it because of social pressure. Using goto in production code is the equivalent of an electrician turning screws with a butterknife. It might get the job done, but it looks unprofessional.

But I could never see any good way to avoid the use of goto in the parse_things () code. Sure, you COULD get rid of it, but just about anything else would require more redundant lines of code, much deeper nesting, and more convoluted logic. Despite what they teach you in school, readability trumps orthodoxy. Most likely the presence of goto here signals that the back end of the parser itself (where it pulls text in) is perhaps built… oddly. I won’t diagnose it further to avoid getting into nitty-gritty details and publicly critiquing code written by other people while I was still making tacos for a living. But the point is, this goto was probably a symptom, not a problem.
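For comparison only: that code was plain C, where this option doesn't exist. In C++ the same staged shape can be sketched with std::string owning each copy, so an early return releases whatever has been made so far and the cleanup label goes away. Everything named here (WordSource, ParseThings, the empty-string checks standing in for the real tests) is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token source: hands out a copy of one word per call.
class WordSource
{
public:
  explicit WordSource (const std::vector<std::string>& words)
    : words_ (words), next_ (0) {}
  // Returns a copy of the next word, or an empty string when none remain.
  std::string Next ()
  {
    return next_ < words_.size () ? words_[next_++] : std::string ();
  }
private:
  std::vector<std::string> words_;
  std::size_t next_;
};

// Same staged shape as parse_things (), but each copy is a std::string.
// Every early return destroys the strings made so far automatically.
bool ParseThings (WordSource& source)
{
  std::string thing1 = source.Next ();
  if (thing1.empty ())   // placeholder for the real tests on thing1
    return false;
  std::string thing2 = source.Next ();
  if (thing2.empty ())   // placeholder for tests on thing1 and thing2 together
    return false;
  std::string thing3 = source.Next ();
  if (thing3.empty ())   // placeholder for tests on all three
    return false;
  // do_thing (thing1, thing2, thing3) would go here.
  return true;
}
```

This trades the goto for copies that clean up after themselves. It does nothing about the branching logic inside the tests, which was the genuinely hard part.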

Hang on. What were we talking about again?

Oh! ini files, right. Totally forgot. So yesterday I wrote some code to parse Windows-style ini files. I usually hate parser work. It’s very fussy and has lots of little pitfalls and hassles and headaches. What if this whitespace is part of the data? Is this file using Windows or Linux style line breaks? What if the user has odd spaces where they shouldn’t, like inside the [Section] header? What if some of the data has markup in it, like so:

[User]
LoginName=[email protected]
CharacterName=[[[masta killa187]]]
InvertMouse=0
RememberPassword=0

If done wrong, then Bob would break the settings file when he names his character “[[[masta killa187]]]”. He arguably deserves it, so maybe you can pass this off as a feature?

Parsers always seem simple at first, but even something as rudimentary as an ini file can have a lot of possible routes for chaos once you allow for the fact that they must contain the most dangerous form of information in computer science: User entered data.
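To make those pitfalls concrete, here is a minimal sketch of an ini reader. It is not the code I wrote, just an illustration. IniData, Trim, and ParseIni are made-up names, and it makes two assumptions worth stating: whitespace around keys, values, and section names is never meaningful, and a line only counts as a section header if it both starts with [ and ends with ].

```cpp
#include <map>
#include <sstream>
#include <string>

// data[section][key] = value
typedef std::map<std::string, std::map<std::string, std::string> > IniData;

// Removes spaces, tabs, and stray '\r' or '\n' from both ends of a string.
static std::string Trim (const std::string& s)
{
  const char* blanks = " \t\r\n";
  std::string::size_type first = s.find_first_not_of (blanks);
  if (first == std::string::npos)
    return std::string ();
  std::string::size_type last = s.find_last_not_of (blanks);
  return s.substr (first, last - first + 1);
}

// Parses ini text held in memory into an IniData map.
IniData ParseIni (const std::string& text)
{
  IniData data;
  std::string section;
  std::istringstream in (text);
  std::string line;
  while (std::getline (in, line)) {
    line = Trim (line);  // also drops the '\r' left by Windows line breaks
    if (line.empty ())
      continue;
    // Only a line that starts with '[' and ends with ']' is a section header.
    if (line[0] == '[' && line[line.size () - 1] == ']') {
      section = Trim (line.substr (1, line.size () - 2));
      continue;
    }
    // Split at the FIRST '=' so the value may contain '=' or brackets.
    std::string::size_type eq = line.find ('=');
    if (eq == std::string::npos)
      continue;  // not a key=value line; skip it rather than fail
    data[section][Trim (line.substr (0, eq))] = Trim (line.substr (eq + 1));
  }
  return data;
}
```

Because the header test looks at the start of the line and the split happens at the first equals sign, a value like [[[masta killa187]]] stays a value. If leading or trailing spaces in a value ever matter, the trimming rule is the first thing that would have to change.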

Like I said, I usually hate writing parsers. They’re boring drudge work. But for whatever reason I was in the mood for that kind of work yesterday. So… that’s what I did. Seemed to turn out okay.

Did I really just spend 2,000 words rambling on through digressions instead of getting to the point, which wasn’t interesting anyway? I might have. Anyway. How was your Christmas?

Shamus Young is a programmer, an author, and nearly a composer. He works on this site full time. If you'd like to support him, you can do so via Patreon or PayPal.

112 thoughts on “Coding a Parser”

Varil says:

Wednesday Dec 26, 2012 at 8:25 am

“So the parser would allocate a whole bunch of stuff, saving values while reading a command that could end in error at any time. I no longer have access to the codebase, and I'm pretty sure the whole thing was re-written at some point in the last few tears, but back in the day I remember seeing something like:”

Probably a typo, but it made me chuckle given the tone of the article.

Reply
Bodyless says:

Wednesday Dec 26, 2012 at 8:33 am

Shamous,

Ever thought about moving on to C#?

Reply
1. MadTinkerer says:
  
  Wednesday Dec 26, 2012 at 12:36 pm
  
  C#. On Linux. Even I know that’s just wrong.
  
  Reply
  1. jameswilddev says:
    
    Wednesday Dec 26, 2012 at 12:42 pm
    
    There’s been a C#/.NET implementation for Linux for years now:
    http://www.mono-project.com/Main_Page
    
    It even has a decent IDE: (though it’s VERY buggy)
    http://monodevelop.com/
    
    Reply
  2. Scerro says:
    
    Thursday Dec 27, 2012 at 1:56 am
    
    C# is really nice in MSVS2010.
    
    Well, as long as you know the Ctrl-Space shortcut.
    
    Also, I’m a new coder, so it’s not like I have deeply ingrained habits in C or C++ yet.
    
    Reply
2. Volfram says:
  
  Wednesday Dec 26, 2012 at 7:14 pm
  
  You realize at this point everyone’s been pressuring him to move to C# long enough he’s probably never going to use it, and the more you try to pitch it at him, the more he’s going to avoid it?
  
  It’s part of the reason I’ve avoided C#. That, and all the good features seem to already be available in D.
  
  Reply
  1. SKD says:
    
    Wednesday Dec 26, 2012 at 10:37 pm
    
    No,no,no. You are supposed to constantly ridicule C# or otherwise make it seem unattractive. That or tell him that it is impossible to create his pet projects using C#. Then he’ll begin using it to prove you wrong and satisfy his contradictory nature like a good little programmer.
    
    Reply
    1. Lovecrafter says:
      
      Thursday Dec 27, 2012 at 2:03 am
      
      So it’s like Josh with the Incinerator?
      
      Reply
Geoff 'Shivoa' Birch says:

Wednesday Dec 26, 2012 at 8:38 am

I would say arbitrary labels and short jumps are discouraged in code rather than the other flow control tools but are far from verboten in C. A switch/case is just a block of scope which is all about jumping to labels and many a sane C coder has used labels for sensible releasing of resources in code that can fail (similar to your example but with three labels and no if null checks as the different jump points define which pointers need to be freed). And once the compiler gets hold of your work then it all boils down to program flow by labels so the restrictions on where to use them in your own code is obviously only to avoid people blowing their leg off (and even then we provide bigger guns like longjumps for those interested in potentially removing limbs).

That ini file format seems to not require a full grammared parser but rather just a simple “[*]\n” “*=*\n” (eager consumption of = on second pattern) while iterating over each line of the file. (maybe also with “//*\n” to allow simple comment lines) and so something a RegEx lib could provide. When it comes time to consume a complex language then I highly recommend using someone else’s hand-written parser rather than building your own (manually can only be reasonably done for very simple stuff and via a tool like Yacc or ANTLR isn’t ideal for anything as complex as C or beyond due to quirks and so anything generated needs manual tweaking to work in the real world).

Edit: it is probably worth pointing out that the above label jumping comments are written about C. C++ has RAII and using smart pointers guarantee your objects clean up after themselves when falling out of scope.

Reply
1. Nyctef says:
  
  Wednesday Dec 26, 2012 at 8:48 am
  
  Actually, ANTLR has a rather nice (and dangerous) feature where it lets you write arbitrary code into the parser spec and change the AST it generates. This pretty much lets you get around the sort of quirks you see in more complicated languages since you can override the theoretical limitations of the parser (like doing extra lookahead in certain cases).
  
  Writing parsers manually is just fun, though ;)
  
  Reply
  1. Geoff 'Shivoa' Birch says:
    
    Wednesday Dec 26, 2012 at 9:12 am
    
    Sorry, yes, that was my point and I was unclear in my language. Whether it be post-generation edits or pre-, in the grammar decorated with manual code, the fully generated parser is likely to not be as simple as making a grammar and any of the major players (from Clang to GCC to small stuff built on Lex-Yacc like PyCParser) who get to C and above complexity are doing manual work to get their parser working, with or without a base code that is automatically generated from a tool. The sane option is normally to find someone else who has already got an industry grade parser working and integrate it into your project. If you’re making your own language with your own rules then writing an unambiguous grammar that is easy to parse and generating the parser is a plan but then you might just use something simple that can parse as a linear character traversal.
    
    Reply
Nyctef says:

Wednesday Dec 26, 2012 at 8:38 am

Of course, this being linux, there are already many different ways of doing this, with “do it yourself” always being a sensible suggestion :P

As for fixing the parse_things function, I think the only really nice way to write the function involves having some sort of automatic memory management for thing1/2/3 so that they get freed when they go out of scope.

Reply
Ryan says:

Wednesday Dec 26, 2012 at 8:43 am

I long-ago migrated to a preponderance of try/throw/catch/finally syntax for parsing. C++ supports it, however working in vanilla C requires some serious finessing in other ways, such as the setjmp/longjmp method found here. (The minimal set of #define commands which that site gives as an example to replicate try..catch block syntax is very nice.)

Reply
1. WJS says:
  
  Wednesday Mar 15, 2017 at 8:11 pm
  
  That’s exactly what I thought when I saw that. I mean, exceptions aren’t an option in plain C, obviously, but that structure seems like pretty much exactly what they are for.
  
  void parse_things ()
  {
      char *thing1, *thing2, *thing3;
  
      thing1 = thing2 = thing3 = NULL;
      //allocate thing 1
      try
      {
          thing1 = get_next_string ();//makes a copy
          if (whole bunch of complex tests prove that thing1 makes no sense)
              throw new NoSenseException();
          thing2 = get_next_string ();//makes a copy
          if (more tests to see if thing1 and thing 2 make no sense together)
              throw new NoSenseException();
          thing3 = get_next_string ();//makes a copy
          if (more tests to see if all three things fail to add up to a proper command)
              throw new NoSenseException();
          //Yay! The user managed to enter something coherent!
          do_thing (thing1, thing2, thing3);
      }
      catch(NoSenseException e){}
      finally
      {
          if (thing1)
              free (thing1);
          if (thing2)
              free (thing2);
          if (thing3)
              free (thing3);
      }
  }
  
  In the C-style, I would probably have named the label something like “error” or “fail”, rather than “done”, but that’s just preference.
  
  Reply
Infinitron says:

Wednesday Dec 26, 2012 at 8:44 am

Surely there are libraries available on the Internet that will do this for you.

Reply
1. Shamus says:
  
  Wednesday Dec 26, 2012 at 8:58 am
  
  This is always the problem with C stuff. There are probably many. Which ones are good? Is the interface good? Will it compile cleanly? Is it FULLY portable? (No sense in moving to something new if it’s just going to re-create the problem I’m trying to solve: That my code isn’t portable.) Is it properly documented? Does it work the way I want?
  
  In the time it would take you to search, download, integrate, compile, and review the candidates, you could have just done it yourself.
  
  Obviously there’s a threshold in there somewhere. Some stuff is large enough to justify the cost of code-shopping, but for me an ini parser seems like a good DIY job.
  
  Reply
  1. Neko says:
    
    Wednesday Dec 26, 2012 at 9:22 am
    
    And sometimes it’s just more fun to DIY ;)
    
    Reply
    1. Paul Spooner says:
      
      Wednesday Dec 26, 2012 at 12:24 pm
      
      I think the best is to DIY, and then find a really good one that someone else has written and compare. What trade offs did I make? Is my code more or less flexible or robust? Was there a clever technique that I missed?
      
      Looking for a solution before solving it is more difficult. Once I’ve gone through the exercise of making it work myself I have a much better idea of the problems that need addressing.
      
      Of course, Shamus has solved this problem himself several times already, so it sounds like the pure fun of it.
      
      Reply
  2. Steve C says:
    
    Wednesday Dec 26, 2012 at 12:26 pm
    
    Hmm maybe there is an opportunity there…
    An online service that creates and evaluates code. Kind of like sourceforge but with a focus on completed/orphaned code. Its goal would be to answer “Is this code good for my project?” in an organized and accurate way.
    
    Reply
    1. Primogenitor says:
      
      Friday Dec 28, 2012 at 5:25 am
      
      With modern cloud / virtual machine / continuous integration stuff, a site like that could automatically test every piece of submitted code on at least windows / linux / mac, probably with several different flavors of each (XP, 7, 8, Android, etc).
      
      Reply
2. Nick says:
  
  Wednesday Dec 26, 2012 at 9:00 am
  
  There are. libini, for example…
  
  http://sourceforge.net/projects/libini/
  
  Reply
Suraj Barkale says:

Wednesday Dec 26, 2012 at 9:33 am

An application framework like QT really shines in this regard.

Reply
1. krellen says:
  
  Wednesday Dec 26, 2012 at 9:59 am
  
  Don’t get him started on QT. We’ll be here all week!
  
  Reply
  1. Muspel says:
    
    Wednesday Dec 26, 2012 at 10:22 am
    
    I’m looking forward to the day that Rutskarn catches on to this and starts trolling him about it on Spoiler Warning.
    
    Reply
krellen says:

Wednesday Dec 26, 2012 at 9:58 am

My Tuesday was great, Shamus. Thanks for asking.

I spent the day watching cartoon super heroes.

Reply
1. Shamus says:
  
  Wednesday Dec 26, 2012 at 11:09 am
  
  IT’S A TUESDAY MIRACLE!
  
  Reply
  1. tengokujin says:
    
    Wednesday Dec 26, 2012 at 8:51 pm
    
    Happy Kwanzaa (Day 1)!
    
    Post-I-forgot-to-click-the-box edit: I’m such an infrequent commenter, I keep forgetting to check the box. >.>
    
    Reply
Vipermagi says:

Wednesday Dec 26, 2012 at 10:10 am

It’s Christmas today (we get two Christmas days, ’cause we’re lazy like that), and all I’m doing is pointing out spelling errors on the Internet.
“[..]a door that swung open an closed. You could put hands an an analog clock[..]”

Reply
Kian says:

Wednesday Dec 26, 2012 at 10:24 am

I had to comment on something. About the char * when working with strings, calling std::string::data() will return a char * to the string’s memory, and calling std::string::c_string() (or something like that) will return a c-style string of the data contained in the string (null ended char array). You should be able to use std::string when working with c functions with no issue, so long as you’re not planning to give ownership of the memory to the function you called.

Reply
1. Kian says:
  
  Wednesday Dec 26, 2012 at 10:43 am
  
  Also, while you can’t just add char arrays, the += operator works fine. That’s to say, the following all works:
  
  #include <iostream>
  #include <string>
  
  std::string charToString( char const * charArray )
  {
  std::string charString;
  
  charString += 'F' ;
  charString += charArray ;
  charString += " bar";
  
  return charString;
  }
  
  int main(int argc, char** argv)
  {
  std::cout << charToString( "oo" ) << std::endl;
  
  std::string testString("oo");
  std::cout << charToString( testString.data() ) << std::endl;
  std::cout << charToString( testString.c_str() ) << std::endl;
  
  return 0;
  }
  
  Output is
  
  Foo bar
  Foo bar
  Foo bar
  
  Now, I won't argue that working with strings is in any way simple, but it's not as bad as the post makes it out to be either.
  
  Reply
  1. Shamus says:
    
    Wednesday Dec 26, 2012 at 11:27 am
    
    I actually mentioned this very thing:
    
    “You either need to clutter up the code with a bunch of casting, or you have to break the operation up onto multiple lines. That’s fine. You’ll still get the job done and it’ll still work, but it can be done cleaner and faster in other languages, and with less memory-management pitfalls.”
    
    The point isn’t that C/C++ is a bad language or anything. It’s just something you have to deal with in C and not other languages.
    
    Reply
    1. Kian says:
      
      Wednesday Dec 26, 2012 at 12:34 pm
      
      Ah, I missed the bit about breaking it onto multiple lines. Hmm, could you overload the << operator to do it though? Let me check…
      
      Aha! Here:
      
      std::string& operator<<( std::string & charString, char const * charArray )
      {
      charString += charArray;
      return charString;
      }
      
      std::string& operator<<( std::string & charString, char const & charRef )
      {
      charString += charRef;
      return charString;
      }
      
      std::string charToString( char const * charArray )
      {
      std::string charString;
      
      charString << 'F' << charArray << " bar";
      
      return charString;
      }
      
      That works (with the same main function as before).
      
      I should keep these. I've always been annoyed about how I can't use that with std::strings.
      
      Still, I agree that any language that requires you to write your own code to get the syntax to behave the way you want it to is not easy to use. And this is a non-standard syntax that might confuse people.
      
      Reply
      1. Shamus says:
        
        Wednesday Dec 26, 2012 at 1:05 pm
        
        Nice!
        
        Kind of makes me wonder why that wasn’t done originally. Or perhaps there is already a way to do this that I’ve overlooked? Either way, it’s interesting.
        
        Reply
        
        Volfram says:
        
        Wednesday Dec 26, 2012 at 7:27 pm
        
        D uses the ~ as a concatenation operator. Kian’s block of code above would be:
        
        import std.stdio;
        import std.string;
        
        string charToString( string charArray )
        {
        string charString = "F"~charArray~" bar";
        return charString;
        //you could also just return "F"~charArray~" bar";
        
        }
        
        int main(string[] args)
        {
        writeln(charToString("oo"));
        
        //Arrays of all types are primitive data types in D.
        //D also has a built-in alias of "immutable(char)[]" to "string"
        
        string testString = "oo";
        writeln(charToString(testString));
        //D strings are NOT null-terminated!
        //String.toStringz() returns a proper C-format string with the null terminator added to the end.
        writeln(charToString(testString.toStringz()));
        //I really don't advise that last one, you've got a null right in the middle of your D string. I don't know how D handles that.
        //If you pass it to a C library, everything after the null will get dropped.
        
        return 0;
        }
        
        Output is
        
        Foo bar
        Foo bar
        Foo bar
        
        The above block of code should be compile-ready. There used to be an online compiler on the D website, but it got taken down…
        
        Reply
      2. Jacob Albano says:
        
        Wednesday Dec 26, 2012 at 1:06 pm
        
        [EDIT] Aw man, the comment I was replying to got deleted. :/ It was something about overloading operator<< to concatenate strings and char arrays, to which I said…
        
        You can do this with std::stringstream ( http://www.cplusplus.com/reference/sstream/stringstream/ )
        
        std::stringstream stream;
        
        stream << "Some text!" << 100 << 13.37;
        
        std::string result = stream.str();
        
        It's not a one-line operation, but it is part of the standard library.
        
        Reply
        
        Kian says:
        
        Wednesday Dec 26, 2012 at 1:32 pm
        
        Not deleted, but I see it now in the moderation queue? Odd.
        
        Yeah, I’ve never been a fan of stringstream. It’s like they saw string was missing some utility, so they created a whole other class to provide the functionality string was missing and make it compatible with streams.
        
        I just want a string class that does what a string class is supposed to do.
        
        Reply
2. Shamus says:
  
  Wednesday Dec 26, 2012 at 11:14 am
  
  The problem is, that data is cast as const, and a lot of old code (most of it, in my experience) is not written with a lot of strict const use. Thus, you’ll have to malloc () yourself a copy.
  
  You can say this is a problem with the language or the way people use it. But either way, it’s a problem you’ll have to face in C, and not in (say) another more string-friendly language.
  
  Reply
  1. Kian says:
    
    Wednesday Dec 26, 2012 at 12:40 pm
    
    Yeah, there’s no way around that. You can’t write to a string unless you pass the string along.
    
    And this is just basic string operations like adding letters to a string. There’s no standard support for switching to upper or lower case, the encoding and locale stuff is an arcane mess that is inscrutable to even experienced coders, there’s no easy way to turn a number into a string, etc. You can code it, but it all has to be hand crafted or you have to hunt for libraries that do it the way you want it to work.
    
    Reply
    1. Tino Didriksen says:
      
      Wednesday Dec 26, 2012 at 8:14 pm
      
      1: “You can't write to a string unless you pass the string along.”
      False. It is guaranteed safe to write to the char* you get from doing e.g. &str[0], so long as you stay within bounds (as with any buffer).
      
      For example:
      std::string buffer(32, 0); // 32 byte buffer
      size_t newlen = sprintf(&buffer[0], "%ld", (long)time(NULL)); // store the actual written length for later
      buffer.resize(newlen); // set string.size() to match the number of used bytes
      
      And you can similarly pass a suitably sized std::string or std::vector to any C API that writes to a given char*. Note that string.reserve() is not what you want here – you need the bytes initialized, not merely reserved.
      
      2: “There's no standard support for switching to upper or lower case”
      C++ inherited C’s tolower() and toupper() so those exist, but I agree they are not exactly great. To change a whole string you need to use std::transform() – and that page even uses toupper() in the example, so won’t duplicate here.
      
      3: “the encoding and locale stuff is an arcane mess that is inscrutable to even experienced coders”
      Agreed. This is where Boost Locale and ICU enter the picture. ICU is the de facto Unicode handling library – basically everyone but Microsoft uses and ships ICU, including Boost.Locale. If ICU can’t handle your locale or encoding need in a cross-platform manner, your need is not of this world. What ICU lacks is a polished interface, which is what Boost.Locale aims to provide.
      
      4: “there's no easy way to turn a number into a string”
      See #1 for the quite simple sprintf() way, which can be turned into 2 lines. There’s also the excellent Boost.Lexical_cast. Or if your compiler is new enough, use std::to_string().
      
      Btw, if you aren’t already using Boost, today is the day to start. It’s a collection of high quality cross-platform libraries that play very nice with the C++ Standard Library. Most of it is header-only, meaning zero runtime dependencies.
      
      Also, if you’re serious about learning the inner workings of C++, hang out in ##C++ and ##C++-general on Freenode IRC for a few months (yes, two #s).
      
      Reply
      1. Peter H. Coffin says:
        
        Thursday Dec 27, 2012 at 8:19 am
        
        On #2, you also start running into a whole *boatload* of character issues almost instantly. Is a byte in a string with value 0xC2 something you can lowercase? Sometimes, sometimes not. You need character encoding awareness to tell, and once you’ve got awareness of character sets then you’ve got to decide which ones you’re going to support or which multi-byte string manipulation library you’re going to use, because that IS a project that’s too big to roll your own in any sane amount of time.
        
        Reply
      2. Kian says:
        
        Thursday Dec 27, 2012 at 8:33 am
        
        About the first point, even if you are working with the buffer you need to know it belongs to a std::string. If you’re working with legacy code that didn’t know about it and tried to realloc the pointer you passed, you’d get in trouble. And if it didn’t return the length because modifying the string was a side effect you’d also have issues.
        
        So you can avoid using the string interface and work directly on the buffer, but your code needs to know you’re dealing with a string. By that point, you might as well be passing the string along instead of a pointer to the buffer.
        
        That was what I meant. Of course, this is a problem with c++ having to remain c-compatible, and using c code.
        
        We’re pretty much in agreement over the rest. I didn’t mean to imply any of these operations weren’t possible, they just require more knowledge than you can expect from a complete newbie. String manipulation was complicated in c, and c++ did little to improve on it; reflected by the fact that you need to return to c functions for many of those operations.
        
        The fact that at this stage of the development of the language they’re still adding string functions for operations such as to_string shows the string interface was very poor to begin with.
        
        Still, thanks for the links, wasn’t aware of some of those. I don’t generally have much need for string manipulation myself.
        
        Reply
  2. Ross Smith says:
    
    Wednesday Dec 26, 2012 at 7:30 pm
    
    The way to pass strings to C APIs that expect a (non-const) char* is to use vector<char> for temporary storage:
```
string oldtext = "Hello";
vector<char> temp(oldtext.begin(), oldtext.end());
call_old_c_api_that_modifies_a_string(temp.data(), temp.size());
string newtext(temp.begin(), temp.end());
```
    Also, I tend to use this a lot, because it makes constructing containers easier:
```
#define BOUNDS(x) x.begin(), x.end()
```
    or if you want to get more flexible:
```
#define BOUNDS(x) boost::begin(x), boost::end(x)
```
    Also also, the easiest way to build up a string from a collection of strings and char*s without causing problems if the first few elements are char*s is to start off with an empty string:
```
const char* a = "Hello";
char b = ' ';
const char* c = "world";
string d = "!\n";
auto text = string() + a + b + c + d;
```
    Reply
    1. Kian says:
      
      Wednesday Dec 26, 2012 at 9:08 pm
      
      Oh, I didn’t know about that last bit with the empty string. It’s a step in the right direction, although does it work if you do
      
      std::string text = std::string() + a + b + "Hello!";
      
      That second “Hello!” is not a const char * but a const char[7] (six characters plus the terminating null), which is slightly different. If not, then you’re still having to break it up into several lines.
      
      Reply
      1. Ross Smith says:
        
        Wednesday Dec 26, 2012 at 9:14 pm
        
        Yes, that works just the same.
        
        Reply
Daemian Lucifer says:

Wednesday Dec 26, 2012 at 10:40 am

Well, the official Christmas of my country hasn’t arrived yet, but I spent my day with some sweet Baldur’s Gating.

Oh, I think I also spent a few hours with a girl, but that’s not important.

Reply
Roger Hågensen says:

Wednesday Dec 26, 2012 at 10:50 am

XML is nice; the issue is that you have to be careful to stay very consistent, or it can get messy real quick.

Like in these two examples (for some reason the html code tag is not supported in these comments so here is a pastebin posting instead) http://pastebin.com/v5tmWU3L

Reply
1. Jacob Albano says:
  
  Thursday Dec 27, 2012 at 9:02 am
  
  I disagree with your statement that XML is nice. I admit it’s a good fit for markup, but using it for configuration makes me want to die. I switched to Lua a while ago: http://pastebin.com/DNBy10q1
  
  Reply
  1. Roger Hågensen says:
    
    Thursday Dec 27, 2012 at 12:41 pm
    
    And how much bloat does Lua add, just to use that?
    Even PHP has a similar ability to write/read its data structures. Why not JSON instead then, same thing right?
    
    In a program of mine I use XML config files, but there is also a skinning feature, and that one does use .ini so people can easily make (and tweak) skin features/settings.
    The XML settings (program prefs) are not meant for hand editing (though due to it being XML you can easily look at it in a browser or edit it if needed).
    
    The XML code is used elsewhere in the program for other features, so the XML prefs file is a bonus in that regard.
    
    If you are using LUA serialization then I assume that your program also uses LUA scripts for program scripting?
    
    Reply
    1. Jacob Albano says:
      
      Thursday Dec 27, 2012 at 2:39 pm
      
      Nope, I’m using Angelscript for scripting. My Lua serialization library is a 225kb static library. Compared to PugiXML, which I used to use, that’s almost 100kb less.
      
      Reply
Deoxy says:

Wednesday Dec 26, 2012 at 10:55 am

I also LOVE ini files – SO SO SO SO SO much easier than a lot of other ways to do it.

But I do have to say, even in that case, I wouldn’t have used GOTO, but I have an extremely high nesting tolerance.

Also, don’t most newer languages have a command that basically replaces “goto end” without using a goto? Break, exit sub, stuff like that. So yeah, a lot of people agree with you on it (enough to make commands just to replace it).

Fun post – thanks!

Reply
1. Bryan says:
  
  Wednesday Dec 26, 2012 at 11:10 am
  
  The problem is, you can’t just return. (Which is what C uses to jump out of the current function.) You have to release the memory you allocated, otherwise you’ll leak a bunch of it every time you see something invalid — this is not likely to work out all that well when being run continuously on a server somewhere. :-)
  
  C++ *sometimes* allows you to try/finally, but not always (there are lots of projects that compile with exceptions disabled, and lots of companies that forbid them in code that’s used there, mostly because exceptions tend to break cleanup in weird ways: http://blogs.msdn.com/b/oldnewthing/archive/2005/01/14/352949.aspx), so that may not be usable either.
  
  This type of code is actually used all over the place in the Linux kernel, as well; there are a surprising number of places where you have to pass up a (perhaps-translated) error code to your caller, while still cleaning up allocations or other state changes you’ve made.
  
  Reply
  1. silver Harloe says:
    
    Thursday Dec 27, 2012 at 5:08 am
    
    functional purists would probably insist the code look like this:
    
    void parse_things ()
    {
    char *thing1, *thing2, *thing3;
    
    thing1 = thing2 = thing3 = NULL;
    parse_things_backend( thing1, thing2, thing3 );
    if (thing1)
    free (thing1);
    if (thing2)
    free (thing2);
    if (thing3)
    free (thing3);
    }
    
    // I don’t remember the pass-by-reference syntax, but that’s what these
    // arguments should be:
    void parse_things_backend ( *thing1, *thing2, *thing3 )
    {
    
    //allocate thing 1
    thing1 = get_next_string ();//makes a copy
    if (whole bunch of complex tests prove that thing1 makes no sense)
    return;
    thing2 = get_next_string ();//makes a copy
    if (more tests to see if thing1 and thing 2 make no sense together)
    return;
    thing3 = get_next_string ();//makes a copy
    if (more tests to see if all three things fail to add up to a proper command)
    return;
    //Yay! The user managed to enter something coherent!
    do_thing (thing1, thing2, thing3);
    }
    
    But functional purists tend to live in a fantasy world where function calls are free instead of time consuming.
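
    For what it’s worth, a compilable sketch of the same split could look like this in C++, where char *& is the pass-by-reference spelling. The input array and the copy_string helper are stand-ins invented for this example (the post’s get_next_string() and do_thing() aren’t shown), and a NULL input plays the part of “this makes no sense”.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: malloc'd copy of a C string (NULL in, NULL out).
char *copy_string(const char *src)
{
    if (!src) return NULL;
    char *copy = static_cast<char *>(std::malloc(std::strlen(src) + 1));
    if (copy) std::strcpy(copy, src);
    return copy;
}

// Does the real work. Each pointer is passed by reference so the caller can
// see, and later free, whatever was allocated before an early return.
// Returns true only if all three pieces were read.
bool parse_things_backend(const char *const input[3],
                          char *&thing1, char *&thing2, char *&thing3)
{
    thing1 = copy_string(input[0]);
    if (!thing1) return false;   // stand-in for "thing1 makes no sense"
    thing2 = copy_string(input[1]);
    if (!thing2) return false;
    thing3 = copy_string(input[2]);
    if (!thing3) return false;
    return true;                 // the post would call do_thing() here
}

// Owns the cleanup: one place, no goto, runs on every path.
bool parse_things(const char *const input[3])
{
    char *thing1 = NULL, *thing2 = NULL, *thing3 = NULL;
    bool ok = parse_things_backend(input, thing1, thing2, thing3);
    std::free(thing1);   // free(NULL) is a no-op, so no if-guards are needed
    std::free(thing2);
    std::free(thing3);
    return ok;
}
```

    The early returns in the backend never leak, because the wrapper frees whatever was handed back regardless of how far the backend got.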
    
    Reply
  2. fscan says:
    
    Thursday Dec 27, 2012 at 4:21 pm
    
    In c++, especially c++11 you should almost NEVER explicitly delete stuff. Always manage resources with object lifetime (scope). When there is no specialized class (vector, string) use unique_ptr.
    
    void func()
    {
    unique_ptr<ConstructorMayThrow> data1(new ConstructorMayThrow());
    unique_ptr<Bar> data2(foo::getBar()); //may return null (element type assumed to be Bar)
    unique_ptr<ConstructorMayThrow> data3;
    
    if (data2 && data2->blub())
    data3.reset(new ConstructorMayThrow());
    
    do_something(data1, data2, data3);
    
    //no need to delete, destructor of unique_ptr takes care of it
    //as soon as it goes out of scope
    //NEVER LEAKS
    }
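
    A small self-contained sketch of the “never leaks” claim, using a made-up Widget type and a live-object counter (both assumptions for the sake of the demo):

```cpp
#include <memory>
#include <stdexcept>

static int g_live_widgets = 0;  // counts Widget objects currently alive

// Hypothetical resource whose constructor may throw.
struct Widget {
    explicit Widget(bool fail)
    {
        if (fail) throw std::runtime_error("construction failed");
        ++g_live_widgets;
    }
    ~Widget() { --g_live_widgets; }
};

// Returns false if the second allocation throws. Either way, every Widget
// that was successfully built has been destroyed by the time we return.
bool build_two(bool second_fails)
{
    try {
        std::unique_ptr<Widget> first(new Widget(false));
        std::unique_ptr<Widget> second(new Widget(second_fails));
        return true;
    } catch (const std::runtime_error &) {
        return false;   // `first` was already destroyed during unwinding
    }
}
```

    After either build_two(false) or build_two(true), g_live_widgets is back to zero, with no delete written anywhere.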
    
    Reply
    1. Bryan says:
      
      Thursday Dec 27, 2012 at 6:47 pm
      
      Yes, but, C. :-)
      
      In C++, the unique_ptr (and its associated move semantics to handle ownership) solve this pretty well *assuming* the only thing you care about is destroying objects. See the linked article, where he’s talking about adding a reference to the created object into something else (in that case, it’s a list of notification icons; in the earlier article he links to, it’s a reference to a player object stored in the team).
      
      Sometimes you have to clean up some of the object’s post-creation steps, as well as just calling its destructor, since its destructor doesn’t always know all the lists that the object itself had been added to. It would be possible to wrap those post-creation steps in another wrapper object I suppose, but that starts to get really really complicated…
      
      Reply
      1. fscan says:
        
        Friday Dec 28, 2012 at 8:26 am
        
        Yes you can write bad code in every language.
        The most important thing is to think about who owns a resource and is therefore responsible for cleaning it up. C++ makes it really easy to deterministically (is this a word?) clean up resources by providing value semantics (and therefore defined scope) through wrapper classes, which by the way have little to no overhead (depending on how good your compiler is).
        And you get exception safety for free if you stick to RAII.
        
        Reply
        
        Bryan says:
        
        Friday Dec 28, 2012 at 10:40 am
        
        Yes, but you can’t always do RAII, is what I was saying.
        
        “Cleanup” is not exclusively “deallocating the memory for the object”. It also includes “removing pointers to the object from any random other lists of pointers that it may belong to”, which is impossible to do from the object’s destructor.
        
        And I don’t think you can write that off as “bad code” — see the two linked oldnewthing posts for two perfectly legitimate cases where this type of cleanup is needed. The OS maintains a list of notification icons (so that it can, you know, display them :-P), and the Team class maintains a list of its Player objects (so that it knows who’s on which team). Both of these lists need to be cleaned up, otherwise you’re going to crash when calling a method on the class in the list, and passing a “this” pointer whose memory has been deallocated.
        
        (I suppose the notification-icon class could clean up the OS’s list. It’ll still leak the HICON, but that’s because .net is silly. The Team/Player pair still need manual cleanup outside the destructor though, especially in the case of multiple threads…)
        
        Reply
        
        fscan says:
        
        Wednesday Jan 2, 2013 at 11:35 pm
        
        Late reply .. i mean, for things like this you can always use shared_ptr.
        eg:
        struct SystemNotification {
        shared_ptr<Notification> n;
        
        SystemNotification(shared_ptr<Notification> const &ptr) : n(ptr)
        { /* add to system tray */ }
        
        ~SystemNotification() { /* remove from system tray */ }
        };
        
        Personally, i try very hard not to use shared pointer. I like to know which class owns an object and with shared pointer this can get very unclear. But like you said, if you work with an external api sometimes you have no choice :)
        
        edit: site ate my brackets
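
        The same wrapper idea can be sketched for the Team/Player case discussed above. Everything here (Team, Player, RosterEntry) is a made-up illustration rather than code from the linked articles, and it ignores the multi-threading concern entirely.

```cpp
#include <algorithm>
#include <vector>

struct Player { int id; };

// Hypothetical Team that keeps non-owning pointers to its players.
struct Team {
    std::vector<Player *> roster;
};

// Guard object: adds the player to the roster on construction and removes
// it again on destruction, so the roster never holds a dangling pointer.
class RosterEntry {
public:
    RosterEntry(Team &team, Player &player) : team_(team), player_(&player)
    {
        team_.roster.push_back(player_);
    }
    ~RosterEntry()
    {
        std::vector<Player *> &r = team_.roster;
        r.erase(std::remove(r.begin(), r.end(), player_), r.end());
    }
    RosterEntry(const RosterEntry &) = delete;             // one guard per
    RosterEntry &operator=(const RosterEntry &) = delete;  // registration
private:
    Team &team_;
    Player *player_;
};
```

        The guard has to be declared after the Team and the Player it refers to, so that it is destroyed first and both are still alive when it unregisters.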
        
        Reply
2. Rick says:
  
  Thursday Dec 27, 2012 at 2:09 am
  
  The first dynamic website I built (I was a kid) used ini files for user data instead of databases. So many years ago.
  
  Reply
Rosseloh says:

Wednesday Dec 26, 2012 at 11:13 am

Apart from the whole “spending time with family” thing (No, I’m not particularly social, what clued you in?), my Christmas was pretty good.
At least the family part wasn’t an all-day affair. We went to see the Hobbit (a 3rd time for me) and then I was able to go home and just play games the rest of the day.

Reply
Bryan says:

Wednesday Dec 26, 2012 at 11:21 am

As for a way to read / write .ini files — I had to implement that in C++ when I ported all three of Terrain, Pixel City, and Frontier to Linux. From Pixel City, see e.g. https://github.com/BryanKadzban/pixelcity/blob/master/Ini.cpp

Not sure if that would work (it does seem a bit fragile, especially at handling user input), but once the {Get,Set}ConfFileEntry template functions are written, it’s actually a few lines fewer to use than the {Write,Read}PrivateProfileString functions.

Although… hmm. It looks like this doesn’t support sections either. I thought it did?

Aha, the Frontier version did; this earlier version did not. This one instead:

https://bitbucket.org/bryankadzban/frontier/src/67c40a43552b/Terrain/Ini.cpp?at=default

will do sections (though it won’t preserve sections). It also requires C++ for the templating of course. But if it’s useful, please use it. :-)

Reply
1. Bryan says:
  
  Wednesday Dec 26, 2012 at 12:31 pm
  
  Whoops. It’ll preserve sections just fine. But it won’t preserve line breaks, or comments. Sigh, brain got disconnected from the typing there for a bit, and I didn’t notice until re-reading it now.
  
  Reply
    1. Roger Hågensen says:
    
    Thursday Dec 27, 2012 at 12:53 pm
    
    I wouldn’t worry too much.
    
    If you need comments in a config file then it’s no longer just a config file. (though comments that are in the default.ini is fine and many games do this) A separate documentation file might be in order then.
    
    As to sections, they are both a blessing and a curse, how should the parser handle duplicate sections (but with different variables), are they added (how to handle numbers then, can they be comma separated, etc.) And how to handle the lack of a section (as a “” maybe?)
    
    At this point looking at XML or some other standard is suddenly no longer so silly.
    A program I’m working on the skins for uses .ini files for parameters, with no sections at all and no comments.
    This is how it looks:
    
    name=Tir
    left_x=0
    left_y=52
    left_w=360
    left_h=168
    left_angle=0.0
    left_rgba=$00000000
    left_autofade=1
    right_x=0
    right_y=52
    right_w=360
    right_h=168
    right_angle=0.0
    right_rgba=$00000000
    right_autofade=1
    db_x=80
    db_y=152
    db_rgba=$dfcf3f7f
    db_size=14
    shimmer=1
    
    For simple configs like this the .ini is unbeatable; for something larger and standardized, which can be interchangeable among multiple platforms and different software (like in my case), XML is worth it, especially if the XML code is already used elsewhere in the software.
    
    Keep the .ini as basic as possible, that is partly the reason why it’s been so popular, start adding bloat and you get issues later.
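
    To make the section questions above concrete, here is a minimal sketch of one possible set of answers. The policy is an assumption, not any kind of standard: keys that appear before any [section] go under "", duplicate sections are merged, a repeated key keeps the last value seen, and lines starting with ';' or '#' are skipped. The names IniData, trim and parse_ini are invented for this example.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::map<std::string, std::string> > IniData;

// Strips leading and trailing spaces, tabs, and line-ending characters.
std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parses INI text into section -> key -> value, all stored as strings.
IniData parse_ini(const std::string &text)
{
    IniData data;
    std::string section;               // "" until the first [section]
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == ';' || line[0] == '#') continue;
        if (line[0] == '[' && line[line.size() - 1] == ']') {
            section = trim(line.substr(1, line.size() - 2));
            continue;
        }
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;   // not key=value; ignore
        data[section][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return data;
}
```

    A sectionless skin file like the one above ends up entirely under the "" section, so the same code covers both styles.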
    
    Reply
Roger Hågensen says:

Wednesday Dec 26, 2012 at 11:24 am

For those not seeing the issue with .ini file parsing, take a peek at http://en.wikipedia.org/wiki/INI_file
That’s as close as you get to an “official” standard (please note the “” as there is no actual .ini standard).
And then there are variations not shown on that wikipage at all.

INI and XML only structure the data; they do not describe it, so regardless of whether .ini or .xml is the file format, the content is always program/application specific.

That inilib seems nice but has a few issues. It’s GPL (and not LGPL, or BSD or MIT/zlib/PNG license) so it may or may not be possible to use depending on the project and the way it’s distributed.

Also the source for that inilib is over 1.5MB which is insane (I’m a stickler for really small efficient and logical code with tight minimalistic loops) and yeah I know, the build/configure environment eats up a lot of the space, but I consider that part of the source (as you usually need it to build it).

That lib is incomplete (just look in the TODO), it even says that comments can be lost and that full section/key name support isn’t there. At a glance it looks like feature creep (there is mention of replicating certain Windows Registry features).

With such code size and complexity, one might just as well go for TinyXML2 instead: http://sourceforge.net/projects/tinyxml/
Which is fully standards compliant and any XML viewer/web browser or editor that support XML can read/write the .xml file generated.
It is also way smaller (1.2MB and 80% of that is the documentation) than that inilib. The license is also the very liberal zlib (MIT/libPNG) license.

Reply
lethal_guitar says:

Wednesday Dec 26, 2012 at 11:31 am

I found the Boost string library extremely useful for tasks like this. It has most of the stuff std::string lacks, like joining/splitting, lexical_cast (for number conversion), and so on.

And thanks to Linux package managers, it is also extremely easy to setup and use.

Reply
el_b says:

Wednesday Dec 26, 2012 at 11:49 am

I’d been watching the livestream XCOM game from a while ago, and the site is saying that a lot of the videos on Rutskarn’s channel don’t exist anymore. I was wondering if the site was pulling a Viddler or something.

http://www.livestream.com/chocolatehammer/video?clipId=pla_9203ab12-615a-41fb-9207-ee17c7622c43

Reply
Urs says:

Wednesday Dec 26, 2012 at 12:09 pm

Christmas Eve is the day of celebration here and since the nearest bit of (other) family lives more than 800 kilometers away, it was just the three of us (me, my girlfriend and my daughter) having a cosy evening.

On your Christmas I spent about one fruitless hour trying to figure out why my program (visual programming here: “patch”, actually) runs perfectly fine unless I fullscreen the output, which seems to introduce a mysterious something-like-a-buffer out of nowhere. And I’m talking logic defying timetravel mystery here. sigh.

Reply
Paul Spooner says:

Wednesday Dec 26, 2012 at 12:14 pm

My Christmas was really good! We’re visiting family (my parents and my wife’s parents live about a block apart) and it was good to see my brothers again. Didn’t get much stuff, but it’s ceasing to be about that anymore, which is an interesting transition to observe.
On the other hand, hanging out with the in-laws is always a bit… trying. I have trouble having a good time, which makes it hard for my wife to have a good time as well. That, and I feel like my kids pick up bad habits from them, which are only going to result in hours of re-training. Oh well.
So, mixed bag. Learning to be tolerant (the hard way). Enjoying old friends.

I’ve enjoyed writing parsers as well, but mostly in Python, which has none of these frustrations. I wonder, could you write the parser in another language (PHP or Perl or Python or something) and then build a link to the module from your C library?

Reply
Jacob Albano says:

Wednesday Dec 26, 2012 at 12:29 pm

I’ve been using Lua for data storage in my current project. I’m not actually a fan of Lua as a scripting language, but it has a really nice syntax for data structures.

Here’s a little library I made to help out:
https://bitbucket.org/jacobalbano/luatable/overview

It’s not quite perfect yet, but it’s served me well so far.

Reply
Julian says:

Wednesday Dec 26, 2012 at 12:38 pm

That goto construct can be avoided by breaking out of a one-round loop, (probably do { } while(0);) but really, the gotos are clearer about your intent.
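
For comparison, here is roughly what the one-round loop looks like. This is a sketch with an invented function (looks_like_entry), not the code from the post; each break plays the role of “goto cleanup”.

```cpp
#include <cstdlib>
#include <cstring>

// Returns true if `input` looks like "key=value" with a non-empty key.
// The single-pass do { } while (0) lets `break` jump to the cleanup code.
bool looks_like_entry(const char *input)
{
    bool ok = false;
    char *copy = static_cast<char *>(std::malloc(std::strlen(input) + 1));
    do {
        if (!copy)
            break;                        // allocation failed
        std::strcpy(copy, input);
        char *eq = std::strchr(copy, '=');
        if (!eq)
            break;                        // no '=' at all
        if (eq == copy)
            break;                        // empty key
        ok = true;
    } while (0);
    std::free(copy);                      // runs on every path
    return ok;
}
```

It works, but a reader has to notice that the loop never loops, which is why the goto version arguably states the intent more plainly.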

Reply
SteveDJ says:

Wednesday Dec 26, 2012 at 1:10 pm

Somewhat OT, but I just had to post because just today I encountered two of your comment-counter messages that were truly fascinating to me (40, and 77 — and yes, by posting this, 40’s message is gone now).

Have you ever posted a list of all the messages you’ve built into your comment counter? Thinking about the coding behind it, it must be a huge switch statement… actually maybe not a switch, as some messages connect to a single number (simple EQUALS test) while others connect to a range of numbers (perhaps GREATER THAN OR EQUAL test?). Hmmm, this alone could be an interesting post someday… :-)

Oh, and I had a lovely Christmas! Managed to make it all the way to 8am before being woken up… :-)

Reply
1. WJS says:
  
  Wednesday Mar 15, 2017 at 7:38 pm
  
  You could do that with a switch, but it would be pretty sparse with a lot of fallthrough. An elseif chain might be better (although the switch would be slightly easier to make changes to).
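
  A sketch of the else-if version, mixing exact matches and a range. The messages and thresholds here are entirely made up; I have no idea what the real list looks like.

```cpp
#include <string>

// Hypothetical messages and thresholds, for illustration only.
std::string counter_message(int count)
{
    if (count == 0)
        return "no comments";
    else if (count == 1)
        return "one comment";            // exact match
    else if (count == 42)
        return "the answer";             // exact match
    else if (count >= 100)
        return "lots of comments";       // open-ended range
    else
        return std::to_string(count) + " comments";
}
```

  The range test is the part a plain switch can’t express without listing every value, which is where the fallthrough would pile up.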
  
  Reply
Lord Nyax says:

Wednesday Dec 26, 2012 at 1:19 pm

I had an excellent Christmas because I received The Witch Watch as a present! I spent the rest of Christmas reading it (and waiting for Steam games to download on a 100kbs connection). Great read, and I was glad to be able to support my favorite blogger. Here’s hoping you don’t lose the motivation to finish your current book; if you keep it up then maybe my future Christmases will be as nice.

Reply
silver Harloe says:

Wednesday Dec 26, 2012 at 3:31 pm

It’s kinda nice in Windows that Most (but not all) things use one kind of file (.ini) for config. Linux is based on a much more… uh, varied background. So if you want to administrate it, you need to know how a billion different config files look. A maze of twisty little passages, all slightly different.

Reply
1. WJS says:
  
  Wednesday Mar 15, 2017 at 7:46 pm
  
  Whuh? What kind of crazy programs are you using? Most of the ones I have are just ini files, except they might be named .conf instead. The rest are basically ini files except with whitespace instead of an = sign.
  
  Reply
decius says:

Wednesday Dec 26, 2012 at 4:43 pm

Is it just me, or should there be a WritePrivateProfileString function for each operating system that works exactly like the windows one does (right down to the middle-endian portion)?

One person has to make one function, once. Isn’t that what the internet is for?

Reply
1. Roger Hågensen says:
  
  Wednesday Dec 26, 2012 at 8:22 pm
  
  That’s the problem with standards, there is always more than one.
  
  But I agree, why a Linux API for something basic like that does not exist I have no idea.
  
  Heck, Windows XP (maybe earlier) and later has MSXML so you could leverage the OS XML API, no idea if Linux and MacOS has something similar.
  
  Maybe if there was a POSIX standard for .ini it might have happened *shrug*. I suspect .ini will remain an “underground” standard forever.
  
  Reply
  1. Bryan says:
    
    Thursday Dec 27, 2012 at 12:04 pm
    
    Similar to MSXML? Yeah, there’s either SAX (if your schema is simple enough that a single callback per tag (IIRC anyway) and a serial walk of the text file can work) or libxml (which generates an entire DOM tree and gives you the ability to process it — this is much more expensive in terms of time and memory than SAX’s setup, but is also a lot more flexible since you’re not limited to a serial tag walk).
    
    At least libxml is installed on almost every system; I believe SAX is about as widespread.
    
    Reply
  2. decius says:
    
    Thursday Dec 27, 2012 at 7:31 pm
    
    If there are multiple competing .ini parsers, all the better: choose whichever working one makes the most sense for you. It’s still a single include that gets you functions that you call that (allegedly) do what you want them to.
    
    Plus, if you are WRITING the program in question, the .ini files shall conform to whichever standard you make them.
    
    Reply
    1. WJS says:
      
      Wednesday Mar 15, 2017 at 7:48 pm
      
      Multiple libraries is fine. Multiple standards is not. (Obligatory xkcd)
      
      Reply
2. AyeGill says:
  
  Friday Dec 28, 2012 at 4:31 am
  
  relevant
  
  Reply
TehShrike says:

Wednesday Dec 26, 2012 at 4:56 pm

I’ve been using JSON for my config files recently. It’s easy to read and edit, and the parsers are generally plentiful. :-x

Reply
Mephane says:

Wednesday Dec 26, 2012 at 5:48 pm

Shamus, now I also feel like I should write an INI parser just for fun. Am I weird?

Reply
Tino Didriksen says:

Wednesday Dec 26, 2012 at 8:51 pm

Challenge accepted! A C++ .ini parser and writer in 50 lines of code, including a handy trim function to deal with those pesky extra whitespaces humans may leave in the file: http://ideone.com/L8ryzi

Since ini contains no type information, I just store it all in strings. If keys or values can be any binary blob, then more code is needed, but not that much more…just one more ~5 line function.

Naturally this is not meant to be used in real world code, but it definitely shows what C++ can do.
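
Since everything is stored as strings, the type conversion has to happen when a value is read. A minimal sketch of a typed getter, assuming a section is a std::map of strings (Section and get_entry are names invented here, not taken from the linked code):

```cpp
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> Section;

// Reads `key` from `section` and converts it with operator>>.
// Returns `fallback` if the key is missing or the text does not convert.
template <typename T>
T get_entry(const Section &section, const std::string &key, const T &fallback)
{
    Section::const_iterator it = section.find(key);
    if (it == section.end())
        return fallback;
    std::istringstream in(it->second);
    T value;
    if (!(in >> value))
        return fallback;
    return value;
}
```

One caveat: operator>> stops at whitespace, so a value like the three-number Position entry would need to be read with three extractions rather than one.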

Reply
Norman Ramsey says:

Wednesday Dec 26, 2012 at 11:17 pm

My Christmas did not have enough coding in it. But about those goto statements. A parser is a state machine. All parsers are built from state machines underneath. And if you’re coding in C, the standard way to implement a state machine is using goto. It is actually more readable than the alternative. And I tell you, as someone who teaches compilers and programming languages at university level, it is even orthodox. (Although if you happen to be parsing an LL(1) EBNF grammar, it is even more orthodox to use classic recursive descent with if statements and while loops.)

My holidays need to have more coding in them.

Reply
1. Bryan says:
  
  Thursday Dec 27, 2012 at 12:13 pm
  
  Behold, a state machine written with switch/case instead of goto:
  
  http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
  
  Also coroutines, implemented in portable C.
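
  Leaving the coroutine trick aside, the plain switch-per-state style looks something like this sketch, which classifies a single .ini line. The states and the LineKind enum are my own invention for the example, not anything from the linked page.

```cpp
#include <string>

enum LineKind { BLANK, SECTION, ENTRY, INVALID };

// Classifies one INI line with an explicit state variable and a switch,
// one character per step. No goto is involved.
LineKind classify_line(const std::string &line)
{
    enum State { START, IN_SECTION, AFTER_SECTION, IN_KEY, IN_VALUE };
    State state = START;
    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        switch (state) {
        case START:
            if (c == ' ' || c == '\t') break;          // skip leading blanks
            if (c == '=') return INVALID;              // empty key
            state = (c == '[') ? IN_SECTION : IN_KEY;
            break;
        case IN_SECTION:
            if (c == ']') state = AFTER_SECTION;
            break;
        case AFTER_SECTION:
            if (c != ' ' && c != '\t') return INVALID; // junk after ']'
            break;
        case IN_KEY:
            if (c == '=') state = IN_VALUE;
            break;
        case IN_VALUE:
            break;                                     // value may be anything
        }
    }
    switch (state) {
    case START:         return BLANK;
    case AFTER_SECTION: return SECTION;
    case IN_VALUE:      return ENTRY;
    default:            return INVALID;                // unterminated '[' or no '='
    }
}
```

  Each case is one state and each assignment to state is one transition, so the code can be checked against a state diagram more or less line by line.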
  
  :-P
  
  Reply
Sydney says:

Thursday Dec 27, 2012 at 8:13 am

Why is “goto” so forbidden? As a non-coder who spends time in geek culture (xkcd introduced me to the concept), I don’t get why something would i) Be so taboo, yet continue to ii) Exist.

Reply
1. Kian says:
  
  Thursday Dec 27, 2012 at 8:53 am
  
  goto breaks the flow of the program, by instructing the program to go to any arbitrary point in the code.
  
  Imagine reading a book whose pages were out of order, and at the end of each you were told “continue on page x”. Kind of like a choose your own adventure book, only without multiple endings — so a choose your own adventure book written by Bioware *zing!*.
  
  That’s what making sense of a program that uses goto is like.
  
  As to why it’s used, it’s because sometimes it’s easier on the person writing the code to force a jump rather than plan the program ahead and design a clear flow that does what he needs. Or they are dealing with feature creep and can’t afford a rewrite.
  
  Essentially, it’s a necessary evil. The best you can hope for is that the goto is reduced to a “continue on the next page” at the end of every page.
  
  Also, I challenge anyone to find a car analogy for goto. I couldn’t :D
  
  Reply
  1. Asimech says:
    
    Thursday Dec 27, 2012 at 11:41 am
    
    Not a car, but traffic:
    
    Goto is like road work that forces a detour. Possibly necessary, but a pain in the butt.
    
    Reply
    1. Roger Hågensen says:
      
      Thursday Dec 27, 2012 at 12:57 pm
      
      Or in geek speak, with GOTO if you mess up you can easily crash shit, you need to be very careful about registers in use and the stack etc.
      
      Hmm. If you need to use GOTO (or similar) then ASM might make just as much sense at this point (you have to code with the same care then as well).
      
      Reply
      1. Tino Didriksen says:
        
        Thursday Dec 27, 2012 at 1:42 pm
        
        “with GOTO if you mess up you can easily crash shit, you need to be very careful about registers in use and the stack etc.”
        No, goto is guaranteed safe to use in C and C++ – it will unwind the stack properly and everything. You’re maybe thinking of longjmp()?
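
        A tiny C++ sketch of what that means in practice (C has no destructors, so this part only applies to C++). Tracer is a throwaway type invented for the demo; it records when it is constructed and destroyed.

```cpp
#include <string>

// Appends to a log on construction and destruction so the order is visible.
struct Tracer {
    explicit Tracer(std::string &log) : log_(log) { log_ += "ctor;"; }
    ~Tracer() { log_ += "dtor;"; }
    std::string &log_;
};

// Jumps out of a nested scope with goto. C++ destroys `t` as part of the
// jump, before execution continues at the label.
std::string goto_out_of_scope()
{
    std::string log;
    {
        Tracer t(log);
        goto done;
    }
done:
    log += "done;";
    return log;
}
```

        The returned log reads "ctor;dtor;done;", so the destructor ran even though the closing brace of the inner block was never reached normally.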
        
        Reply
      2. fscan says:
        
        Thursday Dec 27, 2012 at 3:53 pm
        
        GOTO cannot jump outside the current function, so i don’t know how the stack would be relevant in any way.
        I don’t know what the fuss is about; if your functions are so big that you have to search where a GOTO goes to, they are too big anyway :)
        That said, I never use it myself .. it reminds me too much of the old BASIC style :)
        
        Reply
  2. X2-Eliah says:
    
    Thursday Dec 27, 2012 at 5:58 pm
    
    Breaks the flow of a program?
    
    What about object-oriented programs, then? The ones that rely on conditional states and properties and interactions of entities and not a single fixed ‘program flow’ – why is that not a taboo, because I sure as hell can see OO code breaking “program flow” just as easily.
    
    Reply
    1. Asimech says:
      
      Friday Dec 28, 2012 at 7:07 pm
      
      I think I’ve heard someone define object-oriented programming as “socially acceptable goto”.
      
      Reply
  3. mystran says:
    
    Thursday Dec 27, 2012 at 7:01 pm
    
    The main problem with goto isn’t really with the jump, but rather with the fact that in an imperative “state mutating” program, it’s generally easier to track the state with structured code. To reason about correctness of the code at a given label, you have to find all the goto statements that jump to that label.
    
    This isn’t much of a problem when goto is used for something like non-local exit in a few well-defined places, but it becomes a huge pain if you have lots of labels jumping back and forth in apparently random order. This is what structured programming was designed to eliminate. In many large C code-bases you can still find a few goto-statements here and there where they improve the clarity, but these are almost invariably non-local exits or certain types of error-handling code (there are some other cases where it’s hard to avoid code-duplication without goto, but usually you can split the code in question to another function as a cleaner solution).
    
    Curiously, tail-calls in functional languages are more or less “goto” statements with arguments, but since programs in such languages are generally structured to use function calls and “binding” rather than sequential execution and “mutation” the whole problem largely goes away. You no longer need non-local information to reason about the code (the lexical environment is sufficient), so there isn’t similar problem with jumping around. You can still take it too far, but it’s significantly less of a problem.
    
    Reply
fscan says:

Thursday Dec 27, 2012 at 4:02 pm

For config files there’s a nice small library called libconfig (http://www.hyperrealm.com/libconfig) with a C and C++ interface. The files look a bit like JSON. It’s available on almost any Linux distribution. I already used this in production code.

Reply
IudexFatarum says:

Thursday Dec 27, 2012 at 4:47 pm

Personally I prefer Java for the exact reason of being OS independent. It does use ini files occasionally (e.g. eclipse) and being OO it allows for quite a bit of flexibility.
Any code that’s re-used becomes its own method so no gotos are needed, and sanitization falls on the object that is being set instead of on the parser. So for example in your system the section up to the first = char is what matters. The parser just reads to the end of the line (which should handle any EOL character), shoves that into a string, and passes it to the object to be set, and it’s a black box how “[[[masta killa187]]]” is dealt with. The same can be done with C++ if you start from a heavy OO perspective.
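
To make that concrete in C++, here is a minimal sketch under some assumptions: the Setting, IntSetting, StringSetting and ApplyLine names are invented for illustration and are not part of Shamus’s parser. The parser only splits at the first = and each object decides what to do with the raw text.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical interface: each setting validates its own raw text.
struct Setting {
    virtual ~Setting() {}
    virtual bool Set(const std::string& raw) = 0;
};

// Accepts only text that is entirely a base-10 integer.
struct IntSetting : Setting {
    int value;
    IntSetting() : value(0) {}
    bool Set(const std::string& raw)
    {
        char* end = 0;
        long parsed = std::strtol(raw.c_str(), &end, 10);
        if (raw.empty() || *end != '\0')
            return false;           // reject "", "abc", "12x"
        value = static_cast<int>(parsed);
        return true;
    }
};

// Accepts anything, brackets and all.
struct StringSetting : Setting {
    std::string value;
    bool Set(const std::string& raw)
    {
        value = raw;
        return true;
    }
};

// Splits "key=value" at the first '=' and hands the rest of the line,
// untouched, to whichever object is registered under that key.
bool ApplyLine(const std::string& line,
               const std::map<std::string, Setting*>& settings)
{
    std::string::size_type eq = line.find('=');
    if (eq == std::string::npos)
        return false;
    std::map<std::string, Setting*>::const_iterator it =
        settings.find(line.substr(0, eq));
    if (it == settings.end())
        return false;
    return it->second->Set(line.substr(eq + 1));
}
```

With a map holding an IntSetting under "TreeSeed", ApplyLine("TreeSeed=654", settings) stores 654, while "TreeSeed=abc" is rejected by the setting itself rather than by the parser.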

Reply
Rolf Andreassen says:

Thursday Dec 27, 2012 at 6:55 pm

You should have more time off from your day job. This is the sort of post I read this blog for. :)

Reply
Roger Hågensen says:

Thursday Dec 27, 2012 at 9:30 pm

*looks around* Is it just me or are there a lot of coders around at this time of year? I certainly can’t recall this many coders in previous posts, either that or the non-coders passed out (or have a life) *laughs*.

Reply
1. Zukhramm says:
  
  Friday Dec 28, 2012 at 3:39 am
  
  They come out during the time of year when the sun’s out as little as possible.
  
  Reply
2. Kian says:
  
  Friday Dec 28, 2012 at 7:49 am
  
  It’s not so much the time of year as it is the subject. When Shamus is talking about his own projects, there’s not much to add. But nothing gets coders riled up as much as giving opinions on a language. Just mention a preference for an IDE, or language, or compiler, and step back.
  
  Reply
  1. Roger Hågensen says:
    
    Friday Dec 28, 2012 at 6:04 pm
    
    Sure. Then again, the crowd here at Twenty Sided is pretty darn nice compared with elsewhere, so such discussions are actually nice to have here.
    
    Oppa Shamus Style?
    
    Reply
ps238principal says:

Thursday Dec 27, 2012 at 11:29 pm

I’m forever associating “parser” with Infocom text-adventure games.

I blame my C-64.

Reply
1. Exasperation says:
  
  Saturday Dec 29, 2012 at 4:40 am
  
  Personally, I always associate “parser” with Dr. Seuss.
  
  Reply
Double says:

Monday Dec 31, 2012 at 7:12 am

” (If this never happens to you, then you're probably a student or working in a really cutting-edge environment where you never have to interface with code that's more than a few years old. Most of us, sooner or later, need to use a char*.) ”

Is wrong.
It is not hard to write c++ programs where you communicate with old software even with the ‘new’ std::strings

std::string in c++ are basically wrappers around character pointers, and they interface just fine with old char * code. First of all standard string CAN be initialized from char pointers, and they CAN just as easily be converted to a char * pointer.
That is the whole point with c++ strings, they are compatible with old c code.

Look up the string constructor and string.c_str().

std::strings are often preferred to character pointers as they are self contained and handle memory themselves. Compilers can also do some very aggressive optimization on objects, which can’t be done on character pointers.
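
A minimal sketch of that round trip, assuming two made-up legacy functions (LegacyName and LegacyLength are placeholders, not real APIs):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical legacy function that hands back a C string.
const char* LegacyName()
{
    return "frontier.ini";
}

// Hypothetical legacy function that only reads its argument.
std::size_t LegacyLength(const char* text)
{
    return std::strlen(text);
}

std::size_t RoundTrip()
{
    std::string name(LegacyName());     // char pointer -> std::string (copies the characters)
    name += ".bak";                     // the string manages its own memory from here on
    return LegacyLength(name.c_str());  // std::string -> const char*
}
```

Note that c_str() returns a const char*, and the pointer is only good until the string is next modified, so this covers old code that reads its arguments but not old code that wants to write into them.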

Reply
1. Shamus says:
  
  Monday Dec 31, 2012 at 12:39 pm
  
  Like I said, try using this in a system where you need to pass around char* that aren’t const. You’ll have to stop and allocate memory and do a copy every time. Once you have to deal with:
  
  int DoThing (char* parm1, char* parm2, char* parm3);
  
  …you will find yourself in a world of annoying clutter.
  
  Reply
  1. Tino Didriksen says:
    
    Tuesday Jan 1, 2013 at 6:18 am
    
    If you have int DoThing(char* parm1, char* parm2, char* parm3); and std::string str1, str2, str3; with some data in them, then you can call that function using DoThing(&str1[0], &str2[0], &str3[0]);.
    
    No need for extra allocations. Sure it looks odd, but it works and is much cleaner than manual memory management.
    
    The only case where you have to deal with char* is when a C function returns a freshly allocated buffer that you have to manage, such as with the non-standard strdup().
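    
    A minimal sketch of that call, with a made-up body for DoThing (it upper-cases each buffer in place, which is only an assumption for the sake of the example) and a hypothetical CallWithStrings wrapper:
    
    ```cpp
    #include <cctype>
    #include <string>
    
    // Hypothetical legacy function: upper-cases each buffer in place and
    // returns the total number of characters it touched.
    int DoThing(char* parm1, char* parm2, char* parm3)
    {
        char* parms[3] = { parm1, parm2, parm3 };
        int count = 0;
        for (int i = 0; i < 3; ++i) {
            for (char* p = parms[i]; *p != '\0'; ++p, ++count)
                *p = static_cast<char>(std::toupper(static_cast<unsigned char>(*p)));
        }
        return count;
    }
    
    int CallWithStrings(std::string& str1, std::string& str2, std::string& str3)
    {
        // &str[0] points at the string's own storage, so nothing is copied.
        // This relies on the storage being contiguous and null terminated.
        return DoThing(&str1[0], &str2[0], &str3[0]);
    }
    ```
    
    This is only safe while the C function stays inside the existing size() characters. If it writes past the end or changes the length, the string has to be sized up front and resized afterwards.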
    
    Reply
    1. Shamus says:
      
      Tuesday Jan 1, 2013 at 9:46 am
      
      !!!!!!
      
      I did not know about that. I spend a lot of time fiddling with old char* code and string, and I’d never seen that used. Too bad. Would have saved some headaches.
      
      I wonder what happens if DoThing () alters one of those strings. Hmmm…
      
      Reply
      1. Bryan says:
        
        Tuesday Jan 1, 2013 at 5:29 pm
        
        Bad, bad, bad things, I’d expect — this is exactly the kind of thing that seems like it’ll only work if the std::string class’s operator[], and its iterators, incidentally, are both implemented naively. The SGI STL reference doesn’t say anything about this either way. :-/
        
        If I had a copy of the C++ standard handy I’d go look for this in there (though I wouldn’t be surprised if it weren’t there either, since I don’t know if STL semantics are specified there or not).
        
        But I’m pretty sure it’s perfectly valid to actually store the data in a string in a bunch of disjoint buffers; this would be useful when trying to do zero-copy I/O, for instance. (Methods like append() and insert() would find this layout extremely efficient as well.)
        
        The offset passed to operator[] would determine which of the internal buffers to use based on the size of each buffer, and then the “rest” of the offset would determine which charT instance to return a reference to. But taking the address of that reference would trivially break if the code you’re passing the pointer to assumes that the layout is sequential.
        
        With this string implementation, .data() and .c_str() would both allocate a new buffer (…though destruction of that buffer would potentially be complicated; it’d have to be tied to the string instance somehow, but .c_str() already has this problem since it has to include a zero byte at the end, while .data() and the string instance itself do not), copy from each of the internal buffers into it, and return a pointer to its first byte. So those don’t actually require any given internal representation…
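        
        Roughly what I have in mind, as a sketch only. ChunkedString is a made-up class, not anything from a real library, and it uses std::string for each chunk purely to keep the example short.
        
        ```cpp
        #include <cstddef>
        #include <stdexcept>
        #include <string>
        #include <vector>
        
        // Hypothetical string that keeps its text in several disjoint buffers.
        class ChunkedString {
        public:
            // Appending never copies the existing text, it just adds a buffer.
            void append(const std::string& piece)
            {
                if (!piece.empty())
                    chunks_.push_back(piece);
            }
        
            // Walk the buffers, using each buffer's size to decide which one
            // the offset falls in; the remainder indexes into that buffer.
            char& operator[](std::size_t offset)
            {
                for (std::size_t i = 0; i < chunks_.size(); ++i) {
                    if (offset < chunks_[i].size())
                        return chunks_[i][offset];
                    offset -= chunks_[i].size();
                }
                throw std::out_of_range("ChunkedString offset past end");
            }
        
            // Stand-in for what .c_str() or .data() would have to do here:
            // copy every buffer into one sequential block.
            std::string flatten() const
            {
                std::string whole;
                for (std::size_t i = 0; i < chunks_.size(); ++i)
                    whole += chunks_[i];
                return whole;
            }
        
        private:
            std::vector<std::string> chunks_;
        };
        ```
        
        With "Shader" and "Normal" appended, s[6] is 'N', but &s[0] + 6 does not point at it, which is exactly how code that assumes a sequential layout would break.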
        
        Reply
        
        Tino Didriksen says:
        
        Wednesday Jan 2, 2013 at 4:02 am
        
        It is guaranteed safe as per the C++ Standard, both 1998 and 2011 editions.
        – It is explicitly stated in C++ 2011 that std::string must be stored contiguously and with an extra null termination byte.
        – In C++ 1998, this is not explicitly stated, but it is implicitly the case due to complexity requirements and other parts.
        – All implementations do it the safe way, anyway, which formed the basis for explicitly requiring it in C++ 2011.
        
        As for modifying an std::string via a C function, my post way above showed how to:
        std::string buffer(32, 0); // 32 byte buffer
        size_t newlen = sprintf(&buffer[0], "%d", (int)time(0)); // store the actual written length for later
        buffer.resize(newlen); // set string.size() to match the number of used bytes
        
        You don’t have to .resize() after if you’re using some other way to track used length (even relying on null termination), but I prefer to let std::string handle it.
        
        Reply
Anachronist says:

Tuesday Jan 1, 2013 at 2:49 pm

Thanks for validating my views regarding goto. And thinking back, where did I use goto extensively? Parser code! I wrote a C++ program called JACOsub that was popular in the Amiga anime subtitling community in the 1990s. It included a parser that interpreted a rather complex subtitle script format that included multiple timing formats as well as codes for font selections and positioning and formatting of subtitles.

It always bothered me a bit that I had to rely heavily on goto while interpreting a script. Not only that, but I often used goto to jump to different sections inside a switch() statement to avoid duplication and keep my code size small (the whole package had to fit on a 880K floppy diskette and occupy less than a megabyte of memory while running).

Never having seen anyone else’s parser code before, I wondered if I was doing the right thing. I mean, my source code made perfect logical sense even with all those goto statements, and my code was tight and efficient, so I figured it was OK. I’ve had this nagging feeling about what other programmers would think if they saw my source, but then again, I couldn’t see how I’d do it any other way that made as much sense. A dozen or so years later, you have lifted that small weight of uncertainty from my mind. Thanks!

Reply
Cuthalion says:

Wednesday Jan 2, 2013 at 7:52 am

I’m not sure I know why, but this is one of my favorite posts in a long time. I think I just enjoy reading your explanations of things? Or maybe it’s because I’m going to have to write a scripting parser soon? Whatever it was, it was entertaining.

Reply
Andy L says:

Monday Mar 11, 2013 at 2:55 am

I have to agree with the people above who recommend the Boost library.

I feel like it’s everything that should have been included in C++ to make it a competitive modern language. (Compared to Java or C# for example.) And it’s an extremely well-behaved library, it makes no assumptions and doesn’t force the programmer’s hands.

Their one-line parsers for XML, INI, INFO, and JSON are a great example.

Check out this “Five minute tutorial” for loading config files with the Boost libraries. The functions to load and save the data are 7 lines each!

Reply
