{"id":18188,"date":"2012-12-26T08:09:51","date_gmt":"2012-12-26T13:09:51","guid":{"rendered":"http:\/\/www.shamusyoung.com\/twentysidedtale\/?p=18188"},"modified":"2012-12-26T08:49:27","modified_gmt":"2012-12-26T13:49:27","slug":"coding-a-parser","status":"publish","type":"post","link":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=18188","title":{"rendered":"Coding a Parser"},"content":{"rendered":"<p>I spent Christmas day coding. That was fun. As part of my efforts to move to Linux, I decided to port some of my code. One of the first things I&#8217;ll want in the world of Linux is the ability to read .ini files.<\/p>\n<p>I really like .ini files. You can put any program settings in them in any order.  You can edit them with a text editor. You can read and write to them from within your program. This is much better than (say) storing all your settings in binary files.  Some people are moving to XML these days, but XML files are massive overkill for a job like this, and end up being incredibly verbose and annoying for humans to read.  For context, here is the .ini file for <a href=\"?p=11874\" title=\"Project Frontier #1: Getting Started \">Project Frontier<\/a>:<\/p>\n<p><!--more--><\/p>\n<pre lang=\"ini\">[Settings]\r\nTreepos=3611.557861 2473.824219 8.986277\r\nTreeSeed=654\r\n\r\n[Animations]\r\nIdle=idle\r\nRunning=run\r\nSprinting=run\r\nFalling=fall\r\nJumping=fall\r\nSwimming=fall\r\nFloating=idle\r\nFlying=idle\r\n\r\n[Avatar]\r\nCameraDistance=11.00\r\nAngle=76.000000 0.000000 73.199890\r\nPosition=7806.417969 4053.380615 -0.217506\r\nFlying=0\r\nMouseSensitivity=1.00\r\nInvertY=1\r\n\r\n[Shaders]\r\nShaderNormal=standard.cg\r\nShaderTrees=trees.cg\r\n<\/pre>\n<p>The drawback of .ini files is that they&#8217;re basically a Windows thing. If you&#8217;re writing code targeted at windows, then you can change one of the above settings like so:<\/p>\n<pre lang=\"c\" line=\"1\">\r\n\/\/Specify section, then entry, then the new value, then the file being changed.\r\n\/\/I have no idea why the inputs are in this order. Wouldn't it make more sense\r\n\/\/to list the FILE first, so they're in ascending levels of specificity?\r\n\/\/But whatever....\r\nWritePrivateProfileString (\"Shaders\", \"ShaderNormal\", \"cellshade.cg\", \"frontier.ini\");\r\n<\/pre>\n<p>Perhaps in your code you don&#8217;t put paragraph-sized editorials in the comments? I dunno. That&#8217;s your business. <\/p>\n<p>The problem is that <tt>WritePrivateProfileString ()<\/tt> is not available on other platforms. If you want to use ini files elsewhere, then you need to write your own version of these. This means writing a text parser. Text parsers can sometimes be kind of fiddly.<\/p>\n<p>The big problem is that C++ is not the best language for juggling text. In fact, it might be one of the worst. Yes, you can use std::string. That&#8217;s not so bad. But sooner or later you&#8217;ll want to pass around char* strings. (If this never happens you you, then you&#8217;re probably a student or working in a really cutting-edge environment where you never have to interface with code that&#8217;s more than a few years old. Most of us, sooner or later, need to use a char*.) When that happens you&#8217;ll need to make a char*, allocate some memory, copy the std::string, and then forget to free () the memory later because screw this language, man. And even when you&#8217;re free to use std::string, you still wind up with situations where you can&#8217;t do simple things like this:<\/p>\n<pre lang=\"c\" line=\"1\">\r\nvoid HeaderName (const char* section_name)\r\n{\r\n  string   section_header;\r\n\r\n  \/\/This is not allowed:\r\n  section_header = \"[\" + section_name + \"]\" + EOL;\r\n\r\n  \/\/Presumably we would do something more here. Maybe return a value or something?\r\n  \/\/Don't judge. YOU try writing plausible example code!\r\n}\r\n<\/pre>\n<p>You either need to clutter up the code with a bunch of casting, or you have to break the operation up onto multiple lines. That&#8217;s fine. You&#8217;ll still get the job done and it&#8217;ll still work, but it can be done cleaner and faster in other languages, and with less memory-management pitfalls.<\/p>\n<p>But the real horror show begins when you have to maintain a parser written in vanilla C. No std::string to help you.  No <tt>new<\/tt> and <tt>delete<\/tt> to make allocating memory easier and less dangerous. When you build a parser in C, you are cutting down a tree with a utility knife. <\/p>\n<p>In my Activeworlds days, I had to maintain such a parser. It had been written in 1994 or so using nothing but base C.  Also, the material being parsed was particularly troublesome. It was a scripting language that&#8230;<\/p>\n<ol>\n<li>&#8230;was often written by end users. The parser needed to be VERY forgiving of errors or it would drive people crazy and be too large of a challenge for the average person to learn. If possible, one bad command shouldn&#8217;t prevent subsequent commands from executing.\n<\/li>\n<li>&#8230;was used in bulk. The users went around the world tagging objects with scripts. For example, the picture frame on the wall might change images when clicked, or bumping into an object would teleport the user to a new place. Scenes were frequently made of thousands of objects, and all of this was taking place in the late 90&#8217;s, before the days of ubiquitous graphics acceleration.  Every CPU cycle cut directly into the framerate, so the parser needed to be as fast as possible.\n<\/li>\n<li>&#8230;was designed to be as terse as possible. All of this data was flowing down the user&#8217;s 28.8k modem connection. Users, being users, will naturally expand to use ALL available space.  Without limits end users will never stop packing in data. If you let them write a megabyte-long script, they will.  And then they will copy &#038; paste that script onto every object in the vicinity. So each object was limited to 256 bytes. With limits like this in place, every single byte matters. Certain fields need to be optional. There need to be abbreviations.\n<\/li>\n<\/ol>\n<p>For example, one thing I added was the <a href=\"http:\/\/www.activeworlds.com\/help\/aw41\/document.php?rotate_command\" title=\"The Rotate Command - Activeworlds Help\">rotate command<\/a>. <\/p>\n<pre lang=\"ini\">rotate [x] y [z] [sync OR nosync] [time=time] [loop OR noloop] [reset OR noreset] [wait=wait] [name=name] [smooth]<\/pre>\n<p>You could use this to make an object spin on all axis, like so: <\/p>\n<p><tt>rotate 10 20 30<\/tt>. <\/p>\n<p>If I remember correctly, those numbers were expressed in RPM.  That command would spin the object 10 times a minute on the X axis, 20 times a minute on the Y, and 30 times a minute on the Z.  However, the vast majority of cases where the command is used are just to make an object spin on the Y axis.  (Like a merry-go-around.) So instead of wasting FOUR WHOLE BYTES specifying zeroes for X and Z, you&#8217;re optionally allowed to specify only Y. <\/p>\n<p>After the numbers(s), you have a list of directives that may or may not be used. Using the rotate command, you could make a door that swung open an closed. You could put hands an an analog clock that would always show the correct time. You could make a blade that swings back and forth, Skyrim dungeon-trap style. You could make a spinning helicopter blade.<\/p>\n<p>That&#8217;s a lot of power in a very small command, with the downside being that it&#8217;s a pain in the ass to parse. <\/p>\n<p>Parsers usually work by taking a block of text and breaking it up by whitespace. (Whitespace is any non-printing character. Spaces, tabs, line feed.) Extra whitespace is ignored, so <tt>rotate 5<\/tt> is identical to <tt>rotate&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;5<\/tt>.  This is how pages are parsed by your web browser.  It&#8217;s how my C++ compiler reads my code. It&#8217;s how ini files, css files, and MS-DOS batch files are read. <\/p>\n<p>Parsing code remains the only place I have ever seen the forbidden <tt>goto<\/tt> used in production code. In many parsing situations, once you&#8217;re past a word you can discard it, but in this parser there were situations where you needed to save values for later. So the parser would allocate a whole bunch of stuff, saving values while reading a command that could end in error at any time. I no longer have access to the codebase, and I&#8217;m pretty sure the whole thing was re-written at some point in the last few years, but back in the day I remember seeing something like:<\/p>\n<pre lang=\"c\" line=\"1\">\r\nvoid parse_things ()\r\n{\r\n  char *thing1, *thing2, *thing3;\r\n\r\n  thing1 = thing2 = thing3 = NULL;\r\n  \/\/allocate thing 1\r\n  thing1 = get_next_string ();\/\/makes a copy\r\n  if (whole bunch of complex tests prove that thing1 makes no sense)\r\n    goto done;\r\n  thing2 = get_next_string ();\/\/makes a copy\r\n  if (more tests to see if thing1 and thing 2 make no sense together)\r\n    goto done;\r\n  thing3 = get_next_string ();\/\/makes a copy\r\n  if (more tests to see if all three things fail to add up to a proper command)\r\n    goto done;\r\n  \/\/Yay! The user managed to enter something coherent!\r\n  do_thing (thing1, thing2, thing3);\r\n  done:\r\n  if (thing1)\r\n    free (thing1);\r\n  if (thing2)\r\n    free (thing2);\r\n  if (thing3)\r\n    free (thing3);\r\n}\r\n<\/pre>\n<p>Keep in mind that those if () statements are underselling the complexity at work here. For example, <em>if thing2 ends in .jpg or .png then it is a texture name for sure.  But if not, then it still MIGHT be a texture name, pending the contents of the other things. But those other tests shouldn&#8217;t be performed unless thing2 doesn&#8217;t have an extension. Later, if thing2 doesn&#8217;t have an extension and we figure out it must be a texture name, then we append .jpg.<\/em> And so on. You get the idea. We&#8217;re talking about complex branching logic, done in stages, each of which allocates memory that needs to be released before we move on.  <\/p>\n<p>Now, the C\/C++ orthodoxy is that <em>goto is forbidden. Never shall ye use it, lest ye be subject to ridicule and possibly stoning.<\/em> That&#8217;s mostly true, and even if you do happen to encounter a situation where goto looks like a good solution, most programmers will avoid using it because of social pressure. Using goto in production code is the equivalent of an electrician turning screws with a butterknife. It might get the job done, but it looks unprofessional.<\/p>\n<p>But I could never see any good way to avoid the use of goto in the <tt>parse_thing ()<\/tt> code. Sure, you COULD get rid of it, but just about anything else would require more redundant lines of code, much deeper nesting, and more convoluted logic. Despite what they teach you in school, readability trumps orthodoxy. Most likely the presence of goto here signals that the back end of the parser itself (where it pulls text in) is perhaps built&#8230; oddly. I won&#8217;t diagnose it further to avoid getting into nitty-gritty details and publicly critiquing code written by other people while I was still making tacos for a living. But the point is, this goto was probably a symptom, not a problem.<\/p>\n<p><strong>Hang on. What were we talking about again?<\/strong><\/p>\n<p>Oh! ini files, right. Totally forgot. So yesterday I wrote some code to parse Windows-style ini files. I usually hate parser work. It&#8217;s very fussy and has lots of little pitfalls and hassles and headaches. What if this whitespace is part of the data? Is this file using Windows or Linux style line breaks? What if the user has odd spaces where they shouldn&#8217;t, like inside the [Section] header? What if some of the data has markup in it, like so:<\/p>\n<pre lang=\"ini\">[User]\r\nLoginName=bob@example.com\r\nCharacterName=[[[masta killa187]]]\r\nInvertMouse=0\r\nRememberPassword=0\r\n<\/pre>\n<p>If done wrong, then Bob would break the settings file when he names his character &#8220;[[[masta killa187]]]&#8221;. He arguably deserves it, so maybe you can pass this off as a feature?<\/p>\n<p>Parsers always seem simple at first, but even something as rudimentary as an ini file can have a lot of possible routes for chaos once you allow for the fact that they must contain the most dangerous form of information in computer science: <em>User entered data<\/em>. <\/p>\n<p>Like I said, I usually hate writing parsers. They&#8217;re boring drudge work. But for whatever reason I was in the mood for that kind of work yesterday. So&#8230; that&#8217;s what I did. Seemed to turn out okay. <\/p>\n<p>Did I really just spend 2,000 words rambling on through digressions instead of getting to the point, which wasn&#8217;t interesting anyway? I might have. Anyway. How was your Christmas?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I spent Christmas day coding. That was fun. As part of my efforts to move to Linux, I decided to port some of my code. One of the first things I&#8217;ll want in the world of Linux is the ability to read .ini files. I really like .ini files. You can put any program settings [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[66],"tags":[],"class_list":["post-18188","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/18188","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=18188"}],"version-history":[{"count":0,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/18188\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=18188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=18188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=18188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}