{"id":49922,"date":"2020-05-07T06:00:29","date_gmt":"2020-05-07T10:00:29","guid":{"rendered":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49922"},"modified":"2020-05-08T04:23:02","modified_gmt":"2020-05-08T08:23:02","slug":"scraping-part-4-parsing-html","status":"publish","type":"post","link":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49922","title":{"rendered":"Scraping Part 4: THE FINAL CHAPTER"},"content":{"rendered":"<p>My bot is now<span class='snote' title='1'>Well, not RIGHT now. This series was written after the bot was completed.<\/span> downloading pages from Metacritic, one at a time, at the rate of a page every couple of seconds. This would be painfully slow if we were trying to read something large-scale, but right now we&#8217;re just scraping for PC games that scored above 30 over the last 19 years. That&#8217;s well under 1,000 games.<\/p>\n<p>Of course, downloading these pages isn&#8217;t useful unless I can pull information out of them. Much earlier in this series I mentioned I&#8217;m using the <a href=\"https:\/\/html-agility-pack.net\/\">Html Agility pack<\/a>. This library can parse HTML for me and return the bits I&#8217;m interested in.<\/p>\n<p>One of the funny things about this project is that I&#8217;m so far out of my comfort zone \/ area of expertise that I don&#8217;t even know what I don&#8217;t know. Not only am I likely making lots of hilarious blunders, but I don&#8217;t even know that I&#8217;m making them.<\/p>\n<p>This is strangely liberating. When I know what I&#8217;m doing, then every cut corner makes me feel vaguely guilty. But when you don&#8217;t know what you&#8217;re doing, you&#8217;re free of the obligations to do things the Right Way(tm) because you don&#8217;t know what the right way is! As far as I know, I&#8217;ve just written the best web scraper in the history of scraping<span class='snote' title='2'>Despite the lack of proof, I&#8217;m fairly confident that I have not actually written the best web scraper in the history of scraping.<\/span>.<\/p>\n<p><!--more--><\/p>\n<p>Unlike a lot of projects, I&#8217;m posting this one after-the-fact, so I can&#8217;t take advantage of the advice people are sharing in the comments. The project is done, so your advice is useless without a time machine<span class='snote' title='3'>This is not to say it&#8217;s unwelcome. It&#8217;s great to read, it just can&#8217;t protect me from screwups I&#8217;ve already made.<\/span>. But I was mildly alarmed when people started warning me about the dangers of using regex to parse HTML. Apparently this is a foolish thing to do, and many programmer hours have been lost to this task.<\/p>\n<p>This is slightly confusing, because I&#8217;m sort of doing this and it seems to be working fine.<\/p>\n<h3>Just a Regular Expression<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_patterns.jpg' width=100% alt='REGEX is for pattern matching. But not these kinds of patterns.' title='REGEX is for pattern matching. But not these kinds of patterns.'\/><\/div><div class='mouseover-alt'>REGEX is for pattern matching. But not these kinds of patterns.<\/div><\/p>\n<p>A regular expression &#8211; regex for short &#8211; is a system for finding strings within text. 
Most of my experience with it comes from the Linux terminal where you might want to do tasks like:<\/p>\n<p>1) List all files that start with &#8220;foobar&#8221;.<\/p>\n<p>2) Find all text files that contain the word &#8220;widget&#8221;.<\/p>\n<p>3) Delete all files that begin with a number, followed by &#8220;potato&#8221;.<\/p>\n<p>Here is a super-simple regex that will match either &#8220;serialise&#8221; or &#8220;serialize&#8221;:<\/p>\n<p><code>seriali[sz]e<\/code><\/p>\n<p>Seems pretty simple. It will look for &#8220;seriali&#8221;, followed by any letter from the set [sz], followed by &#8220;e&#8221;. It's simple and readable here, but these things can get out of hand quickly.<\/p>\n<p>Having said that, I use regex so rarely that I can never remember how it works. It&#8217;s really tough to remember how to perform a complex task that only pops up once or twice a decade.<\/p>\n<p>Here is the code I&#8217;m using to read from Metacritic:<\/p>\n<pre lang=\"csharp\" line=\"1\">\r\n\/\/Now we have to sort through Metacritic's scatterbrained HTML.\r\nHtmlDocument html = new HtmlDocument();\r\nhtml.LoadHtml (page);\r\nHtmlNode node_body = html.DocumentNode.SelectSingleNode(\"\/\/body\");\r\nHtmlNode node_scores = node_body.SelectSingleNode (\"\/\/div[@class='score_summary metascore_summary']\");\r\n\r\n\/\/Grab the text inside of this HTML. It SHOULD be a critical score.\r\nHtmlNode node_score_container = node_scores.SelectSingleNode (\".\/\/*[@class='metascore_anchor']\");\r\nif (node_score_container != null) {\r\n  if (int.TryParse (node_score_container.InnerText, out int possible_score)) {\r\n    \/\/Make sure we grabbed a valid number before we update the database.\r\n    if (possible_score > 0 && possible_score <= 100) {\r\n      g.score_critic = possible_score;\r\n    }\r\n  }\r\n}<\/pre>\n<p>On line 2, I tell <a href=\"https:\/\/html-agility-pack.net\/\">Html Agility pack<\/a> (HAp) that I'm creating a new document. From here I could build my own webpage a bit at a time using code, provided I'd just hit my head and forgotten the eleven dozen easier ways of creating webpages. However, we're not here to make a page, but read one. So in line 3 I take the raw text that I've already downloaded from Metacritic and give it to HAp.<\/p>\n<p>In line 4 I tell HAp to find me the bit of the document that contains the &lt;body&gt; tag. This will give me everything from &lt;body&gt; to &lt;\/body&gt;, effectively the entire page minus the header. Then in line 5 I take that body, and I search within it for a &lt;div&gt; tag with a class of \"score_summary metascore_summary\".<\/p>\n<p>Looking at the code weeks after writing it, I notice I have a design flaw. Between lines 4 and 5, I should check to make sure node_body isn't NULL. Technically, all valid webpages will always have exactly one &lt;body&gt; tag, so this code is fine for all pages I might encounter from Metacritic. However, there could be some weird edge cases - perhaps the internet flakes out somewhere between my residential connection and Metacritic's servers - where I might get a blank page. Such a page would have no body tag. Thus node_body would be null, and thus the program would crash when I try to use it on line 5. Which means that connection problems <b>might<\/b> crash my program.<\/p>\n<p>Likewise, line 8 doesn't check to make sure node_scores is valid before using it. This means that the Metacritic designer can crash my program. If they update their site design \/ CSS and rename the element that contains the score to something else, then my program will crash when it tries to parse the page.<\/p>\n
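<p>For what it's worth, the fix is only a few lines. Here's a rough sketch of what those missing NULL checks might look like - this isn't the code the bot actually ran, and GameRecord is just a stand-in for whatever my game class is really called:<\/p>\n<pre lang=\"csharp\">\r\n\/\/Same parsing logic as above, but bail out gracefully instead of crashing on a NULL node.\r\nvoid ParseCriticScore (string page, GameRecord g)\r\n{\r\n  HtmlDocument html = new HtmlDocument();\r\n  html.LoadHtml (page);\r\n\r\n  HtmlNode node_body = html.DocumentNode.SelectSingleNode (\"\/\/body\");\r\n  if (node_body == null)\r\n    return; \/\/Blank or truncated page - maybe the connection flaked out. Skip this game.\r\n\r\n  HtmlNode node_scores = node_body.SelectSingleNode (\"\/\/div[@class='score_summary metascore_summary']\");\r\n  if (node_scores == null)\r\n    return; \/\/Metacritic renamed or removed the score element. Skip rather than crash.\r\n\r\n  HtmlNode node_score_container = node_scores.SelectSingleNode (\".\/\/*[@class='metascore_anchor']\");\r\n  if (node_score_container == null)\r\n    return;\r\n\r\n  if (int.TryParse (node_score_container.InnerText, out int possible_score)) {\r\n    \/\/Make sure we grabbed a valid number before we update the database.\r\n    if (possible_score > 0 && possible_score <= 100) {\r\n      g.score_critic = possible_score;\r\n    }\r\n  }\r\n}<\/pre>\n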
<p>In any case, that bit on line 5 where it says <code>\"\/\/div[@class='score_summary metascore_summary']\"<\/code> is an XPath query - not technically a regex, but the same general flavor of pattern matching. So I'm <em>sort of<\/em> using regex-style patterns to pick through HTML. However, I strongly suspect that people cautioning against using regex are actually cautioning against <b>only<\/b> regex. There's a certain temptation to make these massively complex expressions that can perform intricate searches within unpredictable text. For example, this regex will match any numeral:<\/p>\n<p><code>[+-]?(\\d+(\\.\\d+)?|\\.\\d+)([eE][+-]?\\d+)?<\/code><\/p>\n<p>and this one:<\/p>\n<p><code>^(http|https|ftp):[\\\/]{2}([a-zA-Z0-9\\-\\]+\\.[a-zA-Z]{2,4})(:[0-9]+)?\\\/?([a-zA-Z0-9\\-\\._\\?\\,\\'\\\/\\\\\\+&amp;%\\$#\\=~]*)<\/code><\/p>\n<p>is actually a dual-purpose expression that will:<\/p>\n<ol>\n<li>Match any valid URL.<\/li>\n<li>Get you punched in the face by the poor sod that has to maintain your code later and figure out why it isn't working properly. Protip: You're missing a period just before the second closing bracket.<\/li>\n<\/ol>\n<p>My guess is that lots of people have tried to construct various too-clever-by-half techniques for sorting through HTML with regex, and wound up making incomprehensible code that doesn't work properly. I'm reasonably sure that what I'm doing with HAp is allowed. HAp is actually tearing the whole document apart and keeping track of how the various tags are structured. I'm not using pattern matching to parse the HTML. HAp already did that for me. I'm just using a pattern to tell HAp which bit of the already-parsed document I want.<\/p>\n<p>It's fine.<\/p>\n<p>It's probably fine.<\/p>\n<p>It's mostly probably fine as far as I know.<\/p>\n<h3>A Fragile System<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_eggshells.jpg' width=100% alt='Ironic that eggs have become a universal shorthand for fragility, considering that eggs are actually pretty tough when compared to containers of similar mass and thickness.' title='Ironic that eggs have become a universal shorthand for fragility, considering that eggs are actually pretty tough when compared to containers of similar mass and thickness.'\/><\/div><div class='mouseover-alt'>Ironic that eggs have become a universal shorthand for fragility, considering that eggs are actually pretty tough when compared to containers of similar mass and thickness.<\/div><\/p>\n<p>It's a bit fussy to pull the data out of Metacritic. For example, the number of critic reviews is expressed within plain text. On <a href=\"https:\/\/www.metacritic.com\/game\/pc\/half-life-2\">the Half-Life 2 page<\/a>, you can see it says \"based on 81 Critic Reviews\". What I need to do is find a specific element within the layout, extract that sentence, then step through it a word at a time until I find a word that resolves to a number.<\/p>\n
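<p>The word-stepping part sounds fancier than it is. Something like this - a made-up helper for illustration, not lifted verbatim from the bot:<\/p>\n<pre lang=\"csharp\">\r\n\/\/Walk the sentence one word at a time and return the first word that parses as a number.\r\n\/\/So \"based on 81 Critic Reviews\" gives us 81.\r\nint FindFirstNumber (string sentence)\r\n{\r\n  foreach (string word in sentence.Split (' ')) {\r\n    if (int.TryParse (word, out int value))\r\n      return value;\r\n  }\r\n  \/\/No number found. The caller treats zero as \"no review count available\".\r\n  return 0;\r\n}<\/pre>\n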
<p>This whole thing is incredibly fragile. Pretty much any change to the Metacritic front end will break it. A major site overhaul would force me to re-write big chunks of code. I'm not sure if real web scrapers use this sort of design-specific targeting to get their data, or if there's a more flexible \/ future-proof way of going about this. I dislike having so many things hardcoded like this. I don't like having site-specific markup (CSS classes like 'metascore_summary') embedded in my source code. My first instinct is to build a more generalized parser with some sort of settings file that would be comprehensible to a theoretical end-user. Perhaps some way of expressing to the program, \"When you go looking for the user score, look for a DIV with the class name of 'user_score'.\" Then when Metacritic does a major overhaul, you just need to fiddle with a settings file rather than edit the source and redeploy the program.<\/p>\n
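<p>Just to make that idea concrete: the settings file could be as dumb as a list of name=class pairs that the bot reads at startup. A rough sketch, with a made-up file name and format - nothing like this exists in the actual bot:<\/p>\n<pre lang=\"csharp\">\r\n\/\/Look up which CSS class to use for a given field, from a settings file of\r\n\/\/\"field=css_class\" lines, e.g.:  user_score=user_score\r\nstring SelectorFor (string field)\r\n{\r\n  foreach (string line in File.ReadAllLines (\"selectors.txt\")) {\r\n    string[] parts = line.Split ('=');\r\n    if (parts.Length == 2 && parts[0].Trim () == field)\r\n      return parts[1].Trim ();\r\n  }\r\n  return null; \/\/Field not listed in the settings file. Caller skips it.\r\n}\r\n\r\n\/\/Later, build the XPath from the settings instead of hardcoding the class name in the source:\r\nHtmlNode node_user = node_body.SelectSingleNode (\"\/\/div[@class='\" + SelectorFor (\"user_score\") + \"']\");<\/pre>\n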
<p>But to design a system like that, I think I'd need a little more experience with this sort of task. Without first-hand experience trying to harvest data from disparate sites as they evolve over time, my initial design is probably going to be naive. Still, this is something I'd explore if I was going to maintain this program.<\/p>\n<h3>Getting More Info<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_google.jpg' width=100% alt='I&apos;m old enough to remember the crazy pre-internet days when you had to physically drive to Google headquarters in Mountain View, CA to get your search results.' title='I&apos;m old enough to remember the crazy pre-internet days when you had to physically drive to Google headquarters in Mountain View, CA to get your search results.'\/><\/div><div class='mouseover-alt'>I&apos;m old enough to remember the crazy pre-internet days when you had to physically drive to Google headquarters in Mountain View, CA to get your search results.<\/div><\/p>\n<p>Once the database is seeded with the basic info, the bot goes on to get information from Wikipedia and Steam. Since Metacritic doesn't provide links to those places, I have to search for them.<\/p>\n<p>So my bot simply issues a search query to Google and takes the top result. I just do a search using the same name, platform, and release year I've already collected. For <em>Half-Life 2<\/em> the query would be:<\/p>\n<p>\"Half-Life 2\" 2004 game wikipedia<\/p>\n<p>The word \"game\" guards against collisions between the game I'm interested in and any same-name movies \/ comics \/ shows \/ food that might exist. The year avoids collisions between same-name sequels that you run into with games like <i>DOOM<\/i> and <i>Tomb Raider<\/i>.<\/p>\n<p>In the vast majority of cases, the top search result is what I'm looking for. If it isn't what I'm looking for, then very likely the game in question doesn't HAVE a Wikipedia page. And that's fine. I'll end up at some random non-Wikipedia page<span class='snote' title='4'>usually a game-specific wiki.<\/span> that doesn't contain any of the tags my bot is looking for. However, in a very small number of cases, I'll run into a situation where:<\/p>\n<ol>\n<li>The top result is not about the game.<\/li>\n<li>The top result IS a Wikipedia page.<\/li>\n<li>The Wikipedia page DOES have a little infobox full of data that matches the kinds of data I'm looking for. For example: publisher, composer, writer, etc.<\/li>\n<\/ol>\n<p>In these rare cases, the bot ends up harvesting all of that data and putting it into the database. This is why I never bothered sharing any of that information in previous entries. I knew some fraction of them contained garbage data. I toyed around with ways I might double-check that I arrived at the proper Wikipedia page. Maybe test the page title against the name of the game? Maybe look for the information on the release date and make sure it matches? There are a lot of ways you could do this, but I never got around to it.<\/p>\n<p>One humorous note is that apparently Google is really picky about how many searches a bot can do. They don't publish official numbers on how many requests are permitted, but in my testing it seemed like Google would shut me off after just a couple hundred queries. After that, it would just return code 429 (Too many requests) for all queries. According to Google, the correct way to handle this is to have a cooldown timer that doubles every time you get a 429. So you wait 5 seconds, and then try again. If you get another 429 then you wait 10, then 20, then 40, etc. In practice, it seems like these time-outs would last about an hour and a half.<\/p>\n
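<p>In code, that back-off is only a handful of lines. A rough sketch of the idea - client and search_url here are stand-ins, not the bot's actual plumbing:<\/p>\n<pre lang=\"csharp\">\r\n\/\/The cooldown Google asks for when it starts returning 429: wait 5 seconds, then 10, 20, 40...\r\nint delay_seconds = 5;\r\nHttpResponseMessage response = client.GetAsync (search_url).Result;\r\nwhile ((int)response.StatusCode == 429 && delay_seconds < 7200) {\r\n  System.Threading.Thread.Sleep (delay_seconds * 1000);\r\n  delay_seconds *= 2; \/\/Double the cooldown after every rejection.\r\n  response = client.GetAsync (search_url).Result;\r\n}<\/pre>\n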
<p>I tried fiddling with the frequency of requests, but no matter how slowly I made them, I always hit that 1.5 hour time-out after a couple hundred requests. This was the biggest bottleneck the bot had to deal with. Metacritic, Steam, and Wikipedia were all happy to handle a request every few seconds for hours on end, but Google was <b>really<\/b> stingy<span class='snote' title='5'>Which is fine. I mean, it's their service. They're not obligated to serve bots or anything. It's just that this was something I had to deal with.<\/span>.<\/p>\n<h3>Bing!<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/bing.jpg' width=100% alt='This is what happens when domain squatters own all of the English words. We have to name our search engines after cartoon sound effects. I&apos;m looking forward to future tech companies like Wham, Zap, Kapow, and whatever you call the sound a slide whistle makes.' title='This is what happens when domain squatters own all of the English words. We have to name our search engines after cartoon sound effects. I&apos;m looking forward to future tech companies like Wham, Zap, Kapow, and whatever you call the sound a slide whistle makes.'\/><\/div><div class='mouseover-alt'>This is what happens when domain squatters own all of the English words. We have to name our search engines after cartoon sound effects. I&apos;m looking forward to future tech companies like Wham, Zap, Kapow, and whatever you call the sound a slide whistle makes.<\/div><\/p>\n<p>In the end, I got tired of waiting hours and hours to get all the search results, and I switched to using Bing. Bing uses the exact same query format, returns very similar search results, and seems to have no safeguards whatsoever. I was able to make as many queries as I liked.<\/p>\n<p>I found a few cases where I'd wound up harvesting data from a completely unrelated Wikipedia page. Like, maybe Shoot Guy IV would have the Wikipedia info for a documentary about the assassination of JFK. Bing would see the words \"guy\" and \"shoot\" buried somewhere in the entry and conclude that this must be the page I'm looking for. (Bing is terrible.) I'm willing to bet most of the erroneous Wikipedia pages were the work of Bing. Again, this is something I would have fixed if I was going to continue working on this project.<\/p>\n<p>So I had my bot preferentially use Google for as long as possible, and then resort to Bing once Google started giving it the silent treatment.<\/p>\n<h3>Results<\/h3>\n<p>I made <a href=\"?p=49237\">a few interesting charts<\/a> with the resulting data, but I was always a little uneasy about it. I was afraid this would happen:<\/p>\n<ol>\n<li>I post to my blog: \"Notice how review scores trend lower for Xbox games than for Playstation games. Maybe this is an artifact of their different release strategies, or maybe it's indicative of various hardware problems. So here's 2,000 words of speculation on marketing strategies, hardware comparisons, corporate priorities, and the ways that publishers have used soft bribes to nudge review scores.\"<\/li>\n<li>A news site gets wind of it and publishes some clickbait horseshit: \"Ex-Gamedev uses math to prove that Xbox is inferior to Playstation!\"<\/li>\n<li>The various tribals show up at my site, screeching about how I've mistreated or misrepresented their platform. I get accused of being a \"Sony Shill\".<\/li>\n<li>Someone looks at my methodology and notices my completely amateur data collection and statistical analysis, and I get dragged over the coals for my shoddy work.<\/li>\n<\/ol>\n<p>The last one is the only one I really care about. #3 is annoying, but it's basically part of the job. I've <b>still<\/b> got crazy people howling at me over <a href=\"https:\/\/www.youtube.com\/watch?v=M8U4k2Ik6yk\">my Fallout video<\/a>, and that thing is over 3 months old. The best you can do is wait for them to get bored and leave and try to get the sane ones to stick around.<\/p>\n<p>But #4 would really sting, because I'd be contributing to the overall confusion and ignorance we have going on in this industry<span class='snote' title='6'>I don't just mean among fans. I mean all the way from fans, to developers, to executives, to gaming media, to non-gaming media.<\/span>. Posting shoddy analysis is fine if it's just a small group of us hammering away at the data and trying to extract signal from the noise, but it would be a disaster if that armchair analysis were to escape out into the wider culture.<\/p>\n<h3>Wrapping Up<\/h3>\n<p>I do find <a href=\"?p=49109\">the sawtooth pattern<\/a> in PC titles to be really interesting. I might shove that into a column \/ video at some point down the line, with a thick coating of disclaimers that I Am Not a Statistician.<\/p>\n<p>In the end this was an amusing project, but I think it was more useful as a programming exercise than as a data-harvesting tool. And that's fine. I don't have the expertise<span class='snote' title='7'>Or time, really.<\/span> to make use of the data, but I had a ton of fun programming the dang thing. It was great to work in an environment with so little friction.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My bot is now downloading pages from Metacritic, one at a time, at the rate of a page every couple of seconds. 
This would be painfully slow if we were trying to read something large-scale, but right now we&#8217;re just scraping for PC [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[66],"tags":[],"class_list":["post-49922","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49922","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=49922"}],"version-history":[{"count":30,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49922\/revisions"}],"predecessor-version":[{"id":49953,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49922\/revisions\/49953"}],"wp:attachment":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=49922"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=49922"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=49922"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}