Scraping Part 4: THE FINAL CHAPTER

By Shamus Posted Thursday May 7, 2020

Filed under: Programming 63 comments

My bot is now[1] downloading pages from Metacritic, one at a time, at the rate of a page every couple of seconds. This would be painfully slow if we were trying to read something large-scale, but right now we’re just scraping for PC games that scored above 30 over the last 19 years. That’s well under 1,000 games.

Of course, downloading these pages isn’t useful unless I can pull information out of them. Much earlier in this series I mentioned I’m using the Html Agility Pack. This library can parse HTML for me and return the bits I’m interested in.

One of the funny things about this project is that I’m so far out of my comfort zone / area of expertise that I don’t even know what I don’t know. Not only am I likely making lots of hilarious blunders, but I don’t even know that I’m making them.

This is strangely liberating. When I know what I’m doing, then every cut corner makes me feel vaguely guilty. But when you don’t know what you’re doing, you’re free of the obligation to do things the Right Way(tm), because you don’t know what the right way is! As far as I know, I’ve just written the best web scraper in the history of scraping[2].

Unlike a lot of projects, I’m posting this one after-the-fact, so I can’t take advantage of the advice people are sharing in the comments. The project is done, so your advice is useless without a time machine[3]. But I was mildly alarmed when people started warning me about the dangers of using regex to parse HTML. Apparently this is a foolish thing to do, and many programmer hours have been lost to this task.

This is slightly confusing, because I’m sort of doing this and it seems to be working fine.

Just a Regular Expression

REGEX is for pattern matching. But not these kinds of patterns.

A regular expression – regex for short – is a system for finding strings within text. Most of my experience with it comes from the Linux terminal where you might want to do tasks like:

1) List all files that start with “foobar”.

2) Find all text files that contain the word “widget”.

3) Delete all files that begin with a number, followed by “potato”.

Here is a super-simple regex that will match either “serialise” or “serialize”:

seriali[sz]e

It will look for “seriali”, followed by any letter from the set [sz], followed by “e”. It seems simple and readable here, but these things can get out of hand quickly.
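Just to make that concrete, here’s roughly what using that pattern looks like in C#. This is a standalone sketch, not code from the bot:

using System;
using System.Text.RegularExpressions;

class RegexDemo {
  static void Main () {
    //The same super-simple pattern from above.
    Regex spelling = new Regex ("seriali[sz]e");
    Console.WriteLine (spelling.IsMatch ("serialize the data")); //True
    Console.WriteLine (spelling.IsMatch ("serialise the data")); //True
    Console.WriteLine (spelling.IsMatch ("cereal aisle"));       //False
  }
}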

Having said that, I use regex so rarely that I can never remember how it works. It’s really tough to remember how to perform a complex task that only pops up once or twice a decade.

Here is the code I’m using to read from Metacritic:

 1  //Now we have to sort through Metacritic's scatterbrained HTML.
 2  HtmlDocument html = new HtmlDocument();
 3  html.LoadHtml (page);
 4  HtmlNode node_body = html.DocumentNode.SelectSingleNode("//body");
 5  HtmlNode node_scores = node_body.SelectSingleNode ("//div[@class='score_summary metascore_summary']");
 6
 7  //Grab the text inside of this HTML. It SHOULD be a critical score.
 8  HtmlNode node_score_container = node_scores.SelectSingleNode (".//*[@class='metascore_anchor']");
 9  if (node_score_container != null) {
10    if (int.TryParse (node_score_container.InnerText, out int possible_score)) {
11      //Make sure we grabbed a valid number before we update the database.
12      if (possible_score > 0 && possible_score <= 100) {
13        g.score_critic = possible_score;
14      }
15    }
16  }

On line 2, I tell Html Agility Pack (HAp) that I’m creating a new document. From here I could build my own webpage a bit at a time using code, provided I’d just hit my head and forgotten the eleven dozen easier ways of creating webpages. However, we’re not here to make a page, but to read one. So in line 3 I take the raw text that I’ve already downloaded from Metacritic and give it to HAp.

In line 4 I tell HAp to find me the bit of the document that contains the <body> tag. This will give me everything from <body> to </body>, effectively the entire page minus the header. Then in line 5 I take that body, and I search within it for a <div> tag with a class of “score_summary”.

Looking at the code weeks after writing it, I notice I have a design flaw. Between lines 4 and 5, I should check to make sure node_body isn’t NULL. Technically, all valid webpages will always have exactly one <body> tag, so this code is fine for all pages I might encounter from Metacritic. However, there could be some weird edge cases – perhaps the internet flakes out somewhere between my residential connection and Metacritic’s servers – where I might get a blank page. Such a page would have no body tag. Thus node_body would be null, and thus the program would crash when I try to access it on line 5. Which means that connection problems might crash my program.

Likewise, line 8 doesn’t check to make sure node_scores is valid before using it. This means that Metacritic’s designers can crash my program. If they update their site design / CSS and rename the element that contains the score to something else, then my program will crash when it tries to parse the page.
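For what it’s worth, the missing guards would only take a couple of lines. This is a hypothetical defensive version of lines 4 through 8, not what the bot actually ran:

//Hypothetical defensive version of lines 4-8. Not the code the bot actually ran.
HtmlNode node_body = html.DocumentNode.SelectSingleNode ("//body");
if (node_body == null)
  return; //Blank or truncated page. Bail out instead of crashing.

HtmlNode node_scores = node_body.SelectSingleNode ("//div[@class='score_summary metascore_summary']");
if (node_scores == null)
  return; //Metacritic renamed or removed the score element. Skip this game instead of crashing.

HtmlNode node_score_container = node_scores.SelectSingleNode (".//*[@class='metascore_anchor']");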

In any case, that bit on line 5 where it says "//div[@class='score_summary metascore_summary']" is a regex. So I’m technically using regex to parse HTML. However, I strongly suspect that people cautioning against using regex are actually cautioning against using only regex. There’s a certain temptation to make these massively complex expressions that can perform intricate searches within unpredictable text. For example, this regex will match any numeral:

[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?

and this one:

^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-\]+\.[a-zA-Z]{2,4})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&amp;%\$#\=~]*)

is actually a dual-purpose expression that will:

  1. Match any valid URL.
  2. Get you punched in the face by the poor sod that has to maintain your code later and figure out why it isn’t working properly. Protip: You’re missing a period just before the second closing bracket.

My guess is that lots of people have tried to construct various too-clever-by-half techniques for sorting through HTML with regex, and wound up making incomprehensible code that doesn’t work properly. I’m reasonably sure that what I’m doing with HAp is allowed. HAp is actually tearing the whole document apart and keeping track of how the various tags are structured. I’m not using regex to parse the HTML. HAp already did that for me. I’m just using regex to tell HAp which bit of the already-parsed document I want.

It’s fine.

It’s probably fine.

It’s mostly probably fine as far as I know.

A Fragile System

Ironic that eggs have become a universal shorthand for fragility, considering that eggs are actually pretty tough when compared to containers of similar mass and thickness.

It’s a bit fussy to pull the data out of Metacritic. For example, the number of critic reviews is expressed within plain text. On the Half-Life 2 page, you can see it says “based on 81 Critic Reviews”. What I need to do is find a specific element within the layout, extract that sentence, then step through it a word at a time until I find a word that resolves to a number.
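In code, that word-by-word hunt is only a few lines. Here’s a sketch, assuming review_text already holds the extracted sentence. That variable name is made up for illustration:

//Sketch: walk the sentence one word at a time until something parses as a number.
//Assumes review_text holds something like "based on 81 Critic Reviews".
int review_count = 0;
foreach (string word in review_text.Split (' ')) {
  if (int.TryParse (word, out int parsed)) {
    review_count = parsed;
    break;
  }
}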

This whole thing is incredibly fragile. Pretty much any change to the Metacritic front end will break it. A major site overhaul would force me to re-write big chunks of code. I’m not sure if real web scrapers use this sort of design-specific targeting to get their data, or if there’s a more flexible / future-proof way of going about this. I dislike having so many things hardcoded like this. I don’t like having site-specific markup (CSS classes like ‘metascore_summary’) embedded in my source code. My first instinct is to build a more generalized parser with some sort of settings file that would be comprehensible to a theoretical end-user. Perhaps some way of expressing to the program, “When you go looking for the user score, look for a DIV with the class name of ‘user_score’.” Then when Metacritic does a major overhaul, you just need to fiddle with a settings file rather than edit the source and redeploy the program.

But to design a system like that, I think I’d need a little more experience with this sort of task. Without first-hand experience trying to harvest data from disparate sites as they evolve over time, my initial design is probably going to be naive. Still, this is something I’d explore if I was going to maintain this program.
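Just to make the idea concrete, the settings-driven version might boil down to something like this. The field names and the notion of a config file are hypothetical; the real bot hardcodes its XPaths:

//Hypothetical settings-driven lookup. In the real bot these strings are hardcoded.
Dictionary<string, string> selectors = new Dictionary<string, string> {
  { "critic_score", "//div[@class='score_summary metascore_summary']" },
  { "user_score",   "//div[@class='user_score']" }
};
//These pairs would be loaded from a settings file, so a Metacritic redesign
//would only mean editing that file instead of recompiling the program.
HtmlNode node_scores = node_body.SelectSingleNode (selectors["critic_score"]);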

Getting More Info

I'm old enough to remember the crazy pre-internet days when you had to physically drive to Google headquarters in Mountain View, CA to get your search results.

Once the database is seeded with the basic info, the bot goes on to get information from Wikipedia and Steam. Since Metacritic doesn’t provide links to those places, I have to search for them.

So my bot simply issues a search query to Google and takes the top result. I just do a search using the same name, platform, and release year I’ve already collected. For Half-Life 2 the query would be:

“Half-Life 2” 2004 game wikipedia

The word “game” guards against collisions between the game I’m interested in and any same-name movies / comics / shows / food that might exist. The year avoids collisions between same-name sequels that you run into with games like DOOM and Tomb Raider.
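Building the query is just string pasting. Roughly like this sketch, where g.title and g.year stand in for whatever the real field names are:

//Sketch of query construction. g.title / g.year stand in for the real field names.
string query = "\"" + g.title + "\" " + g.year + " game wikipedia";
string url = "https://www.google.com/search?q=" + Uri.EscapeDataString (query);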

In the vast majority of cases, the top search result is what I’m looking for. If it isn’t what I’m looking for, then very likely the game in question doesn’t HAVE a Wikipedia page. And that’s fine. I’ll end up at some random non-Wikipedia page[4] that doesn’t contain any of the tags my bot is looking for. However, in a very small number of cases, I’ll run into a situation where:

  1. The top result is not about the game.
  2. The top result IS a Wikipedia page.
  3. The Wikipedia page DOES have a little infobox full of data that matches the kinds of data I’m looking for. For example Publisher, composer, writer, etc.

In these rare cases, the bot ends up harvesting all of that data and putting it into the database. This is why I never bothered sharing any of that information in previous entries. I knew some fraction of them contained garbage data. I toyed around with ways I might double-check that I arrived at the proper Wikipedia page. Maybe test the page title against the name of the game? Maybe look for the information on the release date and make sure it matches? There are a lot of ways you could do this, but I never got around to it.
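For the record, the page-title check would probably have looked something like this. It’s purely hypothetical – I never implemented it – and wiki_html is assumed to be the already-parsed Wikipedia page:

//Hypothetical sanity check that never made it into the bot:
//reject the result if the page title doesn't mention the game's name.
HtmlNode node_title = wiki_html.DocumentNode.SelectSingleNode ("//title");
bool looks_right = node_title != null &&
  node_title.InnerText.ToLower ().Contains (g.title.ToLower ());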

One humorous note is that apparently Google is really picky about how many searches a bot can do. They don’t publish official numbers on how many requests are permitted, but in my testing it seemed like Google would shut me off after just a couple hundred queries. After that, it would just return code 429 (Too many requests) for all queries. According to Google, the correct way to handle this is to have a cooldown timer that doubles every time you get a 429. So you wait 5 seconds, and then try again. If you get another 429 then you wait 10, then 20, then 40, etc. In practice, it seems like these time-outs would last about an hour and a half.
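In code, that cooldown is just a doubling wait. Here’s a sketch, where TryQuery is a made-up stand-in for whatever issues the request and reports the HTTP status code:

//Sketch of the doubling cooldown. TryQuery is a hypothetical stand-in for the
//code that actually issues the web request and returns the HTTP status code.
int wait_seconds = 5;
string result_page;
while (TryQuery (url, out result_page) == 429) {
  Thread.Sleep (wait_seconds * 1000); //Wait 5s, then 10s, 20s, 40s, ...
  wait_seconds *= 2;
}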

I tried fiddling with the frequency of requests, but no matter how slowly I made them, I always hit that 1.5 hour time-out after a couple hundred requests. This was the biggest bottleneck the bot had to deal with. Metacritic, Steam, and Wikipedia were all happy to handle a request every few seconds for hours on end, but Google was really stingy[5].

Bing!

This is what happens when domain squatters own all of the English words. We have to name our search engines after cartoon sound effects. I'm looking forward to future tech companies like Wham, Zap, Kapow, and whatever you call the sound a slide whistle makes.

In the end, I got tired of waiting hours and hours to get all the search results, and I switched to using Bing. Bing accepts the exact same query format, returns very similar search results, and seems to have no safeguards whatsoever. I was able to make as many queries as I liked.

I found a few cases where I’d wound up harvesting data from a completely unrelated Wikipedia page. Like, maybe Shoot Guy IV would have the Wikipedia info for a documentary about the assassination of JFK. Bing would see the words “guy” and “shoot” buried somewhere in the entry and conclude that this must be the page I’m looking for. (Bing is terrible.) I’m willing to bet most of the erroneous Wikipedia pages were the work of Bing. Again, this is something I would have fixed if I was going to continue working on this project.

So I had my bot preferentially use Google for as long as possible, and then resort to Bing once Google started giving it the silent treatment.

Results

I made a few interesting charts with the resulting data, but I was always a little uneasy about it. I was afraid this would happen:

  1. I post to my blog: “Notice how review scores trend lower for Xbox games than for Playstation games. Maybe this is an artifact of their different release strategies, or maybe it’s indicative of various hardware problems. So here’s 2,000 words of speculation on marketing strategies, hardware comparisons, corporate priorities, and the ways that publishers have used soft bribes to nudge review scores.”
  2. A news site gets wind of it and publishes some clickbait horseshit: “Ex-Gamedev uses math to prove that Xbox is inferior to Playstation!”
  3. The various tribals show up at my site, screeching about how I’ve mistreated or misrepresented their platform. I get accused of being a “Sony Shill”.
  4. Someone looks at my methodology and notices my completely amateur data collection and statistical analysis, and I get dragged over the coals for my shoddy work.

The last one is the only one I really care about. #3 is annoying, but it’s basically part of the job. I’ve still got crazy people howling at me over my Fallout video, and that thing is over 3 months old. The best you can do is wait for them to get bored and leave, and try to get the sane ones to stick around.

But #4 would really sting, because I’d be contributing to the overall confusion and ignorance we have going on in this industry[6]. Posting shoddy analysis is fine if it’s just a small group of us hammering away at the data and trying to extract signal from the noise, but it would be a disaster if that armchair analysis were to escape out into the wider culture.

Wrapping Up

I do find the sawtooth pattern in PC titles to be really interesting. I might shove that into a column / video at some point down the line, with a thick coating of disclaimers that I Am Not a Statistician.

In the end this was an amusing project, but I think it was more useful as a programming exercise than as a data-harvesting tool. And that’s fine. I don’t have the expertise[7] to make use of the data, but I had a ton of fun programming the dang thing. It was great to work in an environment with so little friction.

 

Footnotes:

[1] Well, not RIGHT now. This series was written after the bot was completed.

[2] Despite the lack of proof, I’m fairly confident that I have not actually written the best web scraper in the history of scraping.

[3] This is not to say it’s unwelcome. It’s great to read, it just can’t protect me from screwups I’ve already made.

[4] usually a game-specific wiki.

[5] Which is fine. I mean, it’s their service. They’re not obligated to serve bots or anything. It’s just that this was something I had to deal with.

[6] I don’t just mean among fans. I mean all the way from fans, to developers, to executives, to gaming media, to non-gaming media.

[7] Or time, really.




63 thoughts on “Scraping Part 4: THE FINAL CHAPTER”

  1. Anonymous Coward says:

    Regular expressions can only parse what is called regular languages in the Chomsky hierarchy, which HTML isn’t.

    Which is fancy way of saying: If you try to write a regex that takes some string and tries to answer: “is this valid HTML?”, I will *always* be able to create an example where it gives the wrong answer.

    Notably you’ll have a hard time checking if every opening tag has a corresponding closing tag (if needed).

    (Technically most regex implementation actually expand a bit on regular expressions and give it a bit more power, but if I remember correctly, none enough to parse HTML)

    So yeah, it’s fine for your use.

    1. tmtvl says:

      Well, with subrules it shouldn’t be too difficult.

      rule xml {
      || ‘<‘ \w+ <tag-properties> ‘/>’ # Singular closed tag.
      || <-[<]> # Regular text.
      || ‘<‘ (\w+) <tag-properties> ‘>’ <xml> ‘</’ $0 ‘>’ # Content between an open and close tag.
      }

      1. Kyte says:

        <div class=”outer”><div class=”inner”>Sample text.</div></div>

        Using pure regex, it’s impossible to parse the inner div. You need to add an additional construct, which would be you manually taking the inner xml and parsing it again. This augments your finite state automaton (the regex) with a stack, making it into a push-down automaton, which is what you need to parse a context-free language like well-defined HTML. If the HTML is not well defined? That’s another story.

        1. tmtvl says:

          which would be you manually taking the inner xml and parsing it again

          Yes, that’s what the

          <xml>

          is. It’s a recursive subrule. That’s how you build parsing trees with regular expressions.

          1. Cubic says:

            What he pointed out was that, in that case, it’s not a regular expression anymore (in the original sense of automata theory).

            Various regexp packages can provide whatever functionality they want, of course. I can’t even say if Perl regexps are definitely not Turing complete. If you have recursion, I’d guess they probably are.

            https://en.wikipedia.org/wiki/Regular_expression

            1. tmtvl says:

              Well, as you may be aware, Everything You Know About Regexes is Wrong.

              Also PSIX regular expression syntax is love.

        2. Decius says:

          You parse the inner div the same way you parse the outer div- you look for a , zero or more , an equal number of , and a

          1. Decius says:

            … for a <div>, zero or more (zero or more <div>, an equal number of <.div>), and then a </div>.

            Gotta properly escape that.

    2. SidheKnight says:

      I thought “//div[@class=’score_summary metascore_summary’]” was XPath, not regex. Unless Shamus is using the term “regular expressions” in the broad sense.

      1. EwgB says:

This does indeed look like XPath. It seems Shamus is using “regular expression” as a synonym for “pattern matching”, as do many programmers without a formal education in automata theory. Not that you need that to be a good programmer, to be honest. My knowledge of the difference between deterministic and non-deterministic regular expression engines has been useful to me in my daily work exactly zero times, as have many of the things I learned in university. Not that I regret learning them, I found many of them very interesting, but they are not very useful for most software engineers who do not go into academia.

        Luckily I also live in a country with free higher education, so my curiosity didn’t cost me a fortune, only (too much) time.

  2. Dreadjaws says:

    The last one is the only one I really care about. #3 is annoying, but it’s basically part of the job. I’ve still got crazy people howling at me over my Fallout video, and that thing is over 3 months old.

    Serves you right for wasting your time with such a bland franchise as Fallout instead of caring exclusively about THEBESTGAMEEVARRRRR!!!!1111, which is, of course, Dark Souls.

    (Bing is terrible.)

    I’m curious as to why the thing is still around. Microsoft tends to abandon crap that no one uses. We’re way past the time were pushing for it would be of any use. Yet, any new installation of Windows still defaults to Bing as search engine until you install a new browser.
    Edit: Oh, wait, nevermind. Apparently it’s the third largest search engine. It says a lot about how the average user just takes whatever comes installed.

    1. tmtvl says:

      It says a lot about how the average user just takes whatever comes installed.

      Well, that’s Microsoft’s entire market strategy. It doesn’t matter how shit their products are, they’re preinstalled so everyone uses them.

    2. Nixorbo says:

      I use Bing because since I started using it I’ve gotten just over $400 in various gift cards from Microsoft Rewards, so I’m sure that dozens of us, maybe even baker’s dozens, use Bing intentionally.

    3. Nimrandir says:

      . . . Dark Souls.

      Now you’ve done it. Three years from now, some random person is going to roll in here and start waxing philosophical about regex invincibility frames and paired scraping, while conspicuously linking to their own website. At least Shamus has his response queued up.

    4. SidheKnight says:

      I’m curious as to why Shamus went from Google to Bing without giving a thought to DuckDuckGo.
      Assuming he didn’t. Maybe he did and he just didn’t mention it, in which case.. why?

      1. Cubic says:

        I’ve read that DuckDuckGo actually uses Bing as the backend (licensed use, even). They just provide a search frontend without the personalization/tracking stuff.

        1. Paul Spooner says:

          It is my understanding that all search engines these days buy their results from Google or Bing. Maybe there’s a third database out there somewhere, but internet search is one of those scaling problems that make it almost impossible to compete unless you’re already established. We’re really very lucky (and I can hardly believe I’m saying this) that Microsoft is such a tech giant with fingers in all the pies, or Google would have a soft monopoly on search results.

      2. Shamus says:

        I used DDG for a few months because I liked the healthy respect for user privacy. But I was constantly put off by how terrible the search results were. Eventually I gave up and went crawling back to Google.

        Having said that: Is DDG better or worse than Bing? I don’t know.

        1. tmtvl says:

          Well, it’s Yahoo, so… it’s the best for Japanese searches, and I’m 99% sure it can tie with Google for technical subjects.

        2. SidheKnight says:

          I’ve been using it for only two weeks or so, but so far I find it almost as good as Google.

          I think the reason why it “fails” sometimes is because it searches “the old fashioned way”. i.e: It searches for the keywords you type based on how often they appear and relevance within the page etc etc, as opposed to Google’s method which tries to guess what you want to search based on the keywords you entered, your searching history, what other people found useful when searching for similar words.. i.e the things that make Google “magical” and user-friendly for non-tech savvy people, it’s as if it could read your mind and guess what you actually wanted to know.

This can be very useful sometimes (that’s why I foresee I’ll still use Google for some specific searches, though I haven’t missed it yet), but it can also be a problem, especially when searching for very specific and obscure stuff. Google tends to give me more popular but much less relevant search results just because they’re tangentially related to the stuff I’m actually looking for.

        3. pseudonym says:

          How recent was this duckduckgo usage? I am using duckduckgo as a daily driver now. I usually search for things about programming, because that’s what I do for a living. The results are quite good. I don’t experience any issues. Also for my personal life it is working quite well. It defaults to openstreetmap for map results.
          I find that openstreetmap is much more accurate than google maps in less inhabited areas.
          Having said that: duckduckgo is much better than when I started using it a few years ago. I can recommend a retry if it has been a few years.

          I use google sporadically. Google image search gives nice filters for licenses that allow reuse of images, so I use that sporadically. I also use youtube.

          Unfortunately, google seems to work less and less for me. It seems to prefer sites and sources that I already know. But that annoys me, because that is exactly NOT why I am using a search engine. This is especially annoying when looking for new music on youtube. If you search for a certain performer because you like his renditions of certain pieces, you get all the stuff you have already heard before…

          1. SidheKnight says:

            Glad to hear I didn’t make a mistake in choosing the duck.

            1. pseudonym says:

              Same, when I read your post ;-).

        4. Cubic says:

          I’ve used DDG for a few years now and I’d say … Google is a bit better, but DDG is usually good enough. When the results are not good enough, I try Google.

        5. Zak McKracken says:

          Search results are worse? In which regard?

          Maybe I’m just searching for the wrong things, but I usually get what I’m looking for within the first three results. Also, I really like that they provide those small info boxes, usually with a link to Wikipedia and the official homepage of the thing I was searching for if it exists.

          I kinda miss the disambiguation pages they used to have. Wonder why they removed them, because that was one huge advantage over Google.

          That said: If you’re looking for a wikipedia page, why not use the Wikipedia search? Was that somehow more difficult to figure out?

  3. Abnaxis says:

Maybe this is a weird thing to ask, but why not use the search function on Wikipedia or Steam instead of using Google or Bing? You get more “help” making sure you get the right results that way (e.g. Wikipedia usually serves you a “did you mean..?” page that you can brute force to find the video games instead of movies/food/etc)

Scraping through web pages is one of those things that I wind up needing to do briefly every 3-5 years or so, so I wouldn’t call myself an expert, but I would never touch Google and/or Bing with a 10 foot pole

    1. Bubble181 says:

      I was wondering the same thing.

  4. Groboclown says:

    One of my favorite programming jokes: “I had a problem, so I used regular expressions. Now I have two problems.”

  5. Jin says:

    Those things you highlight as regexes in your code aren’t; they are XPath selectors, which at first glance have similar amounts of visual noise to regular expressions, but completely different semantics. They are intended to be used exactly as you are using them.

    1. Echo Tango says:

      I was going to post the same thing. For extra clarification, XPaths are basically searching through nodes of HTML (or XML[1]), but regex is searching through raw text strings. So, in the context of this scraper for example, an XPath would let you find a paragraph tag, inside of an unknown number of other tags, inside of a tag with class=”gameData” or whatever. The regexes would be looking for matching angle-brackets, quotes, tag-names, etc, just trying to parse the HTML in the first place! :)

      [1] I forget how these different things relate exactly. I watched the Computerphile video on the different markup languages, specs, etc like, twice, understood it, and then had it leave my brain because it doesn’t matter for most things.

      1. Retsam says:

The main difference between HTML and XML is that XML is a language for structuring arbitrary data, while HTML is an XML-like language specifically for describing websites.

        XML can be used for arbitrary data, such as this data that I’ve just made up:

        <site>
            <users>
                <user name="Shamus" id="0" admin="true" />
                <user name="Retsam" id="19" />
                <user name="Echo Tango" id="42" />
    </users>
            <comments>
                <comment id="0">[This comment]</comment>
            </comments>
        </site>

I could find myself with an XPath selector like /site/users/user[@name="Retsam"].

        Or there are specific languages built-on XML, such as SVG, which uses XML data to draw vector graphics.


        <svg height="100" width="100">
            <circle cx="50" cy="50" r="40" stroke="black" stroke-width="3" fill="red" />
        </svg>

        This draws a red circle inside a 100 by 100 image.

To split hairs, though: HTML is, technically, not XML. It’s very similar, but breaks some rules of XML (e.g. in XML all tags must either have a closing tag (<tag></tag>) or else be self-closing (<tag />)), but in HTML this isn’t true.

        There is a version of HTML called XHTML which is a true XML-based language. But I don’t think it’s particularly widely used. (There’s a lot of surprising complexity around HTML history and versions…)

        1. Viruzzo says:

Going into even more irrelevant detail: older HTML is an extension of SGML, as XML is, so XHTML is basically bringing together two siblings.
          HTML 5, on the other hand, is *not* based on SGML anymore, but is its own thing, though still very similar since the lineage is still there.

      2. tmtvl says:

        I don’t remember my XPath too well, but isn’t xslt the superior option?

    2. Retsam says:

      Yeah, XPath selectors and CSS selectors, are pretty much the de-facto tools for working with HTML. (CSS selectors, which is the syntax used to specify parts of a website to apply CSS styling, are simpler and IMO, more readable, but a bit more limited than XPath)

      It’s a similar looking syntax to regular expressions (again, I don’t like XPath because it’s often as hard to read as Regex), but the difference is that XPath is executed by some engine which has a full understanding of how to parse HTML. The API that you’re using does all the parsing of the HTML into a tree-like structure, then the XPath is just a set of instructions for navigating that tree-like structure.

      Whereas Regular Expressions don’t know anything about HTML – they only deal with raw strings – and they just aren’t sufficiently powerful to do anything more than the simplest of tasks with HTML, and will quickly lead down a path of insanity.

  6. Groboclown says:

    On a side note, the HTML document navigation language is a form of XPath, which uses the already-parsed, broken-into-blocks, html. You’re using correctly parsed html, so you avoided the biggest pitfalls of html-regex, like incorrectly finding attributes or end tags. XPath still has issues of hard coding the html layout, so you’re right about the fragility.

    Edit: @Jin spotted it first.

  7. Darker says:

    That parameter to SelectSingleNode on line 5 is XPath, not regex, so you did the right thing after all.

  8. Mark Erikson says:

    I’m sure you’ve run across this already, but just in case anyone hasn’t seen it:

    the canonical response to “can you parse HTML with a regex?” is this SO answer that starts off with a “no”, and slowly descends into madness:

    https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

  9. Echo Tango says:

    usually a game-specific wiki

    This is one of the more annoying things with Google – “wiki” is a synonym for “wikipedia”. Probably because many people don’t know the difference, so they use them interchangeably, and mess up the machine-learning for the rest of us.

    1. Thomas says:

      Google used to be good for power users. Now it keeps second-guessing us.

      Writing site:en.wikipedia.org instead of just wikipedia would have helped here.

      1. Echo Tango says:

        Yup! Hopefully they don’t remove stuff like that. They already made it almost impossible to search for short pieces of text, or “punctuation” (aka, all of the brackets and fiddly bits in computer code), because they changed how quoted text works about a decade ago[1]. Also, they removed the ability (as far as I know) to do ANDs, ORs, or NOTs in your searches. According to this help page, you can only do ORs now. :S

        [1] Now, quoted text means “this text must always exist”, which is kind of like forcing specific strings to appear on the page, but it ignores punctuation inside of the quotes. :C

  10. Ingvar says:

    As one of the people warning against parsing HTML with regular expressions, I may also be falling foul of using domain-specific terms.

    As far as I can tell, what you are doing is “lexing HTML” with regular expressions, which is fine. You use this to find delimiters, that you then turn into structure. The bit that turns it from “a sequence of lexical tokens” to “a structure”? That’s the parsing.

    Lexing with regular expressions is what I would normally recommend. Lexical tokens are (in most cases) expressible and recognisable with a regular language[1], whereas things like HTML and other things tend to require at least context-free, if not even more powerful abstractions.

    So, when people say “don’t use regular expressions to parse HTML”, they mean that you should not be constructing a single regular expression, do a single “match” against the document and hope that the bits you care about are retrievable from match-groups in the match result. Because that’s what 59.73% (number made up, probably low-balled) of the people who say “I tried parsing HTML with a regular expression and now I’m sad” actually tried.

    1. Amanda says:

      I think you forgot to write the footnote

      1. Ingvar says:

        I am pretty sure you are correct.

[1] A regular language is exactly a language that can be recognised by a finite automaton. The general limitation with a FA (NFA, DFA, they’re equally expressive, but one trades “time” for “space”) is that they cannot count, so it is, in the general case, impossible to write a regular expression that does something as simple as “is this sequence of open/close parentheses balanced?” It is possible, if tedious, to write one that checks to a specific nesting depth.

  11. Christopher Dwight Wolf says:

    Did you really mean

    we’re just scraping for PC games that scored above 30 over the last 19 years. That’s well under 1,000 games.

    It seems like most PC games score over 30 and that is not really the mark of a good game, just not an entirely broken one.

    1. Erik says:

      He did mean exactly that, as explained in an earlier part. (Part 1, I think, where he was describing the data.) It was a way of pruning a certain class of uninteresting data from the set.

      1. Christopher Dwight Wolf says:

        Thanks, I forgot about that part.

  12. kikito says:

    Oh. Hey!

    I was the one (or one of the ones) who mentioned using regexes for html and “many programmer hours have been lost to this task”.

    As others are saying what you have there aren’t regexes, but XPath expressions.

    Here’s the docs for selectSingleNode, which confirm this: https://html-agility-pack.net/select-single-node

    So you are fine.

    I’m glad I put some fear in your heart, though. You produce better content when under stress :).

    Stay safe!

  13. pseudonym says:

    I dislike having so many things hardcoded like this. I don’t like having site-specific markup (CSS classes like ‘metascore_summary’) embedded in my source code. My first instinct is to build a more generalized …

    But you did not fall for that age-old trap! Instead you got actual work done!
    https://xkcd.com/974

    For one-off projects I’d argue that this is the right way ™.

    1. Retsam says:

      Yeah, I think it’s just the “right way to do it” full-stop.

You could pull the paths into a config file for ease of changing later, I guess; but any sort of generalized scheme for how to deal with future website changes is likely going to rely on untenable assumptions of how the website might change in the future.

We deal with this sort of thing when testing websites while developing them: testing a website is a lot like using a scraper to interact with it. And you basically do just have to hardcode selectors and change them when the site changes. You can mitigate the fragility with how you design the website (labeling significant parts of the document with consistent class names, e.g.), but obviously most websites aren’t designed with “convenience to scrapers” in mind.

    2. Echo Tango says:

      Yup! You almost never know how all of the use-cases will actually work, before you’ve written a few of them. It’s always better to write the use-cases by themselves, and then figure out how to generalize after that. :)

      1. Paul Spooner says:

        My thoughts exactly.
        Unless you’re just having fun.
        Omnisolutions!

      2. Nimrandir says:

        Coming from a discipline known for generalizing things beyond all recognition, I’d agree with working from the specific cases first. We like to have several data points before we go looking for the underlying pattern.

  14. Lars says:

    Typolice: “then every cut corner makes me feel vaguely guity.”
    Missing an L in guilty.

    1. Paul Spooner says:

You now owe an apology to the ~900 people named “Guity”.

  15. Cubic says:

    Speaking of the Beast of Mountain View, I used Google before there was a google.com.

  16. Decius says:

    For example, this regex will match any numeral:

    [+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?

    Doesn’t match “.5”, “1,000,000.5”, “1.000.000,5” “1 00000” or “IV”. That’s a far cry from “any numeral”.

    It doesn’t validate, so “0e0” matches; depending on the specific use case, that might need to be considered a number though.

    1. tmtvl says:

Hang on, is that ‘IV’ or ‘Ⅳ’? (That second one ought to be Roman Numeral Four, if the unicode manages to get through.)

  17. Decius says:

    Does wikipedia’s search feature not consistently return the page that you want, if it exists?

  18. Chad says:

    In case you get interested in the programming process again…

The way you’d handle these issues with a “real” web scraper would be to keep track of the data/sources as you go, rather than processing it in-place and only tracking the results. For example, your searcher would record the URL that it got alongside the info it extracted from the page. Many also record the page contents themselves, since storage is cheap for large projects. For your example of parsing out the number of reviews, you’d keep the sentences that you searched along with the number that you extracted. You’d keep these in some sort of data store; the details vary based on your size/speed/update requirements, but for your needs any sort of database will do (I’d suggest SQLite because it’s free, easy, available, and good enough for your needs).

    This lets you do a few cool things: you can break up your tasks over multiple programs, which lets you parallelize. This isn’t crucial for you but is helpful for many even simple spiders. It also lets you adjust for problems and update your analysis without requiring that you throw away everything and start over. For example, say that you discover that 1 out of 5 reviewers started using “user_rating” instead of “user_score”, or that the scores reported by a particular reviewer moved from a 0-100 scale to a 0-10 scale partway through 2018 (these specific examples don’t exactly fit your model, but they convey the idea). If you still have all of the data stored, you can update your code and re-run the analysis step, without waiting for search engine timeouts, re-downloading everything, etc.
    Thirdly, it lets you do comparisons over time in a way that’s not hidden by revisionist history, outages, shutdowns, and the like.

Fwiw, the biggest problem systems like this tend to have in practice is managing updates and hand-offs; that is, knowing when the parser can start working on the output of the crawler, and when analysis can work on the output of the parser. This comes up when people want to keep the data “live” within a narrow window. You should be able to ignore this problem almost entirely.

  19. MadTinkerer says:

    right now we’re just scraping for PC games that scored above 30 over the last 19 years. That’s well under 1,000 games.

1000 / 19 = 52.6 (rounded), so that’s less than one game a week? That’s far less than the total number of releases on Steam each year, for starters, and significantly less than the total games on every platform. I know it’s impossible to keep up with the “fire hose” of games released on Steam every day, and keep a full time job, but that’s not even a game a week!

    This is why curators like Steam 250 are so important. It’s not nearly everything either, but at least you can get the absolute cream of Indie games that manage to get enough reviews for an official rating.

  20. Ninety-Three says:

    I’m not sure if real web scrapers use this sort of design-specific targeting to get their data, or if there’s a more flexible / future-proof way of going about this. I dislike having so many things hardcoded like this.

    I work in this field, and there kind of is. There’s no getting around the fact that somewhere your program has to be hardcoded with “pull the number out from ‘based on 81 Critic Reviews'” and it’ll break when they redesign that part of the page, but it’s popular to follow a design pattern where all of that fragile hardcoded stuff gets put into a document called the page map, and then the program invokes more generic functions that are fed page-specific data from the map. It’s nice for maintainability because it forces you to neatly break out every piece of logic such that it’s obvious what broke when the page updated, but the main reason to do it is that you’re not just scraping a bunch of identical Metacritic pages: if you want to build a tool for multiple sites, you get to reuse a lot of high-level code between sites and just create separate page maps which tell it “On Metacritic you want to look for ‘based on 81 Critic Reviews’ and on Rotten Tomatoes you want to look for ‘Total Count: 48‘.”

  21. Vamphri says:

As a Test Engineer who has made a career in poking holes in other people’s test designs, the only thing I would question is the validity of the rating data prior to ~2006-2008. I am not convinced that the data there was created in parallel with the release of the game. Everything else seems fine, and since you released the raw data, if someone had an issue with data validity then they could just remove that portion of the data and re-run some analysis. I would not worry about producing false information since the graphs and “processed data” that you published don’t really touch any of that.
    Great work and a good read!
    Great work and a good read!
