So now I’m done messing around and being silly. It’s time to actually scrape the web for stuff. There are three different sites I’m interested in:
- Metacritic, for critic scores.
- Wikipedia, for credits regarding director, writer, producer, composer, etc. This information is spotty and I can’t think of how it might be useful right now, but I’m going to include it as part of the exercise. Also, Wikipedia often notes what franchise a game is from, which might be handy if I want to do a search that includes “all Resident Evil games” or somesuch.
- Steam, for PC -specific info like DRM, controller support, multiplayer, etc.
There’s also a bit of information that can come from any of these sources: The url for the game’s official website might be handy, and we also need to get the publisher, developer, and release date from one of these places.
Of the three sites, it seems like Metacritic is the best one to start with. It has games listed by platform, which is necessary in a structural sense. For the purposes of our database, it’s possible for the same game to have vastly different information depending on platform. For example, maybe a game is released on the Playstation 3 in 2010 by Beloved Developer, but then a year later it gets ported to the PC by Shovelware Games. Metacritic is the only place where we can get this information reliably. Steam obviously isn’t going to have non-PC data, and Wikipedia entries aren’t guaranteed to have all the per-platform data in an easy-to capture locationIt might be in the info box on the right, or it might be buried in the article text (good luck capturing THAT) or it might not be listed at all..
Metacritic even has a handy index page that you can go through: Continue reading 〉〉 “Scraping Part 3: A Well-Behaved Bot”
Shamus Young is a programmer, an author, and nearly a composer. He works on this site full time. If you'd like to support him, you can do so via Patreon or PayPal.