{"id":49866,"date":"2020-04-30T06:00:29","date_gmt":"2020-04-30T10:00:29","guid":{"rendered":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49866"},"modified":"2020-04-30T03:58:35","modified_gmt":"2020-04-30T07:58:35","slug":"scraping-part-3-a-well-behaved-bot","status":"publish","type":"post","link":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49866","title":{"rendered":"Scraping Part 3: A Well-Behaved Bot"},"content":{"rendered":"<p>So now I&#8217;m done messing around and being silly. It&#8217;s time to actually scrape the web for stuff. There are three different sites I&#8217;m interested in:<\/p>\n<ol>\n<li>Metacritic, for critic scores. <\/li>\n<li>Wikipedia, for credits regarding director, writer, producer, composer, etc. This information is spotty and I can&#8217;t think of how it might be useful right now, but I&#8217;m going to include it as part of the exercise. Also, Wikipedia often notes what franchise a game is from, which might be handy if I want to do a search that includes &#8220;all <em>Resident Evil<\/em> games&#8221; or somesuch.\u00a0<\/li>\n<li>Steam, for PC-specific info like DRM, controller support, multiplayer, etc.<\/li>\n<\/ol>\n<p>There&#8217;s also a bit of information that can come from any of these sources: The URL for the game&#8217;s official website might be handy, and we also need to get the publisher, developer, and release date from one of these places.<\/p>\n<p>Of the three sites, it seems like Metacritic is the best one to start with. It has games listed by platform, which is necessary in a structural sense. For the purposes of our database, it&#8217;s possible for the same game to have vastly different information depending on platform. For example, maybe a game is released on the PlayStation 3 in 2010 by Beloved Developer, but then a year later it gets ported to the PC by Shovelware Games. Metacritic is the only place where we can get this information reliably. 
Steam obviously isn&#8217;t going to have non-PC data, and Wikipedia entries aren&#8217;t guaranteed to have all the per-platform data in an easy-to-capture location<span class='snote' title='1'>It might be in the info box on the right, or it might be buried in the article text (good luck capturing THAT) or it might not be listed at all.<\/span>.\u00a0<\/p>\n<p>Metacritic even has a handy index page that you can go through:<!--more--><\/p>\n<p><code>www.metacritic.com\/browse\/games\/score\/metascore\/all\/pc\/filtered?view=detailed&amp;sort=desc&amp;page=0<\/code><\/p>\n<p>You can see that I can choose a platform by changing the bit where it says <code>\/pc\/<\/code> and I can choose a page by changing the number at the very end. This particular index will only list games with a rating of 30 or above, which will filter out a bunch of dross that we&#8217;re not interested in right now<span class='snote' title='2'>There&#8217;s another index that lists all games alphabetically with no filter. I might switch to that someday if I&#8217;m curious about all of the bottom-of-the-barrel stuff.<\/span>.<\/p>\n<p>So the plan is:<\/p>\n<p>Start at page 0. This page will return a list of 50 games. On this index page, all we get is the title, publication date, and a link to the full Metacritic page for the game. We grab the names, release dates, and URLs, then move on to the next page. If we find a page with no games, then we&#8217;ve run past the end of the list and it&#8217;s time to stop.\u00a0<\/p>\n<p>Once we&#8217;re done, we have the base info for these games: Title, platform<span class='snote' title='3'>Just PC-only for this first run.<\/span>, and release date. The last of these is important to avoid confusion over same-name sequels like Tomb Raider, Doom, Sim City, etc. 
So to uniquely identify a game in our database we&#8217;ll need all 3 pieces of info.\u00a0<\/p>\n<p>Once we read the whole index, we can go back and load the Metacritic page for each game and get the more detailed info: Developer, publisher, critic score, user score.<\/p>\n<p>That&#8217;s the plan, anyway. Now all we need to do is start scraping.<\/p>\n<h3>Ask for Permission, or Forgiveness?<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_robot.jpg' width=100% alt='Unless you&apos;re making a Roomba competitor, you don&apos;t want to make a robot that sucks.' title='Unless you&apos;re making a Roomba competitor, you don&apos;t want to make a robot that sucks.'\/><\/div><div class='mouseover-alt'>Unless you&apos;re making a Roomba competitor, you don&apos;t want to make a robot that sucks.<\/div><\/p>\n<p>The thing is, it&#8217;s very important to make sure your bot is well-behaved. If you&#8217;re careless, it&#8217;s possible to &#8211; completely by accident &#8211; launch a denial-of-service attack on a website by simply building an ill-behaved bot.\u00a0<\/p>\n<p>It&#8217;s not that my humble residential connection is any threat to the mighty Metacritic, but that&#8217;s no reason to be careless.\u00a0<\/p>\n<p>Now, <b>technically<\/b> you&#8217;re supposed to have your bot read the robots.txt file. That&#8217;s a plain text file that tells robots how to behave. It tells the bot how often it&#8217;s allowed to make requests, and it tells the bot where it is and isn&#8217;t allowed to go.<\/p>\n<p>Normally, I&#8217;d be a Good Citizen and follow the rules. 
My problem is that a full implementation of robots.txt compliance can be fairly complicated:<\/p>\n<p>1) Read and store all of the directories and files you&#8217;re not allowed to scrape.<br \/>\n2) Every time you need something from the site, compare the prospective URL to the deny list to make sure you&#8217;re not inside of any of the forbidden zones or grabbing a forbidden type of file.<br \/>\n3) Create fallback behavior for when you can&#8217;t get to something you need.<br \/>\n4) Set up a test on my own webserver to prove that the system works as intended, or else all of the above was a waste of time.<\/p>\n<p>It&#8217;s not hard, but it would be time-consuming and it would end up making the project a lot bigger. In a practical sense, it would be better to MANUALLY view the robots.txt file, see what it says, and then scrap the entire project if it&#8217;s too restrictive. <\/p>\n<p>Making a well-behaved bot in this case would mean writing tons of code that, if it ever got used, would mean the entire project was pointless and I shouldn&#8217;t have bothered with any of it. <\/p>\n<p>But most importantly, if Metacritic says bots aren&#8217;t allowed to crawl for publicly-available information, <strong>I&#8217;m going to do it anyway because this sort of prohibition is stupid<\/strong>.<\/p>\n<p>I realize this comes off as uncharacteristically Renegade of me. I&#8217;m usually very Lawful Good about this sort of thing, but there comes a point where I think it&#8217;s important to draw the line.<\/p>\n<h3>It&#8217;s Just Me, and My Robot Friend<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_robot_pepper.jpg' width=100% alt='Look at this friendly little bastard. I&apos;ll bet he&apos;s just one missed update from trying to kill us all.' title='Look at this friendly little bastard. 
I&apos;ll bet he&apos;s just one missed update from trying to kill us all.'\/><\/div><div class='mouseover-alt'>Look at this friendly little bastard. I&apos;ll bet he&apos;s just one missed update from trying to kill us all.<\/div><\/p>\n<p>Let&#8217;s say I&#8217;m walking down the street and I see a sign on a building that says &#8220;KEEP OUT!&#8221;\u00a0<\/p>\n<p>Okay. Cool. I&#8217;m going to stay out. This is someone else&#8217;s property, and they haven&#8217;t given me permission to enter. That&#8217;s fine. I&#8217;ll piss off.<\/p>\n<p>Now let&#8217;s say I&#8217;m walking down the street and I see a great big sign, which is brightly lit and easily visible to anyone walking by. I read the sign, and then at the bottom it says that I&#8217;m not allowed to tell other people about the sign. Or maybe I can tell them about it, but not take a picture of it. Or perhaps they don&#8217;t want me reading it aloud or writing it down.<\/p>\n<p>Some people see this as a perfectly reasonable request, but I can&#8217;t help but see it as an encroachment on <b>my<\/b> freedom. You can&#8217;t put something in my head and then insist I&#8217;m not allowed to tell other people about that thing. Your public-facing, reachable-by-Google page is a giant lit-up sign facing the street, and you have no right to tell me I can&#8217;t use my camera in public, and you certainly don&#8217;t have any right to inhibit my speech by demanding I keep your sign a secret.\u00a0<\/p>\n<p>If the sign is facing the public street, then presumably I&#8217;m allowed to read it. (And if I&#8217;m not, then it&#8217;s your fault for putting your sign there.) But see, I&#8217;m a slow reader. So I&#8217;m going to have my friend here read the sign for me. The fact that my friend is a robot is beside the point. He&#8217;s a friend. 
The point is that he&#8217;s helping me to read and remember what the sign says, and if I have permission then he does too.<\/p>\n<p>The information on Metacritic is visible to all. I&#8217;m just building a robot to look at it for me. It&#8217;s true that I can&#8217;t encroach on your property, but you can&#8217;t tell me what I can and can&#8217;t do with my robot.<\/p>\n<h3>On the OTHER Hand&#8230;<\/h3>\n<p>Some people have a totally different mental model of all of this. To them, going to a website is like going INSIDE someone&#8217;s building. Certainly you have the right to ban photography within your own building. You should be able to prohibit drones. Demanding people not tell others about what they see inside is a bit iffy without an NDA, but I think most people would agree you have rights over me while I&#8217;m in your home that you don&#8217;t have if we&#8217;re standing on the street.<\/p>\n<p>Using this mental model, the &#8220;no bots allowed&#8221; demand makes a lot more sense.\u00a0<\/p>\n<p>The problem is that neither of these mental models is correct, because the internet is very much its own thing. We try to map it to familiar ideas so we can import our existing collection of moral assumptions, norms, and etiquette. That works most of the time, but sometimes the novelty of this system is inescapable.\u00a0\u00a0<\/p>\n<p>Some people take this even further. Remember the whole <a href=\"https:\/\/en.wikipedia.org\/wiki\/Deep_linking#Court_rulings\">controversy over deep linking<\/a>? To one person, linking to an article is like telling someone else where you found the article. To another person, deep linking is somehow plagiarism \/ copyright infringement. I can&#8217;t really understand how anyone can come to the second conclusion, but I&#8217;m willing to bet the analogy \/ mental model they use to understand the internet is very different from mine.<\/p>\n<p>Anyway. 
Maybe you agree and you see a web scraper as just an automated tool for browsing the internet. (I could, after all, visit all these hundreds of pages myself and manually enter their contents into a database.) Maybe you think I&#8217;m a scoundrel and I should keep my robot out if I see a &#8220;No robots allowed&#8221; sign. That&#8217;s fine.\u00a0<\/p>\n<p>In either case, I would agree that none of this is an excuse for making a poorly-behaved bot. If nothing else, I&#8217;m going to make sure my bot is very quiet and doesn&#8217;t make too many demands on the webserver.<\/p>\n<h3>Bots are Not Created Equal<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_robot_mars.jpg' width=100% alt='Stupid robots. They come to this planet and steal all our astronaut jobs!' title='Stupid robots. They come to this planet and steal all our astronaut jobs!'\/><\/div><div class='mouseover-alt'>Stupid robots. They come to this planet and steal all our astronaut jobs!<\/div><\/p>\n<p>Bots are, in theory, less demanding than humans. Let&#8217;s say I arrive at the front page for some huge tech company. My browser downloads the HTML for the page in question. But that page requires a bunch of other resources. Let&#8217;s see, there are a couple of CSS files, a custom font file, twenty gigantic images for the brochure-style slideshow, thirteen tiny little formatting images to put splashes of color and accents all over the place, a collection of JavaScript files, and an invisible IFRAME for external data harvesting by a partner company, which is itself a link to another page that might contain more assets.<\/p>\n<p>All told, we&#8217;ve got dozens of things to download before we have the full contents of the page. Rather than waiting for things to trickle in one at a time like in the old dial-up days, my browser will start downloading several of these things at once. I&#8217;m not sure how many simultaneous downloads is normal. 
The last time I paid attention to this stuff was in the mid-90s, when the typical number of simultaneous downloads was, like, five or something. I&#8217;m not sure how the technology works today, but I&#8217;d be surprised if that number hadn&#8217;t gone up.<\/p>\n<p>The average size of a webpage has gone up quickly over the years. Different sites give different numbers, but everyone seems to agree that it&#8217;s <b>at least<\/b> 2 megabytes to download the average webpage here in 2020. That is, the size of the download to check the front page of your average news site will be larger than the entire install of DOOM in 1993. That&#8217;s your <b>average<\/b> page, and when you&#8217;re talking about a corporate front page with an image slideshow, I&#8217;d be very surprised if it came in under 10MB.<\/p>\n<p>This is obviously MASSIVELY bloated, considering you&#8217;re just here to read some text. But keeping things small and efficient is expensive and time-consuming, and users seem to have grown accustomed to waiting a few seconds for a site to load on their phone. The public doesn&#8217;t care, so nobody&#8217;s willing to spend the money to make this stuff smaller. This also means that every time mobile networks get faster, pages will grow to consume more bandwidth until you&#8217;re back up to 5 seconds of loading time again. This reminds me of the 90s, when the latest version of Windows was guaranteed to eat up all the new RAM you just bought.<\/p>\n<p>My point is that a typical scraper bot doesn&#8217;t give a damn about any of the extra content. It downloads the raw HTML, and it doesn&#8217;t download any of the required CSS, images, scripts, or other nonsense. To the bot, the site is only a hundred or so kilobytes &#8211; almost nothing<span class='snote' title='4'>I just looked, and my site seems to have about 50KB of overhead. That is, a post with no content and no comments would still be a 50KB HTML file. 
That sounds big, but, like the big corporations, I&#8217;m too busy \/ lazy to investigate further.<\/span>.<\/p>\n<p>So bots are harmless, right?<\/p>\n<p>The problem is that while bots don&#8217;t typically download all the bloat, they read millions of times faster than a human. You might click a link every minute or so, but the bot will happily devour a hundred pages a second if you allow it. Like, a hundred 100KB files? That&#8217;s not even a big deal. Your bot could do that on one core while you&#8217;re playing Doom Eternal on the rest.<\/p>\n<p>So the first step to making a well-behaved bot is making a bot that doesn&#8217;t get too greedy. Amazingly, this is one of those rare instances where the lazy thing is the optimal thing.\u00a0<\/p>\n<p>What you&#8217;re supposed to do is request a webpage and have it download in a background thread. Then your main program keeps running. Maybe it even kicks off more threads. Then your program comes back around and checks to see if the downloads are complete. If done properly, your bot can keep many plates spinning at the same time, because multitasking is easy for computers.<\/p>\n<p>But!\u00a0<\/p>\n<p>That&#8217;s a lot of work. It&#8217;s also possible to NOT put the download in a background thread. You can, if you want, start the download in your main thread. If you do that, then your program will sit there effectively locked up until the download completes or fails. If you do it this way, then you&#8217;ll never have more than one active download going at a time. If you put a cooldown timer on it, then you can make sure your bot will never hit the server hard enough for anyone to care.\u00a0<\/p>\n<p>I put a one-second cooldown on the bot, meaning my bot will never load more than one page a second. 
In practice, it&#8217;s more like a page every other second because there&#8217;s a little overhead to starting each download.<\/p>\n<p>I aim the bot at my own site just to make sure I didn&#8217;t do something really stupid. Once it successfully downloads a few things, and I confirm it&#8217;s working properly, I aim it at Metacritic and begin harvesting data.\u00a0<\/p>\n<p>I haven&#8217;t read the robots.txt file from Metacritic so I don&#8217;t know if my bot is welcome here, but the bot is using a <b>ridiculously<\/b> small amount of resources.\u00a0<\/p>\n<p>Next time I&#8217;ll talk about parsing these pages and combining their information with stuff from other sites.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So now I&#8217;m done messing around and being silly. It&#8217;s time to actually scrape the web for stuff. There are three different sites I&#8217;m interested in: Metacritic, for critic scores. Wikipedia, for credits regarding director, writer, producer, composer, etc. 
This information is spotty and I can&#8217;t think of how it might be useful right now, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[66],"tags":[],"class_list":["post-49866","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49866","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=49866"}],"version-history":[{"count":18,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49866\/revisions"}],"predecessor-version":[{"id":49884,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49866\/revisions\/49884"}],"wp:attachment":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=49866"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=49866"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=49866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}