Scraping Part 3: A Well-Behaved Bot

By Shamus Posted Thursday Apr 30, 2020

Filed under: Programming 49 comments

So now I’m done messing around and being silly. It’s time to actually scrape the web for stuff. There are three different sites I’m interested in:

  1. Metacritic, for critic scores.
  2. Wikipedia, for credits regarding director, writer, producer, composer, etc. This information is spotty and I can’t think of how it might be useful right now, but I’m going to include it as part of the exercise. Also, Wikipedia often notes what franchise a game is from, which might be handy if I want to do a search that includes “all Resident Evil games” or somesuch. 
  3. Steam, for PC -specific info like DRM, controller support, multiplayer, etc.

There’s also a bit of information that can come from any of these sources: The url for the game’s official website might be handy, and we also need to get the publisher, developer, and release date from one of these places.

Of the three sites, it seems like Metacritic is the best one to start with. It has games listed by platform, which is necessary in a structural sense. For the purposes of our database, it’s possible for the same game to have vastly different information depending on platform. For example, maybe a game is released on the PlayStation 3 in 2010 by Beloved Developer, but then a year later it gets ported to the PC by Shovelware Games. Metacritic is the only place where we can get this information reliably. Steam obviously isn’t going to have non-PC data, and Wikipedia entries aren’t guaranteed to have all the per-platform data in an easy-to-capture location.[1]

Metacritic even has a handy index page that you can go through:

www.metacritic.com/browse/games/score/metascore/all/pc/filtered?view=detailed&sort=desc&page=0

You can see that I can choose a platform by changing the bit where it says /pc/ and I can choose a page by changing the number at the very end. This particular index will only list games with a rating of 30 or above, which will filter out a bunch of dross that we’re not interested in right now.[2]
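
To make that concrete, here’s roughly how that index URL can be parameterized. This is a tiny Python sketch (not necessarily the language the real bot uses); the template is just the URL above with the platform slug and page number swapped in:

INDEX_TEMPLATE = (
    "https://www.metacritic.com/browse/games/score/metascore/all/"
    "{platform}/filtered?view=detailed&sort=desc&page={page}"
)

def index_url(platform: str, page: int) -> str:
    # Fill in the two parts of the URL that change: platform and page number.
    return INDEX_TEMPLATE.format(platform=platform, page=page)

print(index_url("pc", 0))  # the PC index, page 0; swap "pc" for another platform slug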

So the plan is:

Start at page 0. This page will return a list of 50 games. On this index page, all we get is the title, publication date, and a link to the full Metacritic page for the game. We grab the names and the URLs, then move on to the next page. If we find a page with no games, then we’ve run past the end of the list and it’s time to stop. 

Once we’re done, we have the base info for these games: Title, platform[3], and release date. The latter is important to avoid confusion over same-name sequels like Tomb Raider, Doom, Sim City, etc. So to uniquely identify a game in our database we’ll need all three pieces of info. 

Once we read the whole index, we can go back and load the Metacritic page for each game and get the more detailed info: Developer, publisher, critic score, user score.

That’s the plan, anyway. Now all we need to do is start scraping.
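
Here’s a rough sketch of that index-walking loop in Python (again, just a sketch, not necessarily the bot’s actual language). parse_index_entries is a stand-in for the HTML parsing covered in a later post, and the field names it returns (title, release_date, url) are assumptions for illustration, not Metacritic’s real markup:

import time
import urllib.request

def fetch(url: str) -> str:
    # Plain blocking download: ask for the page and wait until it arrives.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_index_entries(html: str) -> list:
    # Stand-in for the parsing step (the subject of a later post).
    # Should return a list of dicts with "title", "release_date", and "url".
    raise NotImplementedError

def scrape_index(platform: str = "pc") -> dict:
    games = {}
    page = 0
    while True:
        url = ("https://www.metacritic.com/browse/games/score/metascore/all/"
               f"{platform}/filtered?view=detailed&sort=desc&page={page}")
        entries = parse_index_entries(fetch(url))
        if not entries:
            break  # an empty page means we've run past the end of the list
        for e in entries:
            # Title alone isn't unique (Doom, Tomb Raider...), so key each
            # game on title + platform + release date.
            games[(e["title"], platform, e["release_date"])] = e["url"]
        page += 1
        time.sleep(1)  # don't hammer the server between index pages
    return games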

Ask for Permission, or Forgiveness?

Unless you're making a Roomba competitor, you don't want to make a robot that sucks.

The thing is, it’s very important to make sure your bot is well-behaved. If you’re careless, it’s possible to – completely by accident – launch a denial-of-service attack on a website by simply building an ill-behaved bot. 

It’s not that my humble residential connection is any threat to the mighty Metacritic, but that’s no reason to be careless. 

Now, technically you’re supposed to have your bot read the robots.txt file. That’s a plain text file that tells robots how to behave. It tells the bot how often it’s allowed to make requests, and it tells the bot where it is and isn’t allowed to go.

Normally, I’d be a Good Citizen and follow the rules. My problem is that a full implementation of robots.txt compliance can be fairly complicated:

1) Read and store all of the directories and files you’re not allowed to scrape.
2) Every time you need something from the site, compare the prospective URL to the deny list to make sure you’re not inside of any of the forbidden zones or grabbing a forbidden type of file.
3) Create fallback behavior for when you can’t get to something you need.
4) Set up a test on your own webserver to prove that the system works as intended, or else all of the above was a waste of time.

It’s not hard, but it would be time-consuming and it would end up making the project a lot bigger. In a practical sense, it would be better to MANUALLY view the robots.txt file, see what it says, and then scrap the entire project if it’s too restrictive.
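
(For what it’s worth, the allow/deny lookup itself is the cheap part if your language has a robots.txt library. Here’s a minimal Python sketch using the standard library’s robotparser; steps 3 and 4 above, the fallback behavior and the testing, are the part that doesn’t come for free.)

from urllib import robotparser

# Download and parse the site's robots.txt once, up front.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.metacritic.com/robots.txt")
rp.read()

def allowed(url: str, agent: str = "*") -> bool:
    # True if a bot identifying itself as `agent` may fetch this URL.
    return rp.can_fetch(agent, url)

print(allowed("https://www.metacritic.com/browse/games/score/metascore/all/pc/filtered?page=0"))
print(allowed("https://www.metacritic.com/search"))  # disallowed, per the file a commenter pastes below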

Making a well-behaved bot in this case would mean writing tons of code that, if it ever got used, would mean the entire project was pointless and I shouldn’t have bothered with any of it.

But most importantly, if Metacritic says bots aren’t allowed to crawl for publicly-available information, I’m going to do it anyway because this sort of prohibition is stupid.

I realize this comes off as uncharacteristically Renegade of me. I’m usually very Lawful Good about this sort of thing, but there comes a point where I think it’s important to draw the line.

It’s Just Me, and My Robot Friend

Look at this friendly little bastard. I'll bet he's just one missed update from trying to kill us all.

Let’s say I’m walking down the street and I see a sign on a building that says “KEEP OUT!” 

Okay. Cool. I’m going to stay out. This is someone else’s property, and they haven’t given me permission to enter. That’s fine. I’ll piss off.

Now let’s say I’m walking down the street and I see a great big sign, which is brightly lit and easily visible to anyone walking by. I read the sign, and then at the bottom it says that I’m not allowed to tell other people about the sign. Or maybe I can tell them about it, but not take a picture of it. Or perhaps they don’t want me reading it aloud or writing it down.

Some people see this as a perfectly reasonable request, but I can’t help but see it as an encroachment on my freedom. You can’t put something in my head and then insist I’m not allowed to tell other people about that thing. Your public-facing, reachable-by-Google page is a giant lit up sign facing the street, and you have no right to tell me I can’t use my camera in public and you certainly don’t have any right to inhibit my speech by demanding I keep your sign a secret. 

If the sign is facing the public street, then presumably I’m allowed to read it. (And if I’m not, then it’s your fault for putting your sign there.) But see, I’m a slow reader. So I’m going to have my friend here read the sign for me. The fact that my friend is a robot is beside the point. He’s a friend. The point is that he’s helping me to read and remember what the sign says, and if I have permission then he does too.

The information on Metacritic is visible to all. I’m just building a robot to look at it for me. It’s true that I can’t encroach on your property, but you can’t tell me what I can and can’t do with my robot.

On the OTHER Hand…

Some people have a totally different mental model of all of this. To them, going to a website is like going INSIDE someone’s building. Certainly you have the right to ban photography within your own building. You should be able to prohibit drones. Demanding people not tell others about what they see inside is a bit iffy without an NDA, but I think most people would agree you have rights over me while I’m in your home that you don’t have if we’re standing on the street.

Using this mental model, the “no bots allowed” demand makes a lot more sense. 

The problem is that neither of these mental models is correct, because the internet is very much its own thing. We try to map it to familiar ideas so we can import our existing collection of moral assumptions, norms, and etiquette. That works most of the time, but sometimes the novelty of this system is inescapable.

Some people take this even further. Remember the whole controversy over deep linking? To one person, linking to an article is like telling someone else where you found the article. To another person, deep linking is somehow plagiarism / copyright infringement. I can’t really understand how anyone can come to the second conclusion, but I’m willing to bet the analogy / mental model they use to understand the internet is very different from mine.

Anyway. Maybe you agree and you see a web scraper as just an automated tool for browsing the internet. (I could, after all, visit all these hundreds of pages myself and manually enter their contents into a database.) Maybe you think I’m a scoundrel and I should keep my robot out if I see a “No robots allowed” sign. That’s fine. 

In either case, I would agree that none of this is an excuse for making a poorly-behaved bot. If nothing else, I’m going to make sure my bot is very quiet and doesn’t make too many demands on the webserver.

Bots are Not Created Equal

Stupid robots. They come to this planet and steal all our astronaut jobs!

Bots are, in theory, less demanding than humans. Let’s say I arrive at the front page for some huge tech company. My browser downloads the HTML for the page in question. But that page requires a bunch of other resources. Let’s see, there are a couple of CSS files, a custom font file, twenty gigantic images for the brochure-style slideshow, thirteen tiny little formatting images to put splashes of color and accents all over the place, a collection of Javascript files, an invisible IFRAME for external data harvesting by a partner company, which is itself a link to another page that might contain more assets.

All told, we’ve got dozens of things to download before we have the full contents of the page. Rather than waiting for things to trickle in one at a time like in the old dial-up days, my browser will start downloading several of these things at once. I’m not sure how many simultaneous downloads is normal. The last time I paid attention to this stuff was in the mid 90s, when the typical number of simultaneous downloads was, like, five or something. I’m not sure how the technology works today, but I’d be surprised if that number hadn’t gone up.

The average size of a webpage has gone up quickly over the years. Different sites give different numbers, but everyone seems to agree that it’s at least 2 megabytes to download the average webpage here in 2020. That is, the size of the download to check the front page of your average news site will be larger than the entire install of DOOM in 1993. That’s your average page, and when you’re talking about a corporate front page with an image slideshow, I’d be very surprised if it came in under 10MB.

This is obviously MASSIVELY bloated, considering you’re just here to read some text. But keeping things small and efficient is expensive and time consuming and users seem to have grown accustomed to waiting a few seconds for a site to load on their phone. The public doesn’t care, so nobody’s willing to spend the money to make this stuff smaller. This also means that every time mobile networks get faster, pages will grow to consume more bandwidth until you’re back up to 5 seconds of loading time again. This reminds me of the 90s, when the latest version of Windows was guaranteed to eat up all the new RAM you just bought.

My point is that a typical scraper bot doesn’t give a damn about any of the extra content. It downloads the raw HTML, and it doesn’t download any of the required CSS, images, scripts, or other nonsense. To the bot, the site is only a hundred or so kilobytes – almost nothing.[4]

So bots are harmless, right?

The problem is that while bots don’t typically download all the bloat, they read millions of times faster than a human. You might click a link every minute or so, but the bot will happily devour a hundred pages a second if you allow it. Like, a hundred 100KB files? That’s not even a big deal. Your bot could do that on one core while you’re playing Doom Eternal on the rest.

So the first step to making a well-behaved bot is making a bot that doesn’t get too greedy. Amazingly, this is one of those rare instances where the lazy thing is the optimal thing. 

What you’re supposed to do is request a webpage and have it download in a background thread. Then your main program keeps running. Maybe it even kicks off more threads. Then your program comes back around and checks to see if the downloads are complete. If done properly, your bot can keep many plates spinning at the same time, because multitasking is easy for computers.

But! 

That’s a lot of work. It’s also possible to NOT put the download in a background thread. You can, if you want, start the download in your main thread. If you do that, then your program will sit there effectively locked up until the download completes or fails. If you do it this way, then you’ll never have more than one active download going at a time. If you put a cooldown timer on it, then you can make sure your bot will never hit the server hard enough for anyone to care. 

I put a one-second cooldown on the bot, meaning my bot will never load more than one page a second. In practice, it’s more like a page every other second because there’s a little overhead to starting each download.
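
For the curious, the whole rate-limiting scheme fits in a few lines. A minimal Python sketch of the blocking download plus the cooldown timer described above:

import time
import urllib.request

COOLDOWN_SECONDS = 1.0  # never request more than one page per second
_last_request = 0.0

def polite_fetch(url: str) -> str:
    global _last_request
    # Sleep off whatever is left of the cooldown since the previous request.
    wait = COOLDOWN_SECONDS - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    # The blocking call: the program sits here until the download completes or fails.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")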

I aim the bot at my own site just to make sure I didn’t do something really stupid. Once it successfully downloads a few things, and I confirm it’s working properly, I aim it at Metacritic and begin harvesting data. 

I haven’t read the robots.txt file from Metacritic so I don’t know if my bot is welcome here, but the bot is using a ridiculously small amount of resources. 

Next time I’ll talk about parsing these pages and combining their information with stuff from other sites.

 

Footnotes:

[1] It might be in the info box on the right, or it might be buried in the article text (good luck capturing THAT) or it might not be listed at all.

[2] There’s another index that lists all games alphabetically with no filter. I might switch to that someday if I’m curious about all of the bottom-of-the-barrel stuff.

[3] Just PC-only for this first run.

[4] I just looked, and my site seems to have about 50KB of overhead. That is, a post with no content and no comments would still be a 50KB HTML file. That sounds big, but like the big corporations I’m too busy / lazy to investigate further.



49 thoughts on “Scraping Part 3: A Well-Behaved Bot”

  1. Lee says:

    So, did you actually go read Metacritic’s robots.txt before deciding to ignore it, or no? It’s only 10 lines, and doesn’t seem to actually deny anything you’ve talked about so far. I’m just wondering.

    1. Duoae says:

      But Shamus is not a robot… why and how would he even read a text file created solely for robots?! He’d have to have some sort of C-3PO-esque translation droid…

      Waitaminute. You read it…

      *looks suspiciously at Lee*

      1. Paul Spooner says:

        Yeah, you’ll want humans.txt instead. Most sites don’t have one though.

        1. DerJungerLudendorff says:

          Isn’t that just the front page?

    2. Alecw says:

      Why?
      He has decided upon what he’s going to do so he’s probably better off not knowing.

      He won’t take any private info and won’t break the site, and if the bots file goes further he will ignore it anyway…

  2. gresman says:

    Have you thought about directly accessing Steam’s API instead of scraping their data?
    If I am not mistaken Steamdb does that.

    I like data aggregation websites like steamdb or steamcharts.
    Great now I miss SteamSpy again.

  3. methermeneus says:

    I’d’ve at least checked, but it does look like you’re being a good citizen after all. Here’s Metacritic’s robots.txt reproduced in full:

    User-agent: *
    Disallow: /search
    Disallow: /signup
    Disallow: /login
    Disallow: /user
    Disallow: /jl/
    # Google is crawling the Ad defineSlot() parameters. Exclude them so we don’t get a bunch of 404s.
    Disallow: /8264/
    Disallow: /7336/
    Sitemap: https://www.metacritic.com/siteindex.xml

    I’ll never understand why deep-linking is supposed to be a bad thing. I get that a lot of websites would prefer you link to their home page (if I go home -> search -> three-pages-that-sound-about-right, I’m seeing five times the ads I would if I just went straight to the source page), but deep links are generally telling you how to get something, be it a specific item for sale or information from a citation, in which case it’s best to be specific. A citation especially provides volume, issue, and page for print media, so providing a URL (often to a webpage that’s several pages long) is equivalent to, or even less precise than, that level of specificity. If you want to force people to go through your home page or login page instead, it’s not difficult to redirect all outside traffic to a specific landing page yourself, circumventing anyone who tries deep-linking you anyway. Heck, it wasn’t hard back in ’96 when the first major deep-linking case went to court!
    </rant>

    1. Dreadjaws says:

      I don’t get it either. I assume it has something to do with ads, and how to maximize their presence. If a user clicks on a deep link, they’re automatically transferred to where they want to be, and might see a few ads on that page, but if they’re forced to go to the homepage and then browse the website looking for what they want then they will see more ads.

      I mean, it’s clear that many websites prioritize ad viewing over user experience, usually to the point where it becomes counterproductive and people either turn adblocking on or even avoid the website altogether, so I wouldn’t put it past them that this kind of thing is at least partially to blame.

      If you want to force people to go through your home page or login page instead, it’s not difficult to redirect all outside traffic to a specific landing page yourself, circumventing anyone who tries deep-linking you anyway.

      As the whole deal with Shamus and the Bethesda store showed the other day, it’s evident that many companies just outsource their website building and then promptly forget about it. They might get into legal action if they perceive some external agent is affecting them negatively (as in, in these examples, is the case with deep linking), but they’re not going to move a finger to try to fix things from their side, or to even try to find out if they can.

      1. Paul Spooner says:

        “many companies just outsource their website building” made me think of a corporate new hire campus tour.
        “And over here” tour guide gestures to gently smoldering crater “is our website building!”

    2. Erik says:

      The technical reason against deep linking is that your link now depends on the site storage organization never changing. If a site changes CMSs, all deep links usually die. Top-level links don’t.

      I don’t personally find this compelling, but that’s the rationale.

      1. Melted says:

        I don’t find it compelling, either. Sure, links to specific pages can turn into dead links for the reason you mention, but what’s the alternative supposed to be? Just telling someone, “Oh, you should check out this article on voxels, it’s on shamusyoung.com… you know, somewhere. Just do a search or something.”

        And of course, any description of how to navigate the website to get to the page in question is even more vulnerable to becoming obsolete.

        1. methermeneus says:

          That’s an argument for why it’s not necessarily best practice, and I somewhat agree with it. (Just today I was looking up how to do something, and a forum post from 2014 had a link to the manual, and I’m not going to pass up an opportunity to rtfm, but it was a dead link that redirected to the homepage that I couldn’t figure out how to navigate to the manual from.) Unfortunately, it doesn’t have anything to do with the controversy, which is that some entities consider deep-linking of their sites by outside entities to be some form of infringement, which is probably related to ad revenue or maximizing their presence like Dreadjaw said.

          1. Decius says:

            If you can’t get to the manual from the homepage now, there’s little reason to think that you could do so then. Deeplinking was always required.

      2. Echo Tango says:

        Those links won’t expire on the Internet Archive, though. ;)

    3. Abnaxis says:

      Thinking about it, do any of these count as “deeplinking”…?

      – Deeplinking to another site, but using images/href text to give the impression the linked content is your own
      – Creating an iframe that displays other’s content within a header/frame of your own design
      – Scraping a site, and creating your own front page where users can search the database you created to navigate to the other site.
      – Heck, even creating your own fake phishing page that gathers login credentials before forwarding users to the legitimate site is basically deeplinking…

      This is all just sort of 5-minutes-off-the-top-of-my-head, but I’m pretty sure with enough effort someone who wants to make an easy buck off of a popular page without bothering with scruples will have a field day if you have blanket clearance to deeplink however you want.

      1. Zak McKracken says:

        I’d say that all of these go significantly beyond just plain deep-linking.

    4. Zak McKracken says:

      The only place where I’ve seen an argument against deep-linking which I understood was some site which asked people not to do it because every time they reorganized the site, those links would change, and anyone using the link would end up at a 404 page — this was stated in the form of friendly advice to users, including reference to a little button at the bottom of every article which provided a unique, permanent URL for every piece of content and a recommendation to use that instead.

      This is the angle I understand. If someone wants to send someone else a link to some news article or product information page, or post it somewhere, I can’t see why the site operator could possibly object to it, or why they should have any legal leverage to prevent it. That is precisely what URLs are for: to store and refer to addresses where online content can be found.

  4. Dreadjaws says:

    Remember the whole controversy over deep linking?

    Wow, I had no idea that was ever a thing. It’s not really surprising, though. Every day you see people upset about the sort of things you could have never imagined could ever bother anyone. Sometimes it’s justified (a personal experience has put them on the defensive about a particular subject), but sometimes it’s entirely preposterous (i.e. their negative behavior is born out of ignorance and/or stubbornness).

    Like, for instance, in some articles talking about game mods allowing you to play as a certain character, adding a new mode or a whole new fan-made episode, once in a while you’ll have some people decry how much they don’t like mods (one particular comment that stuck to mind was “This is why I hate mods. Leave games alone!”), even though they’re entirely unaffected by them.

    I mean, yes, people hate things that don’t affect them all the time. People who engage in sexual behavior that’s extraneous to others are often the target of scorn, for instance. But while not justifiable, this can at least be usually explained by upbringing. But how the hell do you explain hating modding? Am I at some point going to run into some guy holding a sign that says “God hates mods, Matthew 16:26”? I don’t get it. I try to make my brain work really hard, and I still can’t comprehend it. Like, are all of these people under the silly impression that mods are mandatory or something like that? It boggles my mind.

    1. Amanda says:

      I think it stems from being upset that people won’t experience the original vision the artist(s) intended. And while that reasoning is largely understandable, it still seems a bit silly to me to get that upset over people enjoying art wrong, but I suppose this isn’t the only case where that happens.

      1. Echo Tango says:

        We’re talking about sold copies of an artistic work. Game devs could put their works in museums, but if customers want to play the game with the main character wearing hot pink spandex and a clown wig, that’s none of the devs’ business. If someone’s going to ban mods, will they also try to ban people listening to podcasts while the game plays? (And how much malware would you need, to enforce that?)

        1. Chad Miller says:

          I don’t think even the most extreme opinions mentioned in this thread are meant to be people calling for anyone to be forced to stop using mods. If a restaurant critic, annoyed that people ruin their meals by drowning them in cheap condiments said “This is why I hate ketchup. Leave steak alone!” I don’t think anyone would take that to mean they were advocating banning ketchup from steakhouses so much as just looking down on the common person’s tastes.

          1. Echo Tango says:

            In my experience, these comments are usually phrased as actually advocating for mods to not exist, so the devs have more time to focus on the posters’ desires.

          2. Philadelphus says:

            See, to me, as a lover of ketchup, I’m improving it, not ruining it. Everything is improved with a little ketchup.

            1. Syal says:

              I keep two kinds of ketchup. The first ketchup goes on the food, and the second ketchup goes on the first ketchup.

    2. Syal says:

      I ignore mods but I hate hearing about mods, especially as a solution to a game’s problems. “Just hack the game” is not a reasonable solution.

      Like, if a game review is talking about what mods allow you to do, I’m going to hold it against the review, because mods are not the game. Might as well say the game is good because the developer is funny in interviews. Has nothing to do with the product.

      1. Echo Tango says:

        Mods are like, 90% of Rimworld… ^^;

        (and Tynan keeps breaking compatibility…)

      2. Richard says:

        It depends.

        If the game review is basically “The game is great, and there’s some fun mods too” then yay!

        If it’s “The game is terrible, but here’s some mods that make it fun” then boo, boo and thrice boo.

        1. Chad Miller says:

          Yeah, I liken that to saying “That movie does not have plot holes! My fan fiction explains it!”

      3. Daniil says:

        Mods have plenty to do with the kind of experience you can potentially have with a game. For me it’s a pretty big point in favour of some comparatively older games (Civilization 4, say) that they have a lot of user content. I wouldn’t say they have nothing to do with the product either. Some games are very moddable by design, some are really not, and that much does depend on the developer, though whether their game gets a strong modding community or not is something they have less control over. What I will agree with is that if a game has glaring problems but a mod fixes it, that isn’t much of a point in favour of the developers (but perhaps worth mentioning anyway, for those who liked the idea of the game, didn’t like the execution, but may be pleased to know that there is a workaround – perhaps not in the game review, though…).

        1. Decius says:

          I think a common thing to happen is that a game has a feature that is controversial, and there’s a mod that takes a different position on the controversy.

      4. Moridin says:

        I don’t really get this attitude. Ignoring the mods when making a purchase decision is fine, but if you have a problem with a game that you already own, why would you ignore the solutions to that problem?

        And even if you don’t have problems with the game as such, mods can in many cases make a game you like even better, or increase the replayability. Fallout New Vegas is a good game in vacuum, but once you add mods to the equation, replaying it becomes a lot more appealing.

        1. Syal says:

          why would you ignore the solutions to that problem?

          If it bothers me enough to leave the game and find a solution, it bothers me enough to leave the game and not come back. I’m not going to forget I had to hack around the game, and am going to be bothered by how broken the base system is the whole time.

          That’s assuming I can find a good mod. Mod communities are fan communities, and fan communities are mind-bogglingly dumb. Fans will neglect important information like “this mod needs this other mod” or “these two mods can’t be used together”, or will just lie about what a mod will do. Wading through noisy nonsense is a big part of my real life and I’m not going to do it as a hobby. (…I mean, I do, but not as part of something, it’s pure cathartic “look how bad this is, glad I’m not dealing with it”.)

          1. tmtvl says:

            Yeah, if Steam had the refund policy when Skyrim came out I would’ve refunded it instead of trying to make a bunch of mods work together to make the game bearable.

    3. Echo Tango says:

      Linking people to pages on another website seems pretty fair, since the person gets transferred to the other website, which adds legitimate traffic to that website. The issue comes from linking images or other media[1], to show on your own web-page. For example, if the images Shamus is showing[2] between sections of text weren’t hosted on his own website, that would be an issue. People viewing the images causes resource-load on the hoster’s servers, without the benefit[3] of people viewing the original websites, pages, or advertisements, where those images came from.

      [1] Because of how most videos are streamed, this isn’t usually a problem. On Shamus’ own website for example, he embeds YouTube videos. The videos have an obvious YouTube logo, which is also a button to open the video directly on the YouTube website. No confusion, and actually encouraged by YouTube, since I’m pretty sure they’re the ones who provide the code-snippet, to embed the video-player / video.

      [2] Ignore copyright here. Or assume the images Shamus is hosting are licensed appropriately. It’s a tangential argument.

      [3] It also causes confusion in the viewers, because unless they right-click on the images to see the URL, they think everything is from Shamus’ website. In the USA, that appears to not matter; the ruling against Perfect 10 explained that copyright law doesn’t protect against customer confusion, unlike patent law.

      1. Decius says:

        Hotlinking images is explicitly NOT copyright infringement. It’s bad etiquette, and it might be trademark misuse.

    4. pseudonym says:

      Some people are under the impression that, when somebody else makes different choices, it is a direct attack on their choices.

      I see it on computer hardware sites all the time. When somebody on a hardware forum states he bought an Intel, he will get slammed by people proclaiming that AMD is much better at pricepoint x. Then people will react that Intel has higher framerates in some game titles. The more rabid people are in the debate, the more accurately you can predict the PC in their user profile.

      1. Echo Tango says:

        But there’s only so many market dollars to go around. If people go around buying Brand X, then my precious Brand Y might not have enough research dollars to get me new features next year!

  5. SupahEwok says:

    MobyGames is probably a better source for metadata than Wikipedia, if you ever pick this up again in the future.

  6. J Greely says:

    Long ago and far away, my friend Jeff (no, the other one) was responsible for a popular monitoring tool. In the default configuration file that shipped with it, he included a sample line that tested a real web server under his control. Years later, that web server was under my control, and was still receiving thousands of connections per hour from around the world. In order to make the data in my logs useful, I had to add it to my blackhole list, right alongside all the script-kiddie nonsense probing for unsecured management pages and well-known security holes. It was over ten years before the requests for this URL finally stopped.

    This was actually one of the least-annoying automated requests that hit my site. The typical bots were high-speed spiders that overwhelmed my tiny little server every day, so my blackhole list was literal: any IP address that requested a “bad” URL was immediately added to a firewall block rule, and the list was only emptied once a week.

    Where this might become relevant to a Good Robot is that I embedded a bad URL into every page. Users would never see it, and could never click on it. But scrapers that ignored the entries in robots.txt would gleefully attempt to retrieve /banmyipaddress/index.html, and discover that they could no longer see my server at all.

    A curious friend once idly viewed the source of my blog, noticed the odd URL, and thought, “gosh, I wonder what that does”. A few minutes later, I received his email begging to be allowed back in. :-)

    -j

  7. Echo Tango says:

    But Shamus, if your bot only downloads the text on pages, how can Metacritic serve you ads?

  8. Codesections says:

    The sort of web scraping you’re doing is also legal under current US law, incidentally: https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn. So courts also agree with your mental model of how this works, for whatever that’s worth.

  9. kdansky says:

    The idea of robots.txt was to make it possible for a site owner to give recommendations to make the bots’ lives easier. Restrict video content because the bot can’t parse it anyway. Mark dynamically generated pages as pointless, as the bot won’t find them very interesting. Tell it where the juicy information is, so they don’t have to crawl everything.

    Turns out that it’s a great idea for engineers among one another, but soon the lawyers show up and try to abuse the feature to build restrictions, just like those utterly inane email footers that blab for two pages about how unintended readers are not allowed to read them. That would never fly in court anyway: like the flashing sign, if I can read it without signing a contract about my use of it, you cannot prohibit me from talking about it.

    And that leads us to the now, where we are best off ignoring the robots.txt unless we are so powerful everybody wants to play nice with us (nobody wants to block google from indexing their page, it would be suicide), and where we try to spoof the browser agent because otherwise some zealous admin will ban you sight unseen. For example the RPG.net forums will IP-ban you for sending “curl” as your user agent. Which means my crawler claimed to be Firefox to download a single thread I wanted to read offline.

  10. Chad Miller says:

    Re: mental models of property lines on the Internet: Disney recently tried to claim that anyone using a specific Twitter hashtag agrees to their Terms of Service: https://twitter.com/disneyplus/status/1254772307941191686

    1. Echo Tango says:

      I was going to type some onomatopoeia for laughing uproariously, but it would have taken up like, the whole page. ^^;

  11. David says:

    I try to obey the Retry-After header, if a site sends it. It’s not super common, but it’s such an explicit signal that I can’t help but go along with it… particularly if the site does a full 429 (Too Many Requests) response when you exceed it, since in that case it’s them telling you how to not get blocked.

    1. Rohan says:

      Thanks for reminding me of HTTP Cats. 429 is a favourite.

      https://http.cat/429

  12. pseudonym says:

    It’s not hard, but it would be time-consuming and it would end up making the project a lot bigger.

    If you build it yourself. Yes.

    But what if you could check every link for robots.txt compliance and simply skip it if it does not comply? That would be quite trivial. The only hard part would be the robots.txt parsing and checking itself, but there are libraries for that: https://stackoverflow.com/a/633539.

    Again, the most work is finding out what is already done. Like you said in the first entry of this series ;-).

  13. MaxEd says:

    I once frequented a dating site. As it always is with dating sites, it was full of people who think they just need to put up a few photos and that’s all. I was not at all interested in people like that: no matter how attractive the woman in the photo is, I’m not going to write her a message unless I have some idea about her hobbies, tastes, anything to make the initial message more interesting and detailed than “hi”. I don’t fall in love with photos. Unfortunately, that site (and all other sites in existence) was geared toward “photo-oriented” people and provided no way to filter out those profiles that have no additional information. So I had to take matters into my own hands.

    Unfortunately, dating sites are pretty well-protected against scraping, and for a good reason: it’s a common practice for a newly opening dating site to scrape, buy or steal profiles from older sites, so as not to be seen as unpopulated. Sure, those profiles will never respond, since there are no real users running them (or, if the site is a total scam, a bot will respond), but at least the new real users coming to the sign-up page won’t be seeing an empty list of potential partners (which is a show-stopper for a dating site, as you may guess; actually, I wonder how you start one legitimately: how do you get the first users to join, when there is no one to talk to?).

    I tried my hand at scraping anyway, but was forced to turn back, because I was unable to circumvent the scraping protections. So instead I wrote a Firefox add-on. I logged in normally as myself, went to a search page, typed in my usual age/sex preferences, scrolled it down to load more profiles, and then pressed a button in my toolbar. My add-on would then open all profiles from search in separate tabs, wait for them to load, look for portions of the profile I wanted to be filled (there were a few possible sections), check the list of interests against my own, and close all tabs where nothing was found. It wasn’t as smooth an experience as a scraper would provide: my browser would appear to hang while the script worked, because it also paused for a random period of time between opening profiles, to avoid overloading the site and being detected as a bot, but it still worked: in the end, I got a list of really interesting profiles.

    I only managed to run the script a few times, and got one date from it, before re-connecting with a girl I dated before on another site, and finally hooking up with her for good, but I still think it was a good, simple script that added a feature the site was sorely missing (up to the point of being nearly unusable for me, since it also lacked any useful way to match users by interests, which was always much more important to me than looks).

  14. Gordon says:

    IMDB also has useful credit info and might be more complete than wiki, as they aim for completeness while wiki aims for notability.

    e.g.
    https://www.imdb.com/title/tt2321297/fullcredits?ref_=tt_cl_sm#cast
