Scraping Part 1: Easy Mode

By Shamus Posted Thursday Apr 23, 2020

Filed under: Programming 90 comments

You might remember a couple of months ago I posted a bunch of charts of video game data. The obvious question that went unanswered in those posts[1] was, “Where did this data come from?” So let’s talk about that.

Actually, before we talk about that I should make it clear that this is a programming project. I should note that this project pre-dates that crazy stuff I was doing with BSP loading a couple of weeks ago, but I’m posting them in the opposite order. For some reason.

Maybe reading yet another programming project sounds fun, but this isn’t a game-focused project with cool screenshots to show off. This is pretty dry and you’ve already seen the end result. I’d talk you out of reading more, but we both know you’re going to read this stupid thing no matter what I say. So let’s just get this over with.

For years, I’ve been wondering about the stuff we’re always discussing / arguing about in gaming culture. The division between fans and critics. The difference between platforms. The changes to the industry over time.

The problem is that we never have any numbers to work with. We just sloppily take our anecdata[2] and project it onto the industry as a whole. Just about everyone realizes this isn’t a scientific way of going about things, but we don’t really have any alternatives. It’s either guessing based on personal experience, or we chow down on the PR slop the various publishers feed us[3].

Do particular DRM schemes impact audience reaction or sales? Do console generations impact PC sales? Do single-player games with tacked-on multiplayer actually sell / score higher than games without those features? Does review-bombing impact sales, or is the practice just a harmless but cathartic way of expressing outrage? It feels like critics and consumers have been drifting apart in terms of what they say about games, but is that perceived gap reflected in the review scores?

I suppose at the root of it was a general curiosity about the decision-making happening at the big publishers. We can’t see what game budgets are, we don’t have access to reliable sales figures, and without those numbers we have no way of even guessing about how much particular games are making or losing. Sites like VGCharts and SteamSpy give us some estimates to play around with, but for the most part we’re stuck in the dark.

However, it seemed like there was some data out there. We can’t answer all our questions, but maybe we can fill in a few more blanks. Wikipedia has a lot of information on game features and developers. Steam has information on DRM and system requirements. And of course Metacritic has the key information regarding critical reception.

So the obvious question is: If there’s a bunch of data available to the public, then why don’t we just round it up? (Preferably without having to do it by hand.)

How Do You Do That?

The process of having a program load web pages and pull out desired information is called Web Scraping. I’ve never written a web scraper before, but I’d always wanted to try it out. It just seems like a fun idea to have a program surf the web for you and bring back a great big haul of information. Maybe, deep down, this project was more about my desire to write a web scraper than to study the resulting data. But this project seemed like a fun way to satisfy both of these curiosities.

As I discovered, the process of building a web scraper is pretty easy. For a project at this small scale, I’d even say it goes from “easy” to “trivial”. All told, this whole project was much less than a week of work. If you handed this project off to someone who knew what they were doing, they could probably finish in a couple of days.

In the old days, I would have done this with C++. But now I’ve spent some time with Unity and learned just enough C# to be dangerous. Since that project I’ve wanted to play around with C# apart from Unity so I could get a feel for what C# is “really” like. The environment that comes with Unity has a ton of game-specific features, and it’s not always clear to a newbie which things you’re using are “standard C#” and which bits come with Unity[4]. In Unity projects, the engine controls the loop. Tens of thousands of lines of invisible[5] code might be run before Unity gets around to reaching the bits of the program you’ve written. In vanilla C#, program execution begins and ends with your code[6], and I wanted to get a feel for how that worked.

The Hardest Thing is Realizing how Easy it is.

The biggest thing that held me back was my learned habits. I’m used to the C++ world where you need to do everything by hand or spend time trying to figure out how to make alien code work with your program. Want to parse some text? Write a text parser. Want to read web pages? You’d better know how to implement your own HTTP stack, including networking, DNS lookups, HTTP requests, and a dozen other things I also don’t know how to do. (Or you could import a library that might not do what you want, or might not have documentation, and might not even compile.)

I kept assuming tasks were going to be hard. I’d get half an hour into writing something from scratch, and then I’d realize there was already a tool for it that was effortless to import and completely intuitive to use. A lot of this project was less about programming and more about learning how to find out what (if any) programming needs to be done.

The best example of this is when I tried to write code to parse web pages. At first I did the naive thing:

  1. If you’re a new programmer that learned to code on a very high-level language with lots of convenience features, then the naive assumption is that there’s a library out there that will do all the work for you, and all you need is to copy a couple of lines of code from StackOverflow.
  2. If you’re a dusty old greybeard with knowledge of the Old Ways and ANSI C, then the naive thing is to assume you’ll need to do everything by hand, painstakingly juggling small blocks of memory and writing dozens of lines of code to accomplish simple things.

I was the second kind of naive. I wrote a text parser that would take the contents of an entire webpage as one big string and look for fragments I was interested in. For example, maybe I’m scraping data from Metacritic and I want to get the title of the game from the webpage. By inspecting the raw Metacritic HTML manually, I’ve discovered that the title of the game is contained in a <div> tag with a class of “gametitle”[7]. So the HTML code might look like:

<div class="gametitle">Shoot Guy IV: Shoot Harder</div>

So my program downloads the page, loads it into memory, and searches the HTML for “gametitle”. Then it scans forward for the nearby closing bracket “>”, and then searches for the next opening bracket “<”. In theory, the title of the game should be between those two points.
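Shamus doesn’t show his code, but the naive search he describes boils down to something like this. (A Python sketch of the idea for illustration; his actual version was C#.)

```python
def extract_title(html):
    """Naive substring search: find the 'gametitle' marker, then return
    whatever text sits between the next '>' and the '<' that follows it."""
    marker = html.find('gametitle')
    if marker == -1:
        return None
    tag_end = html.find('>', marker)    # end of the opening <div ...> tag
    next_tag = html.find('<', tag_end)  # start of the closing </div> tag
    return html[tag_end + 1:next_tag]

page = '<div class="gametitle">Shoot Guy IV: Shoot Harder</div>'
print(extract_title(page))  # Shoot Guy IV: Shoot Harder
```

It works on this page, today. As the next paragraph explains, that’s about the strongest guarantee it offers.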

The problem with this sort of approach is that it’s incredibly fragile. If the website suffers a redesign, then it could lead to chaos in my code. Maybe in the new design, the “gametitle” div is a container for the title of the game, plus the cover image, some publisher info, and some random branding logos. There’s no telling how my parser would handle that, and the odds are extremely high that it would extract a random block of HTML markup / CSS as the title of the game.

I knew this wasn’t the “Right” way to do it, but I was anxious to get the thing up and running before I began learning the “right” way to do things, which I assumed would take a long time.

The next day I came back to the project[8] and started looking for something to help me parse these web pages. I realized I was going to have to make different parsers for all the different websites I might need to deal with, and rather than making three or four parsers, it would probably be smarter to just bite the bullet and use someone else’s library.

The Lazy Way is Also the Right Way?

This is exactly what it looked like when I worked on this project, except I'm a man, I'm twice her age, I'm not in a wheelchair, I wasn't using a laptop, my office is never this bright, and I'm not a stock photo model. Okay, so this picture has nothing to do with the project. I just wanted to break up this wall of text.

As an old-school C / C++ programmer, my expectation is:

  • Spend ages going through a half dozen similar libraries. Some are in production but incomplete. Some are more complete but were abandoned a decade ago. Some seem more or less complete but have very little documentation in English. Spend a couple hours trying to figure out which of these seems like the least bad, and then download it.
  • Spend ages trying to figure out how to get this to compile, because there are a dozen ways to do this and everyone thinks their method is obvious / optimal.
  • Read the docs and figure out how to use the damn thing. Spend hours incorporating it into my code.
  • Discover that this library lacks some obvious, fundamental feature and I’m going to need to do some ugly workaround to fix it.
  • Get frustrated and disillusioned. Tell myself I’ll try one of the other libraries tomorrow.
  • Shelve the project and never come back to it.

That’s the workflow I’m used to for hobby projects. Here is what I actually experienced while working on this project:

  • I spend two minutes searching and discover that just about everyone uses Html Agility Pack. It promises to do everything I need and it doesn’t appear to be abandonware.
  • I’ve never used an external library in C# so I have to endure a 5-minute learning curve to figure out where you go to do this. It turns out there’s a handy package manager, like they have in Linux-land. Once I know how to find it and talk to it, the process is completely seamless. It downloads the code and I can start using it right away.
  • I read the docs and realize I barely need them. Everything is pretty straightforward.
  • I discover that Html Agility Pack contains far more features than I realized. Not only can it parse HTML for me, but it can fully understand the HTML and do complex searches for me. With one line of code I can do a complex query like, “Find the first element with the class of ‘gamelist’, then find the first <OL> element inside of THAT, and then return an array of all of the <LI> items inside of it.”
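That kind of structured query is what an HTML library buys you. Html Agility Pack does it with XPath in C#; purely as an illustration of the same query shape, here is Python’s standard-library ElementTree run over a hypothetical, XML-clean snippet. (Real pages are rarely this tidy, which is exactly why dedicated HTML parsers like Html Agility Pack exist.)

```python
import xml.etree.ElementTree as ET

# A tidy, well-formed stand-in for the hypothetical page markup.
snippet = """<html><body>
  <div class="gamelist">
    <ol>
      <li>Shoot Guy IV: Shoot Harder</li>
      <li>Shoot Guy V: The Shootening</li>
    </ol>
  </div>
</body></html>"""

root = ET.fromstring(snippet)
# Find the first element with class "gamelist", the first <ol> inside
# THAT, and collect the text of every <li> inside it -- one query chain.
gamelist = root.find(".//div[@class='gamelist']")
titles = [li.text for li in gamelist.find("ol").findall("li")]
print(titles)  # ['Shoot Guy IV: Shoot Harder', 'Shoot Guy V: The Shootening']
```

Note that the query targets the page’s *structure* rather than raw character offsets, which is what makes it so much sturdier than the hand-rolled substring search.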

Even though I didn’t know anything about the library, I didn’t know how to obtain and use libraries, and wasn’t sure what I was doing, this way was faster and easier than what I did yesterday. As a bonus, it’s way less code. Yesterday’s parser code was about a page long. This one is less than a dozen lines of code.

I feel vaguely guilty. I feel like a gardener who’s been shoving around a manual push reel mower for his entire career and now I discover someone has been giving away free riding mowers for the last 20 years. I don’t know if I feel guilty for using this decadently easy system, or if I feel guilty that I spent two decades of my life breaking my back with this ancient hunk of metal when easier alternatives were free for the taking. Maybe somehow I feel both kinds of guilt at the same time.

The other thing that made this trivial is that my performance requirements were incredibly lax. If this program was going to be running at scale on a dedicated server, then I might need to worry about efficiency. Maybe I’d need to watch the memory footprint, or do something with multiple threads, or whatever. But this program was going to use my mid-tier residential internet connection with a single IP address. Network throughput will always be the bottleneck in that setup, so any other optimizations exist only as amusements to gratify the programmer’s particular obsessions or passions. You can optimize that text parser until it runs like Carmack-level assembly code, but it’ll never make the program faster in a way that will be detectable to the user.

Next time I’ll talk about what the scraper is actually doing. If you thought this one was boring, just wait until I start talking about databases.

 

Footnotes:

[1] To the genuine annoyance of some.

[2] Anecdotes extrapolated into “data”.

[3] Or should we read quarterly reports aimed at shareholders, and swallow THEIR slop?

[4] I’d sort of assumed that Unity-specific stuff would have Unity-specific includes, but it’s also possible Unity comes bundled with some third-party things and conventions.

[5] Invisible to the game developer. I’m going to assume people working on the engine can see their own code.

[6] Okay, there’s probably a little bit of stuff the program does that’s invisible to a regular C# programmer, but that’s NOTHING compared to the gargantuan task that Unity does when it creates a window, launches a rendering pipeline, initializes the sound system, loads assets, and a thousand other things.

[7] It’s more complex than this in practice, but this works as an example.

[8] And perhaps to my senses.




90 thoughts on “Scraping Part 1: Easy Mode”

  1. Pink says:

    I dunno what counts as ‘exciting’ really, but this article at the very least tempted me to give something along similar lines a try.

  2. Infinitron says:

    My goodness, why would you not do something like this with Python

    1. JustAnotherProgrammer says:

      My goodness, why would you not do something like this with Java

      1. Mikko Lukkarinen says:

        My goodness, why would you not do something like this with PHP?

        …On second thought, never mind.

        1. tmtvl says:

          My goodness, why would you not do something like this with COBOL?

          Let’s keep the trainwreck rolling.

          1. Decius says:

            My goodness, why would you not do something like this in assembly?

            1. pseudonym says:

              My goodness, why would you not do something like this in Vimscript?

              1. LCF says:

                My goodness, why would you not do something like this in Malebolge?

                1. Pylo says:

                  My goodness, why would you not do something like this using a butterfly?

              2. Mousazz says:

                My goodness, why would you not do something like this in Brainfuck?

    2. CJK says:

      Because you don’t want to learn Python?
      Because you do want to learn C#?
      Because you already know enough Python to know that you don’t get along with Python?
      Because you know Python, like Python, but want to learn a new language and this is just a fun project?

      I feel like the article sufficiently explained that it was reason (2), but I really do feel like there are lots of reasons, and that it’s not at all obvious that Python is the “best” for the job anyway.

      (Personally I want to like Python, but I can’t get on board with the mandatory indentation scheme. I like to use a lot of whitespace, but I don’t necessarily want to use it where Python says I should. Brackets for program flow, please, whitespace for me, the human, at my discretion)

      1. John says:

        I’ve been dabbling in Python a little lately, and the whitespace thing is kind of annoying. I cut a block of code that had been in some function or loop or something and pasted it elsewhere. Suddenly, my program didn’t work any more. I thought it was a scope problem at first and so I spent several fruitless minutes trying to figure out what hadn’t been properly defined or initialized. Nope. The code was fine. It was just indented one time too many. Urgh.

        1. Erik says:

          Indentation is an adjustment, but I’ve found that it’s seamless after the transition period. Once you get used to looking for improper indentation, it’s no harder than looking for a missing brace (and in some bracing styles, a lot easier).

          Modern libraries are an amazing transition for old-school coders like me and apparently Shamus. It’s in large part because modern languages are almost all not pre-compiled to the target hardware, just pre-compiled down to an intermediate byte code that can be run by some form of runtime execution engine: JVM, .Net Framework, Python runtime, embedded Lua interpreter, etc. Not binding at compile time may make you vulnerable to updates in your chosen services, but it means so much less hassle and cruft in actually incorporating it into your system. You don’t have to compile it yourself and hope that you and the author have compilers with compatible options (or you’re debugging compilers *shudder). You can just say “import numpy” or “package require binconvert”, and you’re running.

          They are really transformative in the programming process, and I get to see this more closely than most because I work in embedded systems. For my PC-side tools that communicate with my target, I use Tcl/Tk or Python to throw together diagnostic scripts and GUIs, grabbing libraries with abandon. For the target, I have to hand-craft C code that will fit all the functions, UI, hardware drivers, tests, and diagnostic features into under 48K. No library will help me there (though a robust internal design system helps immensely). I can’t roll any extra features in – I’ve been running with under 200 bytes free for the last 6 months, and down to 34 bytes free on the last release. It does help curb feature creep amazingly when the first question you ask when they ask for a new feature is “what do you want removed for this to fit?”

          1. John says:

            I come from a Java background. Swing is well-documented and Oracle has an excellent, extensive, and easy-to-find set of Swing tutorials. I wish I could say the same about Tcl/Tkinter. Did you know that Tkinter is case sensitive? I don’t mean the contents of the library, I mean the title. It’s “Tkinter” if you’re using Python 2 but “tkinter” if you’re using Python 3. As far as I know, this fact is not documented anywhere. I discovered it completely accidentally after a morning of wondering why my import statements weren’t working when I followed the tutorial exactly and Tkinter is supposed to be packaged with Python. I spent who knows how long uninstalling, redownloading and reinstalling Tkinter before I stumbled on to the truth. Then there’s the way that Tkinter handles on-click events for buttons and . . . argh! I don’t want to get into it. May it suffice to say that if I have issues with Python then they are mostly Tkinter’s fault. The indentation stuff doesn’t compare.

    3. Echo Tango says:

      1. Non-statically-typed languages let you make mistakes that you’ll only discover at runtime, and which will require you to pull your hair out to debug them. In trade, you get to avoid declaring your types, interfaces, etc, which is straight-forward.

      2. You lose the ability to have your code auto-formatted, so everything either looks like garbage, or you waste time manually formatting it, for the trade-off of just having some brackets around code-blocks.

      3. The whole ecosystem of libraries is filled with deeply-inherited, tightly-coupled code, which is a mess to try to understand. If they gave you a strict interface, you could ignore that as the library-maintainers’ problem, but they don’t, so you can’t. Now you’re dealing with #1 here, too.

      Python is a mess. Stay the hell away.

      1. Chad Miller says:

        re: 2 – https://github.com/psf/black

        You could argue that this can’t autoindent because indentations are syntactically relevant, but the counter-counter argument is that with a braced language you still have to manually tell the language what to indent, but with the added cost of having redundant markup on top of the indentation that would already be there.

        1. Richard says:

          The counter-counter-counter arguments are basically:

          I also want to indent things that aren’t program flow and aren’t scopes, because it makes it easier to read if X lines up vertically with Y.

          I want to stuff the trivial one-liner functions { all on one line }.

          On that note, I quite like C#’s new “get set” idiom to automatically generate the trivial getters and setters. I am quite sad that it’s not available in the dialect of C# currently used by Unity 2018 (latest LTS release)

          1. Chad Miller says:

            I also want to indent things that aren’t program flow and aren’t scopes, because it makes it easier to read if X lines up vertically with Y.

            I can’t think of an example of this ever happening to me but I can believe it occasionally happens. Generally speaking though, you can indent arbitrarily in the middle of things like lists, tuples, dictionaries etc. (that won’t work with autoformatted code, but I’d also have to wonder how you’d reconcile “autoformatted code” with “arbitrary indentation” in any language)

            I want to stuff the trivial one-liner functions { all on one line }.

            You often can. Python has semi-colons; they’re just not a popular feature. But there’s nothing stopping you from writing:

            def f(x): c = 2 * x; return c + 3 # I know, silly example

            I generally only do this sort of thing in the REPL but it’s valid in actual code also.

        2. Echo Tango says:

          Two extra brackets is a very, very small price to pay, compared with the hassle of indenting every single line that’s inside those brackets. As noted elsewhere in this thread, you have to pay this price every time you copy some code from a differently-indented place in Python, but the price only gets paid once in a bracketed language.

          1. Chad Miller says:

            Barring some situations that required me to edit HTML/JS in Notepad, I think every editor I’ve used in the last decade or so has a key combination for “indent or dedent the code I have highlighted”

            1. Echo Tango says:

              You need to hit that key combination for every block you’re moving, for every level of indent. Golang’s auto-formatter does an entire file, completely, not just one piece, not just one level of indent.

              1. Echo Tango says:

                Sorry I mis-spoke, it does all your files.

              2. Chad Miller says:

                You need to hit that key combination for every block you’re moving, for every level of indent.

                Nope! If I decide to take some code and stuff it in a while loop, the steps look like:

                * Type “while:”
                * highlight the code I need to indent
                * press Tab once

                (just double checked it in both notepad++ and VS Code which are all my casual self has used in the last year or so)

          2. Philadelphus says:

            In my experience, copying large amounts of differently-indented code is a good reminder to stop and think about what you’re doing; is there a better way to accomplish what you want without copying lots of code around and repeating yourself? Perhaps you could do this in a more Pythonic way? Maybe write a new function or class?

            Ok, sometimes it’s unavoidable. You’ve decided to move something in or out of a loop, or otherwise need to change the indentation. As pointed out, many IDEs are aware of this and a simple Tab or Shift-Tab on the entire highlighted section will fix your problem (as many can guess at the level of indentation needed). That’s fewer keystrokes needed than for a pair of brackets.

            Finally, this is an incredibly minor “price” to pay for having readable-at-a-glance scope. Trying to chase down brackets that have gone off the screen gives me a much bigger headache any day than having to, on occasion, remember to hit Tab. So this is very much making a mountain out of a molehill.

            For reference, I’m three years into a PhD in astrophysics written entirely in Python. I’ll hardly claim to be an expert, but I’ve been making a living off of it for several years. You may disagree with its design choices—and that’s fine—but don’t try to make it out to be some unusable mess when it most definitely is not.

          3. Richard says:

            Having just spent a day in Python…

            – Moving a block of code from one scope to another is really painful, and certain to result in error when it contains sub-scopes, eg, moving this lot:
            if:
                if:
                    do_stuff
                else:
                    do_other_stuff
            else:
                do_third_stuff
            always_do_this

            – It is extremely difficult to spot the end of a scope and match it up with the beginning, because there’s no explicit marker saying “Done!”.
            When there’s a gap between, maybe the next contentful line is indented differently, maybe it isn’t.

            A generic smart text editor can highlight begin/end tokens, but it can’t help me at all with Python.

            I assume a Python-enabled IDE can help with both of these things, but it means it needs to actually understand Python rather than the ‘pair the tokens’ system that works for pretty much everything else.

            1. Chad Miller says:

              It is extremely difficult to spot the end of a scope and match it up with the beginning, because there’s no explicit marker saying “Done!”.

              I’ve noticed that Python style guides recommend indenting 4 spaces while most dynamic languages consider 2 spaces to be the standard. I suspect this is why.

              I assume a Python-enabled IDE can help with both of these things, but it means it needs to actually understand Python rather than the ‘pair the tokens’ system that works for pretty much everything else.

              As per my comments elsewhere in this thread, if the code was already indented properly before you need to move it a lot of editors are able to handle “indent this arbitrary block of code one level further” which doesn’t require anything specific to Python.

              1. Richard says:

                And one level less?

                TBH I really thought ‘meaningful whitespace’ had been universally accepted as a Bad Idea by now, if only because there’s so many different types of whitespace.
                Tabs (0x09) vs Spaces (0x20) is just the beginning…

                However, changing Python at this point would be an even worse idea.

                1. Chad Miller says:

                  And one level less?

                  Highlight, shift+tab

                  TBH I really thought ‘meaningful whitespace’ had been universally accepted as a Bad Idea by now, if only because there’s so many different types of whitespace.

                  I mean, as I mentioned elsewhere, there are ways to lessen Python’s dependence on indentation level but the fact that these methods go almost entirely unused implies that users generally don’t consider it a problem. (I’ve noticed a similar dynamic in Haskell, though that language is unpopular for entirely different, yet valid, reasons)

                  Tabs (0x09) vs Spaces (0x20) is just the beginning

                  Most Python projects and style guides straight up ban the tab character, with the interpreter even turning tab+space mixing into errors: https://www.python.org/dev/peps/pep-0008/#tabs-or-spaces

            2. Leeward says:

              I’ve been using python for almost 20 years, and I’ll agree it’s got its fair share of problems, but significant indentation doesn’t even show up on the list. Use a decent editor (or vim) and you won’t have issues.

              The main thing is not to try to make python into something it’s not. Need a fast (to write) script to run once then throw away? Python’s great. Need to write a mission critical piece of infrastructure or something to run on a memory constrained system? Pick something else.

              A web scraper is a great thing to write in Python. My current project at work is about 4k lines of Python, and it’s bumping up against the limits of what I want to do in Python.

              On the other hand, I’ve never thought to myself that a project was getting too big for C.

    4. ElementalAlchemist says:

      Because Python is a tool of Satan and everyone that uses it should be shot into the Sun?

  3. Adam says:

    If you’re doing web scraping, please be a nice citizen and follow robots.txt conventions. In particular, don’t hit web pages as fast as you possibly can because it’s surprisingly easy to crash and/or denial-of-service smaller webservers and sites.

    1. Echo Tango says:

      That only matters for small websites. Shamus is crawling big beefy corporate sites, which would withstand his scraping trivially. The bigger danger is that Shamus gets robo-detected and blocked[1], because they don’t want him stealing their data. :)

      [1] A quick googling shows me these two companies offer help with bot-detection. I’m sure I could find more, if I knew the correct industry terms. ^^;

      1. Paul Spooner says:

        I think he’s also built a random multi-second delay between requests, so he doesn’t get IP blocked.

        1. Echo Tango says:

          Some blockers also look at how you’re getting the data, not just how frequently. So for example, curl or other command-line tools, running some browser in headless mode or in a VM. They can detect your browser window size, version, plugins, and other things to fingerprint you, and differentiate you from normal traffic. It’s a pretty sophisticated game of cat-and-mouse nowadays between people trying to shut down scrapers[1], and the people running scrapers.

          [1] Wrongly presumed, or genuinely a bad actor.

          1. Richard says:

            An approach that’s rather doomed because any scraper can trivially pretend to be any browser with any set of plugins on any size ‘monitor’ desired.

            Or indeed, the scraper can actually be Chromium. Just with no actual real display. Or human.

            I suspect the ‘stop the scrapers’ techniques probably mean that none of the web crawlers read robots.txt anymore – because a ‘real’ browser won’t do that.

            1. epopisces says:

              How you handle scraping is industry dependent of course, but in many cases the worst issues stem from ‘scrapers 2.0’: automated shopping bots which buy up high demand product (concert tickets, limited release Adidas shoes, etc) to profiteer on the secondary markets, preventing legitimate human customers from buying direct.

              It’s one of the reasons for Ticketmaster’s dominance, it employs a WAF (web application firewall) with anti-bot protections that many small venues can’t afford.

              ‘Bot fingerprinting’ is sophisticated, and depending on the level of protection can even use things like mouse cursor tracking to detect bot activity. In many cases you aren’t trying to catch every bot: just the most popular/prolific ones. Some humans with matching characteristics will unfortunately get blocked (or sent to a ‘sorry site is too busy to handle your request’ or similar), but it really is a tradeoff decision. Cat and mouse is a good way to put it, as Echo Tango said.

              1. Richard says:

                The automated buy-all-the-tickets bots can really only be defeated off-line, by removing the secondary market.

                If the touts can re-sell the tickets at huge markup, they can spend a lot of time, effort and money on making bots. Or they can hire Mechanical Turk to do it, which will always break through.

                If the touts can’t re-sell the tickets, they won’t bother running the bots that make it impossible for real fans to buy tickets.

                Before the current ‘situation’, a lot of the large tours were trying to do that.
                The ‘simplest’ is that all tickets are for a specific person, which must be specified at purchase.
                If you can’t go, you hand the ticket back to the event, and they resell it at face value to another fan.
                Once it’s resold they refund you.

                One interesting way I saw of doing that was that you simply upload a photo of yourself when buying the ticket, and they check the photo in their database matches the ticket holder at the event gates.
                I thought that was quite good – it’s easy for fans, you can still buy a ticket for any other person you want and it’s minimum-knowledge – they don’t need a real name or any external info at all to verify that the ticketholder is (very likely) the person who the ticket was bought for.

  4. Steve C says:

    Surprisingly, manual push reel mowers are much easier to use than a ‘regular’ lawnmower – and I do mean in terms of physical effort. It’s kind of amazing how much easier they are, given initial impressions. Well… if you use a good one (aka an old solid one, from before plastic). They are more like using a push broom than a lawnmower.

    Don’t knock it until you’ve used one. The perfect tool for a small lawn. I imagine it’s like code. Certain languages are perfect for certain tasks.

    1. raifield says:

      I made this same post before I saw yours. I bought a Fiskars push mower years ago. Haven’t regretted it.

    2. John says:

      I’ve got one of those. It’s pretty nifty. I’d hate to have to mow a large lawn that way, but it’s perfect for my small city lot. Possibly my favorite thing about it is that it’s so light that I can pick it up and carry it. When I need to go from the front to the back yard (or vice versa) I have to go up and down some narrow concrete steps and being able to lift and carry the mower is extremely convenient. I don’t know what I’d do if I had to get over those steps with the kind of gas-powered mower I used on my family’s lawn when I was a teenager.

    3. Lino says:

      I think it all depends on the person. I have a friend who’s tried all manner of mowers – both electric and otherwise, yet he always goes back to using his scythe, and he says that it’s the best tool for getting an even cut.

      1. The Puzzler says:

        Does he have a black cloak and a strangely pale face?

        1. baud says:

          And does he talk IN ALL CAPS? Even if he’s always awfully nice.

          1. tmtvl says:

            If so, ask him to tell Terry we all love him and miss him very much.

            1. Lino says:

              Will do. While I’m at it, I’ll ask him how Susan is doing.

    4. Echo Tango says:

      They’re easy, but they don’t handle tall grass. At least, the ones like Shamus linked, with the rotating motion; the center of the cylinder is too short. For years, I’ve been going past the mower aisle at hardware stores, every time I needed to buy something else, hoping to find a push-mower with a mechanism like a swather on a farm. :S

    5. Erik says:

      I grew up using those. You are only correct for a reasonably strong man, doing a reasonably small yard, reasonably frequently maintained.

      As a 10-year-old boy doing a very large lawn, the push mower was a demon from hell. The electric was not miles better, but at least I wasn’t throwing all of my 75 pounds against the mower, fighting to keep the blades turning against thick wet grass.

      1. Steve C says:

        Huh. Interesting. I also formed my opinion based on my experiences as a 10 year old boy as I was using my grandfather’s. I might have been younger. Maybe you didn’t have a good one?

        I’ve used crap push reel mowers and I’ve used good ones. The difference between the two is like night and day. A crap one locks the blades to the wheels. IE wheels turn, blade spins. Wheels stop, blade stops. A good one both gears up the blade and allows it to spin freely. IE wheels turn, blade spins much faster than the wheels. Wheels stop, or are dragged backwards, blades continue to spin fast. The difference is that with the first type you are fighting the grass directly. With the second type you bring the blade up to speed over grass that has already been cut, and it carries into uncut grass on its own flywheel power. So on very heavy grass you sweep it back and forth in short motions like a broom. Which sounds like a lot of effort, but is surprisingly easy if the mower is properly maintained.

        To be fair, I haven’t seen a good push reel mower in decades. They just don’t make them like that anymore. However I wouldn’t want to use one on a big lawn for the same reason I wouldn’t want to use a push broom on a big parking lot. The crap ones can die in a fire.

    6. Thomas says:

      I had a friend who looked at me like I was crazy for suggesting he get a push lawn mower. But the truth is, if you’ve got a small city / town garden and are reasonably fit, they’re often the best solution. Less bulky and you don’t have to mess about with cables and plugs, and good storage isn’t as critical.

  5. Abnaxis says:

    When I did something similar like 15 years ago (wife was doing a paper on FOSS user groups, basically sorta kinda looking at the language they used to describe their groups and classifying them along ideological lines) I used a DOM parser library in Python.

    It was the work of an afternoon to get a basic version running, and the tool had all the capabilities you’re espousing, because DOM parsers have basically existed as long as XML has.

    All this to say: I’m not 100% sure how much of your experience can be attributed to C# and how much can be attributed to the fact that programmers have been parsing markup damn near as long as we’ve been sorting lists. It makes a huge difference when you’re not stuck in “specialized hobbyists game-dev and graphics” land no matter what language you use.
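Abnaxis’s point generalizes: even Python’s standard library ships an HTML parser, no third-party download required. A minimal sketch of the kind of extraction Shamus did by hand with string searches — the `gametitle` class name here is just an example echoing his description, and `html.parser` is event-based rather than a full DOM, so this is a simplification (void tags like `<br>` would need extra handling):

```python
from html.parser import HTMLParser

# Collect the text of every element whose class attribute contains
# "gametitle", including text inside nested tags like <a> or <i>.
class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # >0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "gametitle" in classes:
            self.depth += 1
            self.titles.append("")
        elif self.depth:
            self.depth += 1  # a tag nested inside a match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.titles[-1] += data

page = '<div class="gametitle"><a href="/game/42">Some <i>Great</i> Game</a></div>'
p = TitleExtractor()
p.feed(page)
print(p.titles)  # ['Some Great Game']
```

Note how the nested `<i>` tag is handled for free — that’s exactly the case that breaks naive find-the-next-bracket string searching.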

    1. Richard says:

      To be fair, C++ has about fifteen or twenty popular DOM parsers, which immediately causes the “which one do I USE!?” problem.

      ‘Cos if you pick wrong and change to another, you’ll have to redo a lot, and you won’t know that you picked wrong until you’re quite far down the road.

      There’s also some toolkits based around Chromium itself – or one way to do this kind of thing is to write a Chromium plugin and let Google deal with the whole ‘getting the data and building a DOM’ thing.

      But they’re all orthogonal to the “Learn C#” problem.

      1. Abnaxis says:

        …I mean, how do you get a DOM parser wrong? Just pick the one from the 15-20 with the coding style you like and run with it

        And yeah, I see no problem with using a toy project like this to learn C#. I’m just suggesting a bit of skepticism before using this project as a proper gauge for how easy it is to make “real” code in C#. No offense to people who do it for a living, but I wouldn’t count web parsing as a “real” project.

        1. tmtvl says:

          I would highly recommend you watch r0ml Lefkowitz’ talk “Literacy: the Shift from Reading to Writing”, it may change how you think about these kinds of “toy programs.”

          1. Abnaxis says:

            Erm, did I imply something I didn’t mean to? How did you get anything about “how I think about toy programs” from my post?

            For the record, I think projects like this are fine to familiarize yourself with dev tools and maybe pick up a skill here and there even if you’re probably re-writing the same piece of code that’s been put to silicon a million times by now. I’m just cautioning against extrapolating those same lessons too far to generalizations about the language overall.

  6. Robert Conley says:

    Visual Basic 6 had some of this with its numerous libraries and controls. However, Microsoft really took this to the next level with C#, VB.NET, and the .NET library. Now that we are nearly 20 years in, the variety of different libraries and what they can do is amazing.

    Before he passed away my father wanted a word puzzle solver. He played these contests where you have a grid of letters and a list of words that you had to find. I made a program using VB.NET and the .NET Framework that displayed a grid where he could type in the puzzle and the words, and it would find them all for him.

    Also he bought and sold penny stocks so I wrote another program that used the Google finance program to pull down the data he wanted and crunched the numbers with a formula he wanted to use.

    Both were far easier than what I had to deal with C in the early nineties or VB6 in the late nineties.

  7. raifield says:

    My wife made fun of me when I purchased a new Fiskars manual reel mower a few years ago. I figured it would be good exercise, far quieter than a motorized mower, and over time, cheaper.

    Well, I was right. They’re great for small yards and not difficult to push or use.

  8. Lino says:

    I’d talk you out of reading more, but we both know you’re going to read this stupid thing no matter what I say.
    So Let’s just get this over with.

    Oh, Shamus! You know us so well!!!

    Also, typolice:

    hour into writing something by scratch

    Should be “from scratch”

    If you thought this one was boring, just wait until I start talking about databases.

    I can’t wati!

  9. ydant says:

    Welcome to the web scraping rabbit hole.

    Unfortunately, you’re entering in the post-Web-2.0 world. Before 2.0, everything was basic HTML and super-simple. Web 2.0 was mostly easy – web developers actually strove to be easy to work with, and “standards” evolved around consistent and predictable data access. But now we’re in the post-2.0, scraper-hostile web. Web developers are actively working against you – either in the name of convenience (dynamic frameworks) or with active anti-scraping technology.

    Future rants to include:

    CAPTCHAs
    JavaScript/dynamically loaded/rendered content
    Dynamically changing HTML structure / class names / IDs
    Rate limiting / IP blocking

    Then you’ll start down the path of using an embedded browser. Then you’ll figure out those tend to be a pain in the ass to interact with and unstable.

    But doing this in a game engine? That’s cool (if a bit… weird). You could write a game / 3D progress monitor around the scraping.

    Agreed with the other comment – please try to respect robots.txt and rate limit yourself. Overly aggressive bots certainly haven’t helped with building goodwill between the scrapers and the scraped.
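The “rate limit yourself” advice is cheap to implement. A minimal self-throttling sketch — the 100 ms delay here is only to keep the example quick; a polite scraper would use seconds between requests, or whatever the site’s robots.txt asks for:

```python
import time

class Throttle:
    """Ensure at least `delay` seconds pass between successive requests."""
    def __init__(self, delay):
        self.delay = delay
        self.last = None    # monotonic timestamp of the previous request

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            elapsed = now - self.last
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last = time.monotonic()

throttle = Throttle(delay=0.1)  # 100 ms for demo purposes only
start = time.monotonic()
for url in ["page1", "page2", "page3"]:
    throttle.wait()
    # fetch(url) would go here
total = time.monotonic() - start
print(round(total, 1))
```

The first call returns immediately; each later call sleeps off whatever remains of the delay, so bursts are smoothed out even when individual fetches are fast.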

    1. Shamus says:

      To be clear, I’m NOT using a game engine. Like I said in the post, I wanted to experience C# away from Unity, so I started a project that was just vanilla C#.

      1. ydant says:

        Ah, I see now. I read too quickly and misunderstood that paragraph to be you explaining why you used Unity (as in it made it quicker to get started by abstracting away boilerplate code). Of course, it says the exact opposite, and I clearly need to work on my speed-reading comprehension.

        I did a quick search, and it looks like you wouldn’t be the first person to do web scraping in Unity, so there’s still hope for a web scraping game.

    2. Echo Tango says:

      Just use mechanical turk! :D

      1. Kyle Haight says:

        That’s part of one of my answers to the somewhat offbeat interview question “What’s the worst sorting algorithm you can think up?”: Mine some bitcoin, use it to hire a person via mechanical turk, have the person sort the list.

        1. tmtvl says:

          Well, it’s more consistent than bogosort, so it’s not entirely terrible.

  10. John says:

    This has been a learning experience for me as well. When Shamus posted his Metacritic data set, I wrote a Java program to read and interpret the file. Text parsing is not something I do very often or think about very much. It’s hard. Even Shamus’s fairly consistently-formatted data file had a few quirks and inconsistencies that threw me for a loop. The file had comma-separated values, but some of those values were strings with commas in them. There were also a few records that were split over two lines. I came to the realization that I could either spend who knew how long writing code to handle those relatively rare cases or I could spend much less time manually editing the data file to remove the offending commas and unwanted carriage returns. I opted for the latter. I also learned how to incorporate external libraries–in this case, statistical routines from Apache Commons–into my Java projects. (I’ve technically done that before with libGDX, but I’ve always let the libGDX setup tool do all the work for me. This was the first time I’ve ever done it manually.) It turned out to be pretty straightforward–download the jar file, add the jar file to the classpath, import classes as necessary–and much less scary than Shamus’ stories have always made it sound.
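The commas-inside-values problem John hit is exactly what CSV quoting conventions exist for, and every mainstream CSV library handles it — here sketched with Python’s standard `csv` module (the game rows are made-up sample data):

```python
import csv
import io

# A value containing a comma is unambiguous as long as the field is quoted.
data = 'title,year,score\n"Mario Kart 8 Deluxe",2017,92\n"I, Robot",2004,54\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows[2])  # ['I, Robot', '2004', '54']
```

The catch is that a file which does *not* quote its embedded commas is genuinely ambiguous — no library can fix that — so if Shamus’s export left them bare, John’s manual cleanup was the pragmatic call.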

    The limited amount of statistical testing I did before I got distracted by other things turned out to be inconclusive, by the way. I computed the difference between the average reviewer score and the average Metacritic user score for each game in the data set. Then I looked at the distribution of the difference for each year, using a modified Pearson’s statistic to test that the distribution was normal (i.e., Gaussian, i.e., that it had the expected bell curve). In some years, the distribution appeared to be normal. In other years it did not. There was no obvious pattern. Given that so many of the years had non-normal distributions, the usual parametric tests for changes in distribution over time aren’t appropriate. (There will be no F-tests, for those of you who know what those are.) Non-parametric testing is the obvious next step, but I just haven’t gotten around to it yet.

    I have, however, done a bit of data visualization. I computed a histogram of the difference for each year, color-coded the cells based on relative frequency, and compiled the results in this handy-dandy graphic. A bright cell contains a lot of games. A dim cell contains very few games. I computed the difference by subtracting the average Metacritic user score from the average reviewer score, so, loosely speaking, a negative number means that on average Metacritic users liked a game more than reviewers and a positive number means that on average reviewers liked a game more than Metacritic users. Please note that a cell’s color does not correspond in a linear fashion to the relative frequency. I applied a square-root transformation to make things more visible. Almost all the cells were very dim before the transformation. (If anyone would like the raw histograms, just let me know.)
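The square-root trick John describes is a standard way to keep low-count histogram cells visible when one cell dominates. A tiny sketch with hypothetical cell counts (not John’s real data):

```python
import math

# Hypothetical histogram cell counts for one year.
counts = [1, 2, 4, 50, 400, 120, 9, 2, 1]
peak = max(counts)

# Linear scaling: the biggest cell dominates and small cells go nearly black.
linear = [c / peak for c in counts]
# Square-root scaling compresses the top end, lifting the dim cells.
sqrt_scaled = [math.sqrt(c / peak) for c in counts]

print([round(v, 2) for v in linear])
print([round(v, 2) for v in sqrt_scaled])
```

A cell holding 1 game out of a 400-game peak gets brightness 0.0025 linearly but 0.05 after the square root — still dim, but no longer invisible.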

  11. Cilba Greenbraid says:

    You might be on to a real idea for a schtick there, Shamus: Going Full Lemony Snicket. PLEASE READ SOMETHING ELSE.

    You’ve already carved your niche as The Entertaining Nitpick Guy; this is The Internet, so you might as well crank it up to 11. :)

    For the record, *I* thought this was an entertaining post.

  12. Retsam says:

    So my program downloads the page, loads it into memory, and I have it search the HTML for “gametitle”. Then I look forward for the nearby closing bracket “>”. Then I’d search for the next opening bracket “<”. In theory, the title of the game should be between those two points.

    At least you didn’t try to parse HTML with regular expressions.

    1. Echo Tango says:

      Yeah, leave the regexes to the people writing the HTML-parsing libraries! :P

    2. Duffy says:

      I have a problem, I bet I can solve it with regex!

      Now I have two problems.

    3. tmtvl says:

      POSIX-style regexes are very limited and awkward to work with. PCRE-style regexes are fine, but kinda clunky and dated. PSIX-style regexes are amazing and wonderful and can do anything.

  13. Vertette says:

    “It’s either guessing based on personal experience, or we chow down on the PR slop the various publishers feed us”
    Now as a special treat courtesy of our friends at EA, please help yourself to this slop!

    Either way, having this data is actually pretty interesting. Either you learn something new or your worst fears are reaffirmed. Win/win, really.

  14. King Marth says:

    I highly recommend Feed43 for anyone using RSS feeds and interested in web scraping. You plug in a URL and specify your pattern, and the service generates RSS entries with whatever matches that pattern as the page content changes over time. Very useful for creating notifications for arbitrary websites, I first found this when tracking small webcomics.

    You won’t get any huge data sets from this service, but it’s a fun way to play around with pattern matching with a very low barrier to entry.

  15. Erik says:

    I think “anecdata” may be my new favorite neologism. I must steal this.

  16. Ander says:

    As one of the young naive programmers who expects finding an HTML parser for C# to be easy, this post helped me understand why Shamus is so often upset about third-party libraries.

    1. Thomas says:

      Same! As someone who only uses things like R and Python for basic data science / stats projects, I’ve always found libraries to be wonderful things.

  17. baud says:

    Interestingly, just the other week I modified an open-source program that used Html Agility Pack. Apparently it wasn’t parsing part of an HTML document / not loading part of it into the DOM, and since that was where the info I wanted was located, the software wasn’t working. In the end, I just did a quick and dirty hack, forcing Html Agility Pack to reparse the text, and it worked, this time loading the content I was interested in. It’s true that HAP works really well (barring the issue I found) and has a lot of available functionality.

    It was also the first time working with C#, but coming from Java, it wasn’t hard and since it was a quick fix in an existing project, the scope wasn’t enormous. And I already knew how to use Visual Studio.

  18. tmtvl says:

    Welcome to the world of 1995, when CPAN revolutionized the world by bringing all programmers everywhere together.

    1. Echo Tango says:

      I used Perl at one job, for DNA-sequencing computation stuff. It’s pretty decent! :)

      1. tmtvl says:

        I think this is the part where I do evangelism for Raku, but I don’t think this is the right channel for that kind of stuff.

        Still, between CPAN (for Perl), Crates (for Rust), Guix (for Guile), and Zef (for Raku), I don’t think I’ve ever had trouble getting any libraries working.

  19. Innards says:

    Some of the sites might have public facing APIs you could query for the data as JSON directly instead of scraping it. IMDB has a famous one https://imdb-api.com/API
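When a site does offer a JSON API, the parsing step shrinks to almost nothing compared to scraping HTML. A sketch with Python’s standard `json` module — the response body and its field names here are entirely made up for illustration, not any real API’s schema:

```python
import json

# A made-up response body, shaped like a typical review-data endpoint.
body = '''
{
  "results": [
    {"title": "Example Game", "year": 2019, "critic_score": 84, "user_score": 6.7},
    {"title": "Another Game", "year": 2020, "critic_score": 71, "user_score": 8.1}
  ]
}
'''
data = json.loads(body)
titles = [g["title"] for g in data["results"]]
print(titles)  # ['Example Game', 'Another Game']
```

No tag-matching, no malformed-markup recovery — the data arrives already structured, which is why an official API is always worth checking for first.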

    Unsurprisingly, Javascript also has a lot of libraries and methods for parsing DOM content and using headless browsers for scraping, like Puppeteer.

    Not trying to imply you should have learned a whole new language instead of experimenting with raw C# you were curious about anyway, just throwing it out there because I hadn’t seen that option mentioned in the comments.

    1. Retsam says:

      Yeah, it’s good to mention the “DOM scraping” approach, because it’s fundamentally different than the “HTML scraping” approach that Shamus is using, and it’s basically necessary for dealing with more complicated websites.

      HTML is a text document which describes the initial shape of the website (i.e. a recipe), while the DOM (document object model) is actually the data structure that’s built from the HTML, which can then be dynamically changed by JS code: any interactive parts of the UI, (and sometimes, the parts that come from a database) are often not actually in the HTML code but are instead built at runtime by JS.

      So “HTML scraping” downloads a text document and parses the format of that document into a static (“DOM”-like) structure. With DOM scraping, the program actually opens the website in a browser engine and can interact with the website like a user: so it can do more complex stuff like “load this page, type in this search query, wait for the page to fill in the results, then process them”.
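Retsam’s distinction can be shown concretely: an HTML scraper only ever sees the initial document, so data that JavaScript would insert at runtime simply isn’t there. A small sketch using Python’s stdlib parser (the page is a made-up example):

```python
from html.parser import HTMLParser

# A page where the interesting data is filled in by JavaScript at runtime.
page = """
<div id="results"></div>
<script>
  // In a real browser this would run and populate #results with <li> items,
  // e.g. document.getElementById('results').innerHTML = '<li>Game One</li>';
</script>
"""

# Count <li> elements the way an HTML scraper would see them.
class ResultCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = 0
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1

p = ResultCounter()
p.feed(page)
print(p.items)  # 0 -- the list items would only exist after the script ran
```

A DOM scraper driving a real browser engine (Puppeteer, Selenium, and the like) would execute that script first and see the populated list — which is exactly why it’s needed for dynamic sites.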

  20. Lars says:

    If C# is that easy, I can understand how Minna Sundberg, a total beginner at programming, can get so far with her JRPG game within just a few months. And that as a side project in addition to her webcomic.

  21. Narida says:

    All I could think was welcome to programming in the 21st century, Shamus.

  22. Sleeping Dragon says:

    So here’s something I took from this as a non-programmer.

    Hey kids, see how easy it is to make these things? This is why you don’t put anything sensitive on your blog, tumblr or facebook page thinking “eh, what could happen? It’s not like Evil Haxxorz are gonna be checking every random blog for this stuff”.

  23. Scerro says:

    A quick note from a DBA that ends up getting dragged into web stuff:

    Depending on who you’re scraping, they might really, really hate you. We block web scrapers sometimes that go overboard on their request numbers.

    1. Sven says:

      At the very least, anyone writing a scraper needs to obey robots.txt if present.
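Obeying robots.txt doesn’t even require writing a parser — Python ships one in the standard library, and similar packages exist for C#. A sketch (the rules and the “MyScraper/1.0” agent name are made up; normally you’d call `set_url()`/`read()` to fetch the real file instead of parsing a string):

```python
import urllib.robotparser

# Parse a robots.txt body directly; these rules are illustrative.
rules = """
User-agent: *
Crawl-delay: 10
Disallow: /private/
Allow: /
"""
rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/games/list"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/data")) # False
print(rp.crawl_delay("MyScraper/1.0"))                                   # 10
```

Checking `can_fetch()` before each request, and honoring `crawl_delay`, covers most of what site operators are asking for.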

  24. ChaoticBlue says:

    If you thought this one was boring, just wait until I start talking about databases.

    You sure know how to get a man excited. :)
    I can’t wait.

  25. Sven says:

    I’m curious, are you using .Net Framework or .Net Core for this? It probably won’t matter much for a small project like this, and most .Net Framework stuff will transfer to .Net Core, but if your goal is to future-proof your C# skills, .Net Core is the way to go.

    1. Shamus says:

      It’s alarming how often I find myself saying “I don’t know” these days.

      In this case, I didn’t know the distinction existed and I don’t even know how to find out.

      Am I going backward? Is this blog just going to end in 2032 with me writing Hello World in Tandy Basic?

      1. Sven says:

        If you don’t know, you’re probably using .Net Framework, and as long as you’re writing software for Windows, the distinction doesn’t matter that much.

        .Net Framework is the original version of .Net. It’s only for Windows, and various versions have been bundled with Windows over time. The latest version was .Net Framework 4.8, and MS is really only releasing security updates for it at this point.

        .Net Core is a new, open source, cross-platform version of .Net, written from the ground up. It natively supports not just Windows but also Linux, MacOS, etc. It has a more modern project management system, and is more friendly to command-line development (.Net Framework can be used without Visual Studio, but it’s a hassle). This is where all the new development in .Net is going these days. The latest version is .Net Core 3.1.

        .Net Framework and .Net Core are getting closer to feature parity, which is reflected in the fact that the next major version of .Net Core will be called “.Net 5”, dropping the Core moniker.

        There’s also a thing called .Net Standard, which is a formal specification of .Net APIs. The reason for its existence is that .Net Framework and .Net Core aren’t the only .Net implementations out there: there’s also Unity (as you know), Mono (an older, third-party open source cross platform implementation of .Net), Xamarin (a tool for building apps targeting multiple platforms like iOS and Android), and UWP (Windows 10’s store app format). By targeting .Net Standard, .Net libraries can be used by all of these. For example, Html Agility Pack comes in a .Net Standard version.

  26. paercebal says:

    Welcome to the world of C#.
    :-)

    As a C++ programmer at heart, I can say that C# has all the sweet spots, unlike its older, redneck-er Java sibling. The team who designed it was led by Anders Hejlsberg, who has spent his life designing successful languages/environments. And it shows in C#.

    His interview on artima (https://www.artima.com/intv/anders.html) is eye-opening. As is the “Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries” book, if you’re into that sort of thing.

    On the more practical side, if you’re working on Windows, using Visual Studio, I strongly encourage you, once you tire of the console, to try GUI programming using WPF: this strange cousin to HTML has all the power of DirectX behind it, and being able to restyle regular comboboxes as a hand of 3D flashing cards, with just a bit of styling and no performance hit, is awe-inspiring. Did I mention the baked-in 3D objects like vectors, viewports, etc.?

    Anyway, welcome again… You’re in for a treat.
    :-)
