{"id":49806,"date":"2020-04-23T06:00:12","date_gmt":"2020-04-23T10:00:12","guid":{"rendered":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49806"},"modified":"2020-04-23T09:45:31","modified_gmt":"2020-04-23T13:45:31","slug":"49806","status":"publish","type":"post","link":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49806","title":{"rendered":"Scraping Part 1: Easy Mode"},"content":{"rendered":"<p>You might remember a couple of months ago I posted <a href=\"?p=49109\">a bunch of charts of video game data<\/a>. The obvious question that went unanswered in those posts<span class='snote' title='1'>To the genuine annoyance of some.<\/span> was, &#8220;Where did this data come from?&#8221; So let&#8217;s talk about that.<\/p>\n<p>Actually, before we talk about that I should make it clear that this is a programming project. I should note that this project pre-dates that <a href=\"?p=49749\">crazy stuff I was doing with BSP loading<\/a> a couple of weeks ago, but I&#8217;m posting them in the opposite order. For some reason.<\/p>\n<p>Maybe reading about yet another programming project <strong>sounds<\/strong> fun, but this isn&#8217;t a game-focused project with cool screenshots to show off. This is pretty dry and you&#8217;ve already seen <a href=\"?p=49109\">the end result<\/a>. I&#8217;d talk you out of reading more, but we both know you&#8217;re going to read this stupid thing no matter what I say. So let&#8217;s just get this over with.<\/p>\n<p><!--more--><\/p>\n<p>For years, I&#8217;ve been wondering about the stuff we&#8217;re always discussing \/ arguing about in gaming culture. The division between fans and critics. The difference between platforms. The changes to the industry over time.<\/p>\n<p>The problem is that we never have any numbers to work with. We just sloppily take our anecdata<span class='snote' title='2'>Anecdotes extrapolated into &#8220;data&#8221;.<\/span> and project it onto the industry as a whole. 
Just about everyone realizes this isn&#8217;t a scientific way of going about things, but we don&#8217;t really have any alternatives. It&#8217;s either guessing based on personal experience, or we chow down on the PR slop the various publishers feed us<span class='snote' title='3'>Or should we read quarterly reports aimed at shareholders, and swallow THEIR slop?<\/span>.<\/p>\n<p>Do particular DRM schemes impact audience reaction or sales? Do console generations impact PC sales? Do single-player games with tacked-on multiplayer actually sell \/ score higher than games without those features? Does review-bombing impact sales, or is the practice just a harmless but cathartic way of expressing outrage? It feels like critics and consumers have been drifting apart in terms of what they say about games, but is that perceived gap reflected in the review scores?<\/p>\n<p>I suppose at the root of it was a general curiosity about the decision-making happening at the big publishers. We can&#8217;t see what game budgets are, we don&#8217;t have access to reliable sales figures, and without those numbers we have no way of even guessing about how much particular games are making or losing. Sites like <a href=\"https:\/\/www.vgchartz.com\/\">VGChartz<\/a> and <a href=\"https:\/\/steamspy.com\/\">SteamSpy<\/a> give us some estimates to play around with, but for the most part we&#8217;re stuck in the dark.<\/p>\n<p>However, it seemed like there was <b>some<\/b> data out there. We can&#8217;t answer all our questions, but maybe we can fill in a few more blanks. Wikipedia has a lot of information on game features and developers. Steam has information on DRM and system requirements. And of course Metacritic has the key information regarding critical reception.<\/p>\n<p>So the obvious question is: If there&#8217;s a bunch of data available to the public, then why don&#8217;t we just round it up? 
(Preferably without having to do it by hand.)<\/p>\n<h3>How Do You Do That?<\/h3>\n<p>The process of having a program load web pages and pull out desired information is called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\">Web Scraping<\/a>. I&#8217;ve never written a web scraper before, but I&#8217;d always wanted to try it out. It just seems like a fun idea to have a program surf the web for you and bring back a great big haul of information. Maybe, deep down, this project was more about my desire to write a web scraper than to study the resulting data. But this project seemed like a fun way to satisfy both of these curiosities.<\/p>\n<p>As I discovered, the process of building a web scraper is pretty easy. For a project at this small scale, I&#8217;d even say it goes from &#8220;easy&#8221; to &#8220;trivial&#8221;. All told, this whole project was much less than a week of work. If you handed this project off to someone who knew what they were doing, they could probably finish in a couple of days.<\/p>\n<p>In the old days, I would have done this with C++. But now <a href=\"?p=38840\">I&#8217;ve spent time with Unity<\/a> and learned just enough C# to be dangerous. Since that project I&#8217;ve wanted to play around with C# apart from Unity so I could get a feel for what C# is &#8220;really&#8221; like. The environment that comes with Unity has a ton of game-specific features, and it&#8217;s not always clear to a newbie which things you&#8217;re using are &#8220;standard C#&#8221; and which bits come with Unity<span class='snote' title='4'>I&#8217;d sort of assumed that Unity-specific stuff would have Unity-specific includes, but it&#8217;s also possible Unity comes bundled with some third-party things and conventions.<\/span>. In Unity projects, the engine controls the loop. Tens of thousands of lines of invisible<span class='snote' title='5'>Invisible to the game developer. 
I&#8217;m going to assume people working on the engine can see their own code.<\/span> code might be run before Unity gets around to reaching the bits of the program you&#8217;ve written. In vanilla C#, program execution begins and ends with your code<span class='snote' title='6'>Okay, there&#8217;s probably a little bit of stuff the program does that&#8217;s invisible to a regular C# programmer, but that&#8217;s NOTHING compared to the gargantuan task that Unity does when it creates a window, launches a rendering pipeline, initializes the sound system, loads assets, and a thousand other things.<\/span>, and I wanted to get a feel for how that worked.<\/p>\n<h3>The Hardest Thing is Realizing how Easy it is.<\/h3>\n<p>The biggest thing that held me back was my learned habits. I&#8217;m used to the C++ world where you need to do everything by hand or spend time trying to figure out how to make <a href=\"?p=9557\">alien code<\/a> work with your program. Want to parse some text? Write a text parser. Want to read web pages? You&#8217;d better know how to implement your own HTTP stack, including networking, DNS lookups, HTTP requests, and a dozen other things I also don&#8217;t know how to do. (Or you could import a library that might not do what you want, or might not have documentation, and might not even compile.)<\/p>\n<p>I kept assuming tasks were going to be hard. I&#8217;d get half an hour into writing something from scratch, and then I&#8217;d realize there was already a tool for it that was effortless to import and completely intuitive to use. A lot of this project was less about programming and more about learning how to find out what (if any) programming needs to be done.<\/p>\n<p>The best example of this is when I tried to write code to parse web pages. 
At first I did the naive thing:<\/p>\n<ol>\n<li>If you&#8217;re a new programmer who learned to code in a very high-level language with lots of convenience features, then the naive assumption is that there&#8217;s a library out there that will do all the work for you, and all you need is to copy a couple of lines of code from StackOverflow.<\/li>\n<li>If you&#8217;re a dusty old greybeard with knowledge of the Old Ways and ANSI C, then the naive thing is to assume you&#8217;ll need to do everything by hand, painstakingly juggling small blocks of memory and writing dozens of lines of code to accomplish simple things.<\/li>\n<\/ol>\n<p>I was the second kind of naive. I wrote a text parser that would take the contents of an entire webpage as one big string and look for fragments I was interested in. For example, maybe I&#8217;m scraping data from Metacritic and I want to get the title of the game from the webpage. By inspecting the raw Metacritic HTML manually, I&#8217;ve discovered that\u00a0the title of the game is contained in a &lt;div&gt; tag with a class of &#8220;gametitle&#8221;<span class='snote' title='7'>It&#8217;s more complex than this in practice, but this works as an example.<\/span>. So the HTML code might look like:<\/p>\n<pre lang=\"html\"><div class=\"gametitle\">Shoot Guy IV: Shoot Harder<\/div><\/pre>\n<p>So my program downloads the\u00a0page, loads it into memory, and I have it search the HTML for &#8220;gametitle&#8221;.\u00a0 Then I look forward for the nearby closing bracket &#8220;&gt;&#8221;. Then I search for the next opening bracket &#8220;&lt;&#8221;. In theory, the title of the game should be between those two points.<\/p>\n<p>The problem with this sort of approach is that it&#8217;s incredibly fragile. If the website suffers a redesign, then it could lead to chaos in my code. 
Maybe in the new design, the &#8220;gametitle&#8221; div is a container for the title of the game, plus the cover image, some publisher info, and some random branding logos. There&#8217;s no telling how my parser would handle that, and the odds are extremely high that it would extract a random block of HTML markup \/ CSS as the title of the game.<\/p>\n<p>I knew this wasn&#8217;t the &#8220;Right&#8221; way to do it, but I was anxious to get the thing up and running before I began learning the &#8220;right&#8221; way to do things, which I assumed would take a long time.<\/p>\n<p>The next day I came back to the project<span class='snote' title='8'>And perhaps to my senses.<\/span> and started looking for something to help me parse these web pages. I realized I was going to have to make different parsers for all the different websites I might need to deal with, and rather than making three or four parsers, it would probably be smarter to just bite the bullet and use someone else&#8217;s library.<\/p>\n<h3>The Lazy Way is Also the Right Way?<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_lazy.jpg' width=100% alt='This is exactly what it looked like when I worked on this project, except I&apos;m a man, I&apos;m twice her age, I&apos;m not in a wheelchair, I wasn&apos;t using a laptop, my office is never this bright, and I&apos;m not a stock photo model. Okay, so this picture has nothing to do with the project. I just wanted to break up this wall of text.' title='This is exactly what it looked like when I worked on this project, except I&apos;m a man, I&apos;m twice her age, I&apos;m not in a wheelchair, I wasn&apos;t using a laptop, my office is never this bright, and I&apos;m not a stock photo model. Okay, so this picture has nothing to do with the project. 
I just wanted to break up this wall of text.'\/><\/div><div class='mouseover-alt'>This is exactly what it looked like when I worked on this project, except I&apos;m a man, I&apos;m twice her age, I&apos;m not in a wheelchair, I wasn&apos;t using a laptop, my office is never this bright, and I&apos;m not a stock photo model. Okay, so this picture has nothing to do with the project. I just wanted to break up this wall of text.<\/div><\/p>\n<p>As an old-school C \/ C++ programmer, my expectation is:<\/p>\n<ul>\n<li>Spend ages going through a half dozen similar libraries. Some are in production but incomplete. Some are more complete but were abandoned a decade ago. Some seem more or less complete but have very little documentation in English. Spend a couple hours trying to figure out which of these seems like the least bad, and then download it.<\/li>\n<li>Spend ages trying to figure out how to get this to compile, because there are a dozen ways to do this and everyone thinks their method is obvious \/ optimal.<\/li>\n<li>Read the docs and figure out how to use the damn thing. Spend hours incorporating it into my code.<\/li>\n<li>Discover that this library lacks some obvious, fundamental feature and I&#8217;m going to need to do some ugly workaround to fix it.<\/li>\n<li>Get frustrated and disillusioned. Tell myself I&#8217;ll try one of the other libraries tomorrow.<\/li>\n<li>Shelve the project and never come back to it.<\/li>\n<\/ul>\n<p>That&#8217;s the workflow I&#8217;m used to for hobby projects. Here is what I <em>actually experienced<\/em> while working on this project:<\/p>\n<ul>\n<li>I spend two minutes searching and discover that just about everyone uses <a href=\"https:\/\/html-agility-pack.net\/\">Html Agility Pack<\/a>. It promises to do everything I need and it doesn&#8217;t appear to be abandonware.<\/li>\n<li>I&#8217;ve never used an external library in C# so I have to endure a 5-minute learning curve to figure out where you go to do this. 
It turns out there&#8217;s a handy package manager, like they have in Linux-land. Once I know how to find it and talk to it, the process is completely seamless. It downloads the code and I can start using it right away.<\/li>\n<li>I read the docs and realize I barely need them. Everything is pretty straightforward.<\/li>\n<li>I discover that Html Agility Pack contains far <b>more<\/b> features than I realized. Not only can it parse HTML for me, but it can fully understand the HTML and do complex searches for me. With one line of code I can do a complex query like, &#8220;Find the first element with the class of &#8220;gamelist&#8221;, then find the first &lt;OL&gt; element inside of THAT, and then return an array of all of the &lt;LI&gt; items inside of it.&#8221;<\/li>\n<\/ul>\n<p>Even though I didn&#8217;t know anything about the library, I didn&#8217;t know how to obtain and use libraries, and wasn&#8217;t sure what I was doing, this way was faster and easier than what I did yesterday. As a bonus, it&#8217;s way less code. Yesterday&#8217;s parser code was about a page long. This one is less than a dozen lines of code.<\/p>\n<p>I feel vaguely guilty. I feel like a gardener who&#8217;s been shoving around a <a href=\"https:\/\/www.youtube.com\/watch?v=e5vCJz3mK6w\">manual push reel mower<\/a> for his entire career and now I discover someone has been giving away free riding mowers for the last 20 years. I don&#8217;t know if I feel guilty for using this decadently easy system, or if I feel guilty that I spent two decades of my life breaking my back with this ancient hunk of metal when easier alternatives were free for the taking. Maybe somehow I feel both kinds of guilt at the same time.<\/p>\n<p>The other thing that made this trivial is that my performance requirements were incredibly lax. If this program was going to be running at scale on a dedicated server, then I might need to worry about efficiency. 
Maybe I&#8217;d need to watch the memory footprint, or do something with multiple threads, or whatever. But this program was going to use my mid-tier residential internet connection with a single IP address. Network throughput will always be the bottleneck in that setup, so any other optimizations exist only as amusements to gratify the programmer&#8217;s particular obsessions or passions. You can optimize that text parser until it runs like Carmack-level assembly code, but it&#8217;ll never make the program faster in a way that will be detectable to the user.<\/p>\n<p>Next time I&#8217;ll talk about what the scraper is actually doing. If you thought <strong>this<\/strong> one was boring, just wait until I start talking about databases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You might remember a couple of months ago I posted a bunch of charts of video game data. The obvious question that went unanswered in those posts was, &#8220;Where did this data come from?&#8221; So let&#8217;s talk about that. 
Actually, before we talk about that I should make it clear [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[66],"tags":[],"class_list":["post-49806","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49806","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=49806"}],"version-history":[{"count":12,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49806\/revisions"}],"predecessor-version":[{"id":49818,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49806\/revisions\/49818"}],"wp:attachment":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=49806"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=49806"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=49806"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}