{"id":49835,"date":"2020-04-28T06:00:36","date_gmt":"2020-04-28T10:00:36","guid":{"rendered":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49835"},"modified":"2020-04-29T03:55:03","modified_gmt":"2020-04-29T07:55:03","slug":"scraping-part-2-full-control","status":"publish","type":"post","link":"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=49835","title":{"rendered":"Scraping Part 2: Full Control"},"content":{"rendered":"<p>So there are thousands of webpages that have information we want. When faced with this problem, ancient civilizations used to go to these pages using Internet Explorer 6 and copy the data into Notepad. We don&#8217;t know what they did with it after that, because they got eaten by Woolly Mammoths or conquered by Mongols or whatever. I&#8217;m not a historian so I might be slightly off with my timeline, but you get the basic idea: <em>The past was hard<\/em>.<\/p>\n<p>But now we have these newfangled web scrapers that can surf the web for you and harvest whatever data you like. The problem is that putting the data into Notepad isn&#8217;t terribly helpful. Great, now you have an enormous text file of random facts. Are you going to sit down and read it manually? Probably not. So what do we do? Write <b>another<\/b> program to read <b>that<\/b> file? You need to turn this text into data sooner or later, and to do that we need to put it into a database.<\/p>\n<p><!--more--><\/p>\n<p>Honestly, I&#8217;m not totally sure what I&#8217;m trying to accomplish with this project. All I know is that I want to put this data into a format where I can manipulate it and look for patterns<span class='snote' title='1'>And actually, I&#8217;m just trying to do something plausibly useful in C#.<\/span>. To do this, I need to stick the information into a database. <\/p>\n<h3>Databases are Cool<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_server_room.jpg' width=100% alt='I couldn&apos;t find a stock photo of an abstract concept of a database, so here&apos;s a random woman leaning against the windows of the server room in a way that looks cool but is probably against some stupid rule.' title='I couldn&apos;t find a stock photo of an abstract concept of a database, so here&apos;s a random woman leaning against the windows of the server room in a way that looks cool but is probably against some stupid rule.'\/><\/div><div class='mouseover-alt'>I couldn&apos;t find a stock photo of an abstract concept of a database, so here&apos;s a random woman leaning against the windows of the server room in a way that looks cool but is probably against some stupid rule.<\/div><\/p>\n<p>A database will let you do a query like:<\/p>\n<p>&#8220;In alphabetical order, list the title of all PC games that came out between 2006 and 2010 that scored more than 50 on Metacritic.&#8221;<\/p>\n<p>In fact, that exact question would look like this:<\/p>\n<pre lang=\"SQL\">SELECT title FROM `games` WHERE YEAR(release_date) >=2006 and \r\n  YEAR (release_date) <= 2010 and score_critic>50 and \r\n  platform=\"PC\" ORDER BY title\r\n<\/pre>\n<p>(Currently, 777 games in my database meet this criteria.)<\/p>\n<p>It&#8217;s not quite plain English, but I still find the language of SQL to be amazingly readable.<\/p>\n<p>I&#8217;ve talked about an API before. &#8220;Application programming interface&#8221;. Basically, it&#8217;s a way for different systems to talk to each other. Normally I&#8217;d have to hunt down an API for this, but since we&#8217;re working in C# I evidently already have it, it&#8217;s effortless to use, and it&#8217;s really well-documented. <\/p>\n<p>I look around nervously. Is this okay? Am I cheating somehow? What&#8217;s the catch here<span class='snote' title='2'>Actually, the catch is that I don&#8217;t get access to the low-level power of C++, but on a project like this that&#8217;s not even a disadvantage.<\/span>? In the last entry someone asked why I didn&#8217;t use Python for this. The obvious answer is that I&#8217;m not fluent in Python, which would slow me down. The less obvious answer is that I can&#8217;t imagine how it can get any easier than this. <\/p>\n<p>Anyway. Back to using an API&#8230;<\/p>\n<p>Let&#8217;s say you&#8217;re on Twitter even though it&#8217;s <a href=\"https:\/\/www.shamusyoung.com\/twentysidedtale\/?p=46541\">terrible<\/a>. Let&#8217;s also say you want to reply to <a href=\"https:\/\/twitter.com\/mike_acton\/status\/1227288980443426818\">some random tweet<\/a>. You want to reply with something witty and incisive. So you hit the reply button and type &#8220;no your wrong dumass lol&#8221; into the text box. Then you hit the &#8220;Send&#8221; and smile with satisfaction, knowing that you&#8217;ve once again enriched the world with your boundless wisdom.<\/p>\n<p>But let&#8217;s say you&#8217;re using an API so a program can spread your potent wisdom-nuggets far and wide. I don&#8217;t know what the real Twitter API looks like and I couldn&#8217;t possibly care enough to look it up, but here&#8217;s a mock-up:<\/p>\n<pre lang=\"csharp\">\r\n\/\/I don't know why I hardcoded this to reply to a single tweet.\r\n\/\/Maybe I should re-think this design.\r\nint reply_to_id = 1227288980443426818;\r\nstring message = \"no your wrong dumass lol\";\r\nTweetSend (reply_to_id, message);<\/pre>\n<p>Whatever. Something like that. Maybe if you wanted to send a tweet that wasn&#8217;t a reply you could set the reply message to 0.<\/p>\n<pre lang=\"csharp\">\r\nTweetSend (0, \"why do I have so few followers when these rich sexy famous ppl get so many this system is so rigged lol\");\r\n<\/pre>\n<p>Obviously a real bot would need to do complicated stuff like <b>logging in<\/b> before it could initiate wisdom-sharing mode, but you hopefully get the idea. The controls that are normally available to users are turned into bits of code that your program can call.<\/p>\n<p>What I find fascinating is that the SQL API doesn&#8217;t work this way. If you want a list of all PS4 games in my database, you&#8217;d type this into the SQL console:<\/p>\n<pre lang=\"SQL\">SELECT title FROM `games` WHERE platform='PS4'<\/pre>\n<p>But if you&#8217;re a program, then you would accomplish that by&#8230; doing the exact same thing. The syntax of SQL is so ridiculously complex that&#8230;<\/p>\n<p>Actually, it would be easier if I just showed you:<\/p>\n<pre lang=\"SQL\">SELECT\r\n\r\n[ALL | DISTINCT | DISTINCTROW ]\r\n\r\n[HIGH_PRIORITY]\r\n\r\n[STRAIGHT_JOIN]\r\n\r\n[SQL_SMALL_RESULT] [SQL_BIG_RESULT] [SQL_BUFFER_RESULT]\r\n\r\n[SQL_CACHE | SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]\r\n\r\nselect_expr [, select_expr] ...\r\n\r\n[into_option]\r\n\r\n[FROM table_references\r\n\r\n[PARTITION partition_list]]\r\n\r\n[WHERE where_condition]\r\n\r\n[GROUP BY {col_name | expr | position}\r\n\r\n[ASC | DESC], ... [WITH ROLLUP]]\r\n\r\n[HAVING where_condition]\r\n\r\n[ORDER BY {col_name | expr | position}\r\n\r\n[ASC | DESC], ...]\r\n\r\n[LIMIT {[offset,] row_count | row_count OFFSET offset}]\r\n\r\n[PROCEDURE procedure_name(argument_list)]\r\n\r\n[into_option]\r\n\r\n[FOR UPDATE | LOCK IN SHARE MODE]\r\n\r\n\r\ninto_option: {\r\n\r\nINTO OUTFILE 'file_name'\r\n\r\n[CHARACTER SET charset_name]\r\n\r\nexport_options\r\n\r\n| INTO DUMPFILE 'file_name'\r\n\r\n| INTO var_name [, var_name] ...\r\n\r\n}<\/pre>\n<p>Warning: Don&#8217;t actually try to read all of that. It&#8217;s not what it says that matters, it&#8217;s how much space it takes up. That huge block of impenetrable options is the documentation for the SQL SELECT statement. Just SELECT! On top of that are countless other commands you can use, each with their own complex syntax and list of options.<\/p>\n<p>Each line above represents an option for obtaining, filtering, sorting, comparing, and presenting information. Trying to turn all of that into a proper API would be a nightmare. And documenting that API would be another, extra nightmare.<\/p>\n<p>So rather than create this endless collection of functions to represent all the possible things you can do with SQL, the entire API just shrugs and allows you to send it blocks of text like a user. It&#8217;ll even return the same error codes if you make the same mistakes.<\/p>\n<p>This is completely hilarious to me. <\/p>\n<h3>Perhaps an Analogy Will Help<\/h3>\n<p><div class='imagefull'><img src='https:\/\/www.shamusyoung.com\/twentysidedtale\/images\/stock_empty_office.jpg' width=100% alt='Ugh. I&apos;m glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.' title='Ugh. I&apos;m glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.'\/><\/div><div class='mouseover-alt'>Ugh. I&apos;m glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.<\/div><\/p>\n<p>Imagine this scene:<\/p>\n<div class=\"script\">INT &#8211; Generic Tech Company Office &#8211; FRIDAY AFTERNOON<\/p>\n<p><em>According to the clock on the wall, it&#8217;s late afternoon. The colorless, windowless office is mostly empty. A young coder sits at his desk, wearing a wrinkled shirt and a leftover 70s tie that&#8217;s hanging at half-mast. He&#8217;s propped up on one elbow and typing with one hand, searching for images of sunshine and beaches. He keeps slamming into the office&#8217;s automated content blocking system.<\/p>\n<p>Suddenly, THE BOSS strides out of her office. She&#8217;s wearing an expensive suit with sunglasses perched on her forehead. She&#8217;s got a head of meticulously cared-for Karen hair and face incapable of smiling. She pulls the keys to her sports car out of her handbag and walks past Coder without a word.<\/p>\n<p>Once she&#8217;s halfway to the door, she stops and turns.<\/em><\/p>\n<h4>BOSS:<\/h4>\n<p>Oh! I almost forgot&#8230; I have something I need you to do.<\/p>\n<h4>CODER:<\/h4>\n<p>(Stirs from his catatonic state.) What, me?<\/p>\n<h4>BOSS:<\/h4>\n<p>What was your name again?<\/p>\n<p><em>(CODER opens his mouth to answer, but she dismisses him with a hand-wave.)<\/em><\/p>\n<h4>BOSS:<\/h4>\n<p>Whatever. That&#8217;s not important right now. We need you to make an interface for a self-driving car.<\/p>\n<h4>CODER:<\/h4>\n<p>(Looks at calendar.) Uh&#8230; really? How long do we have?<\/p>\n<h4>BOSS:<\/h4>\n<p>Until Monday, so you&#8217;ve got the whole weekend if you need it. (She swings down her sunglasses and pivots towards the door.)<\/p>\n<h4>CODER:<\/h4>\n<p>Wait! I mean&#8230; (flounders for a few seconds) What are the specs? What does the user need to be able to do?<\/p>\n<h4>BOSS:<\/h4>\n<p>(She sighs and reluctantly turns back to face him.) I think that should be obvious. They just need to be able to tell the car where to go. If they want to go to the spa, it&#8217;ll take them to the spa. Or the beach. Or home. Whatever.<\/p>\n<h4>CODER:<\/h4>\n<p>Yeah, but what level of granularity do they need? Do they need to be able to specify specific routes? Choose between scenic or fast routes? Do they need to control the speed? Climate control? The radio? The horn? Lights?<\/p>\n<h4>BOSS:<\/h4>\n<p>Full control.<\/p>\n<h4>CODER:<\/h4>\n<p>Full control? Like, what do you mean?<\/p>\n<h4>BOSS:<\/h4>\n<p>(Firmly.) They should be able to direct the car to do anything that they could do themselves.<\/p>\n<p>She spins on her heel and heads for the exit. A blinding beam of golden sunlight enters the room as she shoves open the door, which gradually fades again as it eases closed. There&#8217;s a dull thud as it hits home, and now it somehow seems even darker in here than before she left. <\/p>\n<p>Coder stares shell-shocked at his screen, motionless. The office is now silent aside from the ticking of the clock.<\/p>\n<p>We hold on this angle for an EXCESSIVELY LONG TIME as the camera very gradually dollys in. Finally we stop just as his face fills the view. Finally, he smiles slightly and begins nodding his head.<\/p>\n<p>SLAM CUT TO:<\/p>\n<p>INT &#8211; Generic Tech Company Office &#8211; MONDAY MORNING<\/p>\n<p><em>Boss is looking at the phone app that Coder created over the weekend. It&#8217;s the image of a driver&#8217;s view of a car interior, evidently drawn freehand in MS Paint. It shows the wheel, pedals, gear shift, console controls, etc. All the little controls are animated when you swipe \/ tap on them. This is connected to the car on a 1:1 basis, so you control the &#8220;self driving&#8221; car by manipulating this picture of the driver&#8217;s seat.<\/p>\n<p>Furious, Boss storms out of her office and over to Coder&#8217;s desk. His chair is empty, and his PC is running a slideshow of beach selfies. There&#8217;s a sticky note on his monitor that says &#8220;FULL CONTROL&#8221;.<\/em><\/p>\n<p>&#8211; Fin<\/p><\/div>\n<p>I suppose I made the SQL API sound bad with this analogy. I actually quite like it. I just find it hilarious that the &#8220;interface&#8221; is simply direct access to the human controls. It does indeed give the programmer &#8220;full control&#8221;, but this isn&#8217;t how an API is normally designed.<\/p>\n<p>But here it works. In fact, I can&#8217;t imagine how this system could be improved. Most of the APIs I&#8217;ve used have fallen into one of two broad categories:<\/p>\n<ol>\n<li>Low-level systems like rendering, sound, and device input. An API like this will tend to be sort of inscrutable in its design and very exacting about what you can do and how your input needs to be formatted. The API is generally a small number of highly technical functions. They&#8217;re built around a vanilla C paradigm where you need to share direct memory pointers with the software.<\/li>\n<li>High-level stuff like the user interface. The API has many, many functions to remember, but they&#8217;re all pretty obvious in their usage. These APIs tend to adhere to a very strict <a href=\"?p=35692\">OOP<\/a> design.<\/li>\n<\/ol>\n<p>But SQL is something else entirely. It&#8217;s possible that there are a lot of APIs like this out there, but I&#8217;ve never run into them because of my quasi-gamedev background. The SQL query language is really complex and powerful, so the programmer is simply given the freedom to use that language to communicate with the database. <\/p>\n<p>So now we need to pull some data out of our database, so the first thing we need to do is-<\/p>\n<h3>Am I Forgetting Something?<\/h3>\n<p>Hang On, I seem to have skipped a step here. We can&#8217;t take anything OUT of the database until we put something INTO the database. And we can&#8217;t put anything into the database until we scrape for the data.<\/p>\n<p>I&#8217;m afraid I&#8217;ve led you astray in this entry. Next time we&#8217;re going to back up and harvest our data from the web. Hopefully we can do this without getting caught or banned by the sites we&#8217;re scraping. We&#8217;re definitely getting close to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Grey_hat\">Grey Hat<\/a> territory here, particularly if my bot is ill-behaved.<\/p>\n<p>I promise the next entry will be less flagrantly disorganized.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So there are thousands of webpages that have information we want. When faced with this problem, ancient civilizations used to go to these pages using Internet Explorer 6 and copy the data into Notepad. We don&#8217;t know what they did with it after that, because they got eaten by Woolly Mammoths or conquered by Mongols [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[66],"tags":[],"class_list":["post-49835","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49835","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=49835"}],"version-history":[{"count":30,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49835\/revisions"}],"predecessor-version":[{"id":49865,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=\/wp\/v2\/posts\/49835\/revisions\/49865"}],"wp:attachment":[{"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=49835"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=49835"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.shamusyoung.com\/twentysidedtale\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=49835"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}