Scraping Part 2: Full Control

By Shamus Posted Tuesday Apr 28, 2020

Filed under: Programming 83 comments

So there are thousands of webpages that have information we want. When faced with this problem, ancient civilizations used to go to these pages using Internet Explorer 6 and copy the data into Notepad. We don’t know what they did with it after that, because they got eaten by Woolly Mammoths or conquered by Mongols or whatever. I’m not a historian so I might be slightly off with my timeline, but you get the basic idea: The past was hard.

But now we have these newfangled web scrapers that can surf the web for you and harvest whatever data you like. The problem is that putting the data into Notepad isn’t terribly helpful. Great, now you have an enormous text file of random facts. Are you going to sit down and read it manually? Probably not. So what do we do? Write another program to read that file? You need to turn this text into data sooner or later, and to do that we need to put it into a database.

Honestly, I’m not totally sure what I’m trying to accomplish with this project. All I know is that I want to put this data into a format where I can manipulate it and look for patternsAnd actually, I’m just trying to do something plausibly useful in C#.. To do this, I need to stick the information into a database.

Databases are Cool

I couldn't find a stock photo of an abstract concept of a database, so here's a random woman leaning against the windows of the server room in a way that looks cool but is probably against some stupid rule.
I couldn't find a stock photo of an abstract concept of a database, so here's a random woman leaning against the windows of the server room in a way that looks cool but is probably against some stupid rule.

A database will let you do a query like:

“In alphabetical order, list the title of all PC games that came out between 2006 and 2010 that scored more than 50 on Metacritic.”

In fact, that exact question would look like this:

SELECT title FROM `games` WHERE YEAR(release_date) >=2006 AND 
  YEAR (release_date) <= 2010 AND score_critic>50 AND 
  platform="PC" ORDER BY title

(Currently, 777 games in my database meet this criteria.)

It’s not quite plain English, but I still find the language of SQL to be amazingly readable.

I’ve talked about an API before. “Application programming interface”. Basically, it’s a way for different systems to talk to each other. Normally I’d have to hunt down an API for this, but since we’re working in C# I evidently already have it, it’s effortless to use, and it’s really well-documented.

I look around nervously. Is this okay? Am I cheating somehow? What’s the catch hereActually, the catch is that I don’t get access to the low-level power of C++, but on a project like this that’s not even a disadvantage.? In the last entry someone asked why I didn’t use Python for this. The obvious answer is that I’m not fluent in Python, which would slow me down. The less obvious answer is that I can’t imagine how it can get any easier than this.

Anyway. Back to using an API…

Let’s say you’re on Twitter even though it’s terrible. Let’s also say you want to reply to some random tweet. You want to reply with something witty and incisive. So you hit the reply button and type “no your wrong dumass lol” into the text box. Then you hit the “Send” and smile with satisfaction, knowing that you’ve once again enriched the world with your boundless wisdom.

But let’s say you’re using an API so a program can spread your potent wisdom-nuggets far and wide. I don’t know what the real Twitter API looks like and I couldn’t possibly care enough to look it up, but here’s a mock-up:

//I don't know why I hardcoded this to reply to a single tweet.
//Maybe I should re-think this design.
int reply_to_id = 1227288980443426818;
string message = "no your wrong dumass lol";
TweetSend (reply_to_id, message);

Whatever. Something like that. Maybe if you wanted to send a tweet that wasn’t a reply you could set the reply message to 0.

TweetSend (0, "why do I have so few followers when these rich sexy famous ppl get so many this system is so rigged lol");

Obviously a real bot would need to do complicated stuff like logging in before it could initiate wisdom-sharing mode, but you hopefully get the idea. The controls that are normally available to users are turned into bits of code that your program can call.

What I find fascinating is that the SQL API doesn’t work this way. If you want a list of all PS4 games in my database, you’d type this into the SQL console:

SELECT title FROM `games` WHERE platform='PS4'

But if you’re a program, then you would accomplish that by… doing the exact same thing. The syntax of SQL is so ridiculously complex that…

Actually, it would be easier if I just showed you:

SELECT
 
[ALL | DISTINCT | DISTINCTROW ]
 
[HIGH_PRIORITY]
 
[STRAIGHT_JOIN]
 
[SQL_SMALL_RESULT] [SQL_BIG_RESULT] [SQL_BUFFER_RESULT]
 
[SQL_CACHE | SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]
 
select_expr [, select_expr] ...
 
[into_option]
 
[FROM table_references
 
[PARTITION partition_list]]
 
[WHERE where_condition]
 
[GROUP BY {col_name | expr | POSITION}
 
[ASC | DESC], ... [WITH ROLLUP]]
 
[HAVING where_condition]
 
[ORDER BY {col_name | expr | POSITION}
 
[ASC | DESC], ...]
 
[LIMIT {[offset,] ROW_COUNT | ROW_COUNT OFFSET offset}]
 
[PROCEDURE procedure_name(argument_list)]
 
[into_option]
 
[FOR UPDATE | LOCK IN SHARE MODE]
 
 
into_option: {
 
INTO OUTFILE 'file_name'
 
[CHARACTER SET charset_name]
 
export_options
 
| INTO DUMPFILE 'file_name'
 
| INTO var_name [, var_name] ...
 
}

Warning: Don’t actually try to read all of that. It’s not what it says that matters, it’s how much space it takes up. That huge block of impenetrable options is the documentation for the SQL SELECT statement. Just SELECT! On top of that are countless other commands you can use, each with their own complex syntax and list of options.

Each line above represents an option for obtaining, filtering, sorting, comparing, and presenting information. Trying to turn all of that into a proper API would be a nightmare. And documenting that API would be another, extra nightmare.

So rather than create this endless collection of functions to represent all the possible things you can do with SQL, the entire API just shrugs and allows you to send it blocks of text like a user. It’ll even return the same error codes if you make the same mistakes.

This is completely hilarious to me.

Perhaps an Analogy Will Help

Ugh. I'm glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.
Ugh. I'm glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.

Imagine this scene:

INT – Generic Tech Company Office – FRIDAY AFTERNOON

According to the clock on the wall, it’s late afternoon. The colorless, windowless office is mostly empty. A young coder sits at his desk, wearing a wrinkled shirt and a leftover 70s tie that’s hanging at half-mast. He’s propped up on one elbow and typing with one hand, searching for images of sunshine and beaches. He keeps slamming into the office’s automated content blocking system.

Suddenly, THE BOSS strides out of her office. She’s wearing an expensive suit with sunglasses perched on her forehead. She’s got a head of meticulously cared-for Karen hair and face incapable of smiling. She pulls the keys to her sports car out of her handbag and walks past Coder without a word.

Once she’s halfway to the door, she stops and turns.

BOSS:

Oh! I almost forgot… I have something I need you to do.

CODER:

(Stirs from his catatonic state.) What, me?

BOSS:

What was your name again?

(CODER opens his mouth to answer, but she dismisses him with a hand-wave.)

BOSS:

Whatever. That’s not important right now. We need you to make an interface for a self-driving car.

CODER:

(Looks at calendar.) Uh… really? How long do we have?

BOSS:

Until Monday, so you’ve got the whole weekend if you need it. (She swings down her sunglasses and pivots towards the door.)

CODER:

Wait! I mean… (flounders for a few seconds) What are the specs? What does the user need to be able to do?

BOSS:

(She sighs and reluctantly turns back to face him.) I think that should be obvious. They just need to be able to tell the car where to go. If they want to go to the spa, it’ll take them to the spa. Or the beach. Or home. Whatever.

CODER:

Yeah, but what level of granularity do they need? Do they need to be able to specify specific routes? Choose between scenic or fast routes? Do they need to control the speed? Climate control? The radio? The horn? Lights?

BOSS:

Full control.

CODER:

Full control? Like, what do you mean?

BOSS:

(Firmly.) They should be able to direct the car to do anything that they could do themselves.

She spins on her heel and heads for the exit. A blinding beam of golden sunlight enters the room as she shoves open the door, which gradually fades again as it eases closed. There’s a dull thud as it hits home, and now it somehow seems even darker in here than before she left.

Coder stares shell-shocked at his screen, motionless. The office is now silent aside from the ticking of the clock.

We hold on this angle for an EXCESSIVELY LONG TIME as the camera very gradually dollys in. Finally we stop just as his face fills the view. Finally, he smiles slightly and begins nodding his head.

SLAM CUT TO:

INT – Generic Tech Company Office – MONDAY MORNING

Boss is looking at the phone app that Coder created over the weekend. It’s the image of a driver’s view of a car interior, evidently drawn freehand in MS Paint. It shows the wheel, pedals, gear shift, console controls, etc. All the little controls are animated when you swipe / tap on them. This is connected to the car on a 1:1 basis, so you control the “self driving” car by manipulating this picture of the driver’s seat.

Furious, Boss storms out of her office and over to Coder’s desk. His chair is empty, and his PC is running a slideshow of beach selfies. There’s a sticky note on his monitor that says “FULL CONTROL”.

– Fin

I suppose I made the SQL API sound bad with this analogy. I actually quite like it. I just find it hilarious that the “interface” is simply direct access to the human controls. It does indeed give the programmer “full control”, but this isn’t how an API is normally designed.

But here it works. In fact, I can’t imagine how this system could be improved. Most of the APIs I’ve used have fallen into one of two broad categories:

  1. Low-level systems like rendering, sound, and device input. An API like this will tend to be sort of inscrutable in its design and very exacting about what you can do and how your input needs to be formatted. The API is generally a small number of highly technical functions. They’re built around a vanilla C paradigm where you need to share direct memory pointers with the software.
  2. High-level stuff like the user interface. The API has many, many functions to remember, but they’re all pretty obvious in their usage. These APIs tend to adhere to a very strict OOP design.

But SQL is something else entirely. It’s possible that there are a lot of APIs like this out there, but I’ve never run into them because of my quasi-gamedev background. The SQL query language is really complex and powerful, so the programmer is simply given the freedom to use that language to communicate with the database.

So now we need to pull some data out of our database, so the first thing we need to do is-

Am I Forgetting Something?

Hang On, I seem to have skipped a step here. We can’t take anything OUT of the database until we put something INTO the database. And we can’t put anything into the database until we scrape for the data.

I’m afraid I’ve led you astray in this entry. Next time we’re going to back up and harvest our data from the web. Hopefully we can do this without getting caught or banned by the sites we’re scraping. We’re definitely getting close to Grey Hat territory here, particularly if my bot is ill-behaved.

I promise the next entry will be less flagrantly disorganized.

 

Footnotes:

[1] And actually, I’m just trying to do something plausibly useful in C#.

[2] Actually, the catch is that I don’t get access to the low-level power of C++, but on a project like this that’s not even a disadvantage.



From The Archives:
 

83 thoughts on “Scraping Part 2: Full Control

  1. DeadlyDark says:

    SQL is fun. It’s an example of a declarative language, where you say what you want to receive and the program will oblige you. It’s nice to have that, instead of typing every single step of the execution yourself. Not as fun, though.

    What database did you use, btw? SQL Server?

    1. Shamus says:

      I used whatever comes with a WAMP server, which (looking now) is indeed SQL server.

      1. Viruzzo says:

        The “M” in WAMP stands for MySQL, and it’s a stack born from LAMP (i.e. same things, except on Linux), so SQL Server was never the choice.

        1. Shamus says:

          See, here we’re running into the limitations of my knowledge as a dB admin.

          When the OP asked if I was using SQL, I thought they were asking SQL vs. InnodB, which is the choice WAMP gives you at install. But now I see there’s a distinction between SQL Server and mySQL, which I thought was basically the same thing.

          So I’ll revise my answer to “I don’t know what I’m doing. I clicked on the thing and it all seems to work so I try not to touch it without a spectacularly good reason.”

          1. Lazlo says:

            And, interestingly, what it was asking wasn’t a question of “SQL or something else”, it was asking about the backend storage engine that the MySQL SQL server was going to use. The thing is, SQL is just the “structured query language” that abstracts you away from having to know or care how the database stores or accesses the data it’s keeping. SQL doesn’t have an API, because SQL *IS* the API.

            1. MrPyro says:

              Yeah, SQL vs InnoDB doesn’t make much sense to me as a question. One is the query language (which isn’t even an option in MySQL; I don’t think it speaks anything other than SQL) and the other the storage engine used under the hood.

              Might have been MyISAM vs InnoDB?

            2. Duffy says:

              In theory yes, but practically there’s subtle differences where queries don’t 100% translate very quickly once you get beyond some the basic CRUD style stuff. Theirs an incredibly common sorting query I use in SQL Server that needs to be completely rewritten in MySQL. Then you start getting into each ones specific functions or extra advanced operators they added. I love SQL Server’s MERGE, saves so much effort when I can use it.

              Then theirs the abomination that is Entity Framework….I kid…well mostly kid.

      2. Lazlo says:

        I… Don’t think it is? WAMP is an acronym(ish) for Windows, Apache, MySQL, PHP. So I suspect the SQL server you’re using is MySQL.

        Things are, of course, much more clear due to Microsoft cleverly naming their SQL server the memorable name of “SQL Server”

        1. DeadlyDark says:

          Yeah, gotta say, I should’ve specify it by writing MS SQL Server, but alas

  2. Inviscid says:

    I’ve always seen SQL more the other way around: the user has direct access to the API, which just happens to be in its own language (SQL) instead of whatever you are programming in. In my head this makes sense as a way to split the data from the program that is using it.

    Also, since this is a good excuse to bring up Mass Effect: some time ago I scraped (and converted to TeX) your Mass Effect retrospective to read on a long flight. I was surprised by how easy it was to scrape, interpret, and navigate the general layout of the site. The large number of special boxes and frames were fairly annoying to interpret though.

    1. Echo Tango says:

      +1 for SQL as an API. The people writing database-software can do whatever they want to organize their data in memory, the source-code, and on disk, but as long as they support SQL, you as the user of a running instance of their database, never need to deal with any of that. :)

    2. Scerro says:

      As a DBA, it’s an extremely nice clear line for me to draw. Need data management? I’ll give you a username/password, server name, name of DB to point at, and developers get to knock themselves out.

      Converting SQL databases to one another is a pain in the butt, but possible.

      Seems like using some offline mode Chrome browser functions would have been easier than scraping a DB, but if you got it done, fair enough.

  3. methermeneus says:

    “Bored Dev 2: Full Life Control”

    I’ve played around with the old C/C++ SQL libraries, and they do pretty much the same thing. There’s functions for connecting/disconnecting, and a few maintenance things and common requests, but you can ignore all of that, because there’s a function that just sends SQL a command string and returns the results as a string. This has been the way of dealing with SQL forever.

    Actually, it’s been the way of dealing with language interfaces forever. If you use Lua scripting in a game, for instance, you’re picking up the script as a string and just feeding it into a Lua interpreter. Same basic idea.

    1. Viruzzo says:

      The difference there is that Lua is using effectively as a DSL, the interface of which can be tailored for safety and expressiveness; with SQL you’re usually handing out the keys to whole castle.

      1. baud says:

        I’ve used at work a Java SQL library that does a lot more that just working with strings (setting parameters), which was developed (I think) to protect against SQL injection from user-inputted strings (cf xkcd Little Bobby Tables)

        1. Noumenon72 says:

          I’ve been waiting forever to get a chance to use jOOQ for Java. It reads your schema and makes an object for everything, so instead of strings you can use type-safe, auto-completing, parameterized Java code like
          Query query1 = create.select().from(AUTHOR).where(LAST_NAME.eq("Poe"));

          1. Innards says:

            I’ve jOOQ with Kotlin and it fucking rocks

      2. Kyte says:

        The only way you hand the keys to your kingdom with SQL is if your db admin is doing a very bad job at securing your stuff. Pretty much every SQL command can be locked down for a given user, not to mention tables and whatnot.

        1. Viruzzo says:

          Not true, actually: compromising SELECT is enough to extract most data from the database, with automated tools. From there you usually have a way into some user management system, and then who knows what.

  4. Monojono says:

    I think the standard way for c# to connect to SQL isn’t to use an API but an object relational mapping (ORM) like nhibernate or dapper where you map your objects to SQL. A good ORM will let you write a linq query and will automatically convert this to SQL behind the scenes without the programmer having to write any selects or anything.

    1. Retsam says:

      ORMs are fairly common, though I have no idea whether they’re more or less common than raw SQL. They’re somewhat infamous as being leaky abstractions: in part just because databases are complicated, and in part because they’re an object-oriented layer on top of a fundamentally not object-oriented database. It gives you an API that looks nice and simple, but has some pitfalls and performance “gotcha”s.

      It’s definitely not a great fit for what Shamus is doing, as the linked blog post mentions, most of the complexity of ORMs is built around making a bi-directional flow with the database: if he’s just dropping a bunch of data in the data base once and running queries on it, SQL is definitely the way to go, over an ORM.

      As for LINQ, that’s a C# specific feature – it’s essentially the syntax of an SQL query built into the language, and being used for more than just SQL queries. You can write a LINQ query to work with an ORM, you can write LINQ query to build a normal SQL query, you can write a LINQ query to search an array without a database being involved at all:


      var numbers = new List() { 5, 4, 1, 3, 9, 8, 6, 7, 2, 0 };

      var query =
          from num in numbers
          where num 7
          orderby num ascending
          select num;

      // Execute the query.
      foreach (int i in query) {
          Console.WriteLine("Got " i);

      LINQ seems pretty neat, there’s a lot of power in the SQL style queries.

      It’s kind of hilarious in Shamus’s metaphor, this would be like using the picture of a car to control things that aren’t cars. Like using the picture of a car as a television remote: turn the key to turn it on, use the turn signals to change channels, press the gas pedal to increase the volume, etc.

      I’d probably use LINQ for basic SQL queries just because it’d guard against stupid errors. db.query("SELCET * from table") is going to blow up because of a stupid typo in the query (“SELCET”, not “SELECT”), but it’d be a compiler error with LINQ.

      1. Retsam says:

        Whoops, fixed some of the above formatting, but forgot that > and < get eaten in the comment syntax unless escaped. where num 7 was supposed to be where num < 3 || num > 7.

      2. RFS-81 says:

        Huh. I didn’t know we could use the pre tag in comments!

        EDIT: Looks like pre doesn’t work for me. How did you format the code block?

        EDIT2: Is it the code tag?

        1. Retsam says:

          Yeah, it’s the <code> tag, you have to use html entity nonbreaking spaces (“&nbsp;”) for indentation.

      3. Jamie Pate says:

        This could also be written as a bunch of chained methods which is the API that Shamus mentions doesn’t exist ;)

        `numbers.Where(num => num == 7).orderBy(num => num).Select(num => num)`

        (Select is less useless when each item is actually an object (‘row’))

  5. ydant says:

    Not sure how far your research has taken you, but there are (of course) APIs that build the SQL for you. For simple queries these can help a lot. For complex queries, these can be better or worse.

    I’ve never used it (I haven’t used C# in around a decade), but I see LINQ talked about fairly often:

    https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/

    More often than not I end up just using a block of SQL text – it has the advantage of being broadly understandable, without requiring specific framework knowledge.

    Of course, “SQL” doesn’t really exist – every database does its own thing. Luckily they are mostly all pretty close, meaning you can usually just write SQL for any database, and the differences and failures will be rare, but subtle and catastrophic.

    For this, if you can, I’d suggest just using SQLite for your database engine. It works on a file, and doesn’t need anything running. No software to set up, and you can distribute your database (if you want to) as a single flat file. It’s very powerful, and since you’re not supporting multiple users and concurrent updates/queries, it will likely perform perfectly fine for your needs. Once you learn about SQLite and look around, you’ll realize it’s *everywhere*. It’s an amazing piece of open-source (public domain, actually) tech.

    1. Viruzzo says:

      Seconded on SQLite. It should actually perform a lot better, because it doesn’t bother with a lot of the transactional features that full RDBMS have to.

    2. baud says:

      “SQL” doesn’t really exist – every database does its own thing. Luckily they are mostly all pretty close, meaning you can usually just write SQL for any database, and the differences and failures will be rare

      In my job, we’re working on a software that supports both Microsoft SQL Server and Oracle DB. All the SQL code is duplicated, with one version for each database, since there are enough syntax differences that it’s simpler that way.

  6. ElementalAlchemist says:

    Ugh. I’m glad I never had to work in a place like this. The GERMS, the noise, the lack of privacy, and limited personal space would make me feel like cattle.

    So exactly like the 12 years of schooling we subject larval humans to then.

    1. Shamus says:

      Yup.

      And I had the same reaction to school.

    2. Echo Tango says:

      I worked in a call-center like that once; I lasted two weeks.

    3. DeadlyDark says:

      On the other hand, with some deceases, it’s better to expose them to children early enough to boost immunity, than to wait for adulthood for this to happen (iirc, chickenpox is like that). So, one could make a pragmatic argument for the current school system on these grounds

      1. Echo Tango says:

        Pretty much most diseases are better with vaccines than with a real infection. Even the common example, chickenpox, can leave some people with shingles in middle-age. People are less likely to get shingles from the vaccine.

        Even without dubious tradeoffs for germs, schooling provides other benefits. Socialization, skills, knowledge, a place to put your kids all day that’s not a babysitter…

        1. Abnaxis says:

          It’s that last one that matters to me. Coronavirus had actually resulted in my wife and I getting slammed with more work than fits in a 40 hour week, on top of having to care for our 11 month old larval human because we can’t get a babysitter.

          It’s bad enough that I have to feed and change him, I also have to give him human interaction almost constantly! I tried writing a script to automate this crap, but he’s scared of the Roomba so it didn’t pan out…

  7. Viruzzo says:

    Bulding an API that allows direct queries (or similar low-level access) is terrible for interoperability, which is the whole point of having an API in the first place.
    As other have said, most decent API systems use an ORM underneath which does the “heavy lifting” of handling the database side of things; this marries well with a REST API and an OOP design, and even more modern alternatives like GraphQL.

    1. Echo Tango says:

      The syntax of SQL is an API, and does allow for interoperability. You can structure your new-fangled database software, and the files that store the database however you want, as long as it supports the SQL syntax for queries. What do you want instead – custom APIs for every database, that are all totally different? It’s bad enough in our real world, where we’ve had this standard for a long time; It would be worse if there was never this standard in the first place.

      1. Viruzzo says:

        Why would anyone want “custom API for every database”? The API should be for your application, not the database, that’s the point.

    2. Kyte says:

      SQL is the API, it just happens that it’s an API that works through a DSL designed to make queries, without actually saying anything about how those queries have to be executed or how the underlying data has to be organized. Concepts like select, join, where and order by are universal regardless of whether you’re working with tables and rows or a freeform object graph or anything else.

      This is why LINQ is so versatile, it’s nothing more than integrating queries into the programming language (hence LINQ: Language INtegrated Query). You can use it to query databases (in which it directly translates to SQL), but you can also use it to query lists in memory or even to explore a document such as an HTML page.

  8. Kyrillos says:

    I’m fairly certain the standard way to query a database in C# is actually through LINQ, which is a much more agnostic way to parsing data.

  9. Echo Tango says:

    Warning: Don’t actually try to read all of that.

    Uhh…I skimmed / read most of it; I think it’s basically in the format of a regex, or…formal grammer? I think that’s the term for when you’re specifying a language, and not just trying to grab bits of text. :)

  10. Misamoto says:

    Isn’t SQL’s case the reverse? It’s an API that the user has access to, not user interface that you can also program to

    1. Richard says:

      Yup, this.

      Although it’s since become an API that you should never, ever actually use directly because of the Robert’); DROP TABLE *;– problem.

      There are of course a variety of excellent ways to wrap it that ensure things like this very post don’t actually do that.
      – LINQ and prepared statements being two of the more common methods.

  11. Lino says:

    Typolice:

    since we’re working in C# sharp

    One “sharp” too many. Not to be confused with this “sharp”, which had the perfect amount of “sharp” in it. Although I found the “e” a bit excessive…

    1. Retsam says:

      Amusingly, C# sharp, would be C##, which, in music theory, is the same thing as D (sometimes music theorists decide to use double sharps, because reasons), which is also a moderately popular programming language.

      1. Erik says:

        Unless you’re using a 22-tone microtonal system, in which case C## is the note between C# and D, also known as D half-flat. :) (h/t Adam Neely for the useless knowledge.)

    2. tmtvl says:

      But that’s not a sharp, that’s a number sign. &#266F; is a sharp (lessee if WP handles Unicode).

      EDIT: it doesn’t, so here is a sharp.

      1. Echo Tango says:

        Pedantry like this is why homograph attacks work.

        1. tmtvl says:

          Which is why you use whitelist-based SSL verification when surfing, and reproducible builds for installing software.

  12. Retsam says:

    A great way to get a feel for the complexity of SQL is to look at this page, which shows the syntax for (the SQLite flavor of) SQL in the form of “railroad diagrams”. So the linked “SELECT statement” diagram illustrates the command as the long grammar Shamus included.

    And each rectangle in that diagram is a placeholder where you need to go refer to one of the other diagrams to get the full-picture.

    1. Echo Tango says:

      Those diagrams would have fit nicer onto a page (without wrapping around), if they’d gone top to bottom for the stages, and left to right for the options at each stage, instead of the other way around. ^^;

  13. Gresman says:

    Now we all wait for the blogpost titled “The advantages of document based databases and why splunk syntax is really not great”. :)

    Shamus, I assume you do not have any experience with document based databases or any kind of data warehousing?

    In my experience these things are quite interesting in their own ways. You might want to just take a look at it. Maybe not using it but looking at it.

  14. Paul Spooner says:

    I think the most complex SQL I’ve interacted with was probably fairly simple by database standards. I was writing a non-linear (or, perhaps poly-linear) webcomic browsing page (essentially a database front-end) which interacted with an implicit linked node hierarchy to browse to sibling nodes. I think it was something like:
    SELECT * FROM comics AS q1 INNER JOIN comic_links AS q2 ON q1.comic_id = q2.child_id WHERE q2.comic_id = curid AND q2.child_offset < curoffset ORDER BY q2 child_offset DESC LIMIT 1
    And then reversing the offset comparator for the other one. And even for something as simple as that I needed to get help from a couple friends to get it working correctly.

  15. Drathnoxis says:

    Yahtzee called you out in his most recent Dev Diary video:

    “…maybe Shamus Young should make 12 games in 12 months if he’s so clever.”

    Now, I don’t think you have any choice but to make your own rival series and show him up!

    1. Gresman says:

      Did Shamus ever state that this would be a trivial task or that he would be able to do such a thing?

      1. Gresman says:

        I watched it and now I get it.
        I am tired and stupid that I did not read Yahtzee’s sarcasm on the first try.

    2. Paul Spooner says:

      Thanks for the heads up. Link with timestamp for the lazy:
      https://youtu.be/SBMxM36-RQ0?t=420

  16. Echo Tango says:

    But now we have these newfangled web scrapers that can surf the web for you and harvest whatever data you like. […] Great, now you have an enormous text file of random facts. […] So what do we do? Write another program to read that file? You need to turn this text into data sooner or later

    As far as I know, reading your data just-in-time like this is becoming a more popular programming paradigm. If you mess up the initial text-to-data conversion before your data-to-facts analysis, you can come back to the original text with a fixed converter. Assuming your processing costs are feasibly small[1], you can just set up your whole pipeline to run straight from your raw HTML/text, to the extracted data-points, to the comparisons between games, to the final output in whatever file[2] you want to read, without needing to explicitly save[3] the data in between each step.

    [1] For converting single pages of HTML/text into video-game numbers, this should be very do-able on the fly.
    [2] Or a web-page showing your results. Whatever.
    [3] This is probably best handled by whatever framework / pipeline / whatever you’re using, with you just specifying what stages need their data cached.

  17. sheer_falacy says:

    I’m sorry, your hypothetical story about the coder and the boss was just too unbelievable and it utterly ruined by suspension of disbelief.

    A tie? Really?

    1. Shamus says:

      It’s not that well-known these days, but the programmer dress code[1] actually has a loophole that allows you to wear a tie, provided:

      1) The tie is unfashionable / out of date.
      2) It is visibly tied improperly, or worn so loose that the knot is below the collarbone.

      There are also some provisions for clip-on ties that come off at inopportune moments, but almost nobody ever bothered with those.

      [1] It’s more of an anti-dress code, I suppose.

      1. King Marth says:

        It is a syntax error to write FORTRAN without wearing a blue tie. [1]

        [1] A Brief, Incomplete, and Mostly Wrong History of Programming Languages

        1. Nimrandir says:

          Huh. All the Fortran programmers I’ve known are mathematicians, who are doggedly anti-tie. If there had been a language with a neckwear requirement, I’d have guessed COBOL.

          1. evileeyore says:

            Sir, I’ll have you know that The Lords of COBOL all wear ceremonial suits and full headdress.

            1. Richard says:

              By your command

    2. Echo Tango says:

      There’s still programming jobs out there with strict dress codes.

      1. pseudonym says:

        Straitjackets! The best way to prevent feature creep and scope creep.

        1. danielfogli says:

          And creeps in general ;)

  18. Dev Null says:

    I spent a lot of years building interfacing to databases with code for a living, one way or another. There are a lot of attempts at APIs out there, that do stupid things like try to parameterize all the pieces of a select statement into a function. They work, are harder to read than the original select that you inevitably have to write first in order to figure out what to put in the function, and every time I’ve used one it has been slower and less efficient than just letting the database engine do the job it was built to do in the first place. Sometimes slower by orders of magnitude.

    Queue someone singing the praises of NoSQL.

  19. Decius says:

    Suppose there was an API for SQL other than the human interface.

    What advantage would it offer to someone who didn’t know how to use the human interface? Could it be somehow easier to learn how to interface with the API than with the interface that is almost human-readable?

    Could the API somehow be more powerful, allowing options not possible by typing commands? That would be a weakness in the typed commands.

    The only advantages I can see is that the internal representation of the inter-module interface would consist of a few fewer bytes, which would make SQL injection attacks orders of magnitude harder to perform and prevent, because instead of ; drop table ‘comments’ or 1==true in the payload, the payload would have to be a byte string that does not render as meaningful text.

    Oh, except that there would very soon be a module that converted human-readable SQL commands into the format that the API needs, because SQL is very well-designed for human interfaces. It would likely be literally the same as how SQL interprets commands. Which SQL already has and does.

    Video card drivers can’t take ‘SHOW CANARY’ as an input, they have to have verticexen and textural generation and stuff, so their APIs are super low-level. Web interfaces can’t start with a drop down that has ‘replytotweet’ and ‘thumbscribe up’ (or even the actual web API names), they have to let humans interact in a way they understand and translate it to the next layer the way the next layer can understand.

    SQL is very unique in that the users can already ask it for what they want in a way that it can understand.

    SQL is powerful enough to be dangerous, thus powerful enough to be useful: “Select ‘itemname’, ‘price’ where ‘itemid’ is ‘”$A1″‘” compares poorly to “Select ‘itemid’, ‘itemname’, ‘price’ where ‘itemid’ IN ‘”$A:$A”‘ order by ‘itemid'”

    1. Paulo Marques says:

      The advantage of a native language API is getting the syntax and data types checked at compile time instead of whenever the query runs, if it runs. You get nifty LINQ thing where you just call a bunch of methods whatever the source is and know you’ll get your results. And it could presumably skip the text translation step, assuming any RDBMS system still provided such a library (as opposed to having multi-RDBMS libraries translating to SQL), although that’s not much of a bottleneck today. The disadvantages is that it starts getting messy the more you filter and aggregate data, especially when supporting different Database Systems, and that it becomes harder to figure out what it’s doing if you need to improve performance.
      But they are still useful, since more often than not you just want to transform some lines on the Database to a list of objects in your language and ORMs allows you to quickly and safely do just that. Or to do an insert/update/upsert while you just care about enforcing whatever constraints you want to have (for example, don’t save the gender of people under 13). It’s just that when you want something a little complicated, you should consider just writing the query.

      1. Echo Tango says:

        You can have type-checking in normal database queries. Google’s interpreter for BigQuery[1] gives you errors if you try to pass a timestamp into a date-time function, or if you try to compare a boolean to a string. It knows the schema of the tables you’re querying too, not just your temp-cariables from your queries. The off-line program that some teams down the hall use[2] also gives full syntax and type-checking. SQL queries aren’t just “plain” text, that’s what poetry and novels are!

        [1] It’s close enough to SQL for all the times use-cases I’ve worked with it. I think it just differs in some of the more obscure helper-functions.
        [2] I can’t remember the name. It’s specifically for reading and writing to databases.

  20. avenger337 says:

    I am not a database person either, but I’m about 98.271% positive that &nbsp; isn’t part of the SQL language :P

    Also, damn, that’s hard to type without it getting interpreted by your web browser. I had to type &amp;nbsp; to type that… and then to type that I had to type &amp;amp;nbsp;, and then to type that… well yea, I’m done now.

  21. kikito says:

    I don’t exactly know where you are going with this, but I will leave this advice here:

    Don’t try to parse HTML using regular expressions.

    Thousands of programmer hours have sacrificed themselves in order to get this information to you.

    1. Echo Tango says:

      Nah, he covered that in the previous entry; He ended up using a common library for parsing the pages. I still don’t know if he used XPaths to find some of the tags[1], or if his paths are all totally hard-coded[2]. :)

      [1] The X in XPath means XML, but I think all the HTML parsing libraries allow you to use them too. HTML is very similar to XML…by design? I forget; I think they’re both children or implementations of some other language…thing.
      [2] I mean, an XPath would be hard-coded too, but it’s like a regex, in that you only need to specify the really important bits. I mean, you could also have some kind of self-adjusting algorithm, where you visually mark a web-page to annotate the different things you care about like game-score, game-title, year, etc, and then have the computer figure out the xpaths. But that would be difficult to get right, probably still brittle, and the output xpaths would probably be incomprehensible…. ^^;

      1. tmtvl says:

        Well you used to have XHTML, which was a thing. Also, I think xslt may be more workable for HTML parsing than XPath, but it depends on the available tooling.

  22. Phil says:

    The SQL query language

    Grrrrr. :)

    1. Richard says:

      At least it’s a well-structured SQL query language ;)

    2. Decius says:

      I run it on the CPU processor of my PC computer.

  23. Taxi says:

    Shamus going grey hat here, even using the Karen meme. The quarantine brings out interesting qualities in people.

  24. Gordon says:

    Everyone who can code knows how to write Python, like normal you start with the pseudo code of what you want to do… congratulations you’ve written Python.

  25. Zak McKracken says:

    Not quite sure what you did with that data, or are (were by now…) intending to do with it, but I’m not sure whether SQL is really needed for just 4000 records.

    There’s probably some data science/analysing person who would shout something about Pandas — but me, I’d have just stuck it in a structured numpy array, or a lowly dictionary, or maybe if I’m feeling like doing something fancy, a Numpy Recarray.
    Or, via a structured array, using np.savetxt(), directly into a CSV file, for processing in a spreadsheet application. That’s in Python of course, but I’m asking myself if C# offers some vaguely similar data structures.

    All that said, it’s perfectly appropriate to use this kind of analysis as an excuse to learn to deal with SQL in C#.

Thanks for joining the discussion. Be nice, don't post angry, and enjoy yourself. This is supposed to be fun. Your email address will not be published. Required fields are marked*

You can enclose spoilers in <strike> tags like so:
<strike>Darth Vader is Luke's father!</strike>

You can make things italics like this:
Can you imagine having Darth Vader as your <i>father</i>?

You can make things bold like this:
I'm <b>very</b> glad Darth Vader isn't my father.

You can make links like this:
I'm reading about <a href="http://en.wikipedia.org/wiki/Darth_Vader">Darth Vader</a> on Wikipedia!

You can quote someone like this:
Darth Vader said <blockquote>Luke, I am your father.</blockquote>

Leave a Reply to Kyte Cancel reply

Your email address will not be published.