My bot is now downloading pages from Metacritic, one at a time, at the rate of a page every couple of seconds. (Well, not RIGHT now. This series was written after the bot was completed.) This would be painfully slow if we were trying to read something large-scale, but right now we’re just scraping for PC games that scored above 30 over the last 19 years. That’s well under 1,000 games.
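For the curious, here is a rough sketch of that one-page-at-a-time pacing. It is written in Python purely for illustration, not in whatever the bot itself uses, and every name in it (`crawl_slowly`, `fetch`, `delay_seconds`, `sleep`) is hypothetical. The `fetch` and `sleep` parameters are assumptions added so the loop can be tried without touching the network.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_slowly(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests.

    `fetch` is a hypothetical downloader supplied by the caller; it takes
    a URL and returns the page body as text.
    """
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

At two seconds per page, even a full 1,000 pages would take roughly 2,000 seconds, which is a little over half an hour. Slow, but not a problem for a job this small.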
Of course, downloading these pages isn’t useful unless I can pull information out of them. Much earlier in this series, I mentioned I’m using the Html Agility Pack. This library can parse HTML for me and return the bits I’m interested in.
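To give a feel for what "parse the HTML and return the interesting bits" means, here is a minimal sketch. It does not use the Html Agility Pack. It uses Python's standard `html.parser` module as a stand-in, and the markup in `SAMPLE` is invented for the example rather than copied from Metacritic. The class and function names are hypothetical too.

```python
from html.parser import HTMLParser
from typing import List

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, wanted_class: str) -> None:
        super().__init__()
        self.wanted_class = wanted_class
        self.results: List[str] = []
        self._depth = 0  # nesting depth inside a matching element
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            # Already inside a match: just track nesting.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # The matching element just closed: keep its text.
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract(html: str, wanted_class: str) -> List[str]:
    """Return the text of each element whose class list includes wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented markup, standing in for a downloaded page.
SAMPLE = """
<div class="game">
  <h3 class="title">Example Game</h3>
  <span class="score positive">85</span>
</div>
"""
```

Calling `extract(SAMPLE, "title")` and `extract(SAMPLE, "score")` pulls out the title and the score as strings. A real parsing library does far more than this, including coping with the badly broken HTML found in the wild.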
One of the funny things about this project is that I’m so far out of my comfort zone / area of expertise that I don’t even know what I don’t know. Not only am I likely making lots of hilarious blunders, but I don’t even know that I’m making them.
This is strangely liberating. When I know what I’m doing, then every cut corner makes me feel vaguely guilty. But when you don’t know what you’re doing, you’re free of the obligations to do things the Right Way(tm) because you don’t know what the right way is! As far as I know, I’ve just written the best web scraper in the history of scraping. (Despite the lack of proof, I’m fairly confident that I have not actually written the best web scraper in the history of scraping.)
T w e n t y S i d e d
