How Many Words?

By Shamus on May 7, 2017. Filed under: Random

I have been doing this site for a dozen years, but the question didn’t occur to me until now. I noticed the three-year anniversary of my Patreon campaign was coming up and I was looking for a way to quantify my overall output. The question is:

How many words do I write in a year?

Of course, this number will go up and down from year to year. Some years my big project is a comic that will naturally be more image than words. Other years I end up posting most of my words on the Escapist. Sometimes I’ll focus on video content and sometimes I’ll lose my mind and write over a hundred thousand words about one videogame franchise.

But still. Even if I don’t have a convenient way to measure stuff I’ve done for other sites, we ought to be able to get some sort of handle on how many words I write on this site, right? I mean, I’ve got the database right here. (You can’t see it, but I’m holding up the database and gesturing with it right now.) That should have all the information we need.

I suppose the first step is to filter out the stuff not written by me. To date, 5,025 posts have been published on this site. (This includes posts you haven’t read yet, like the future entries in my Arkham City and Zenimax vs. Facebook series.) 321 of them have been written by other people, and the remaining 4,704 posts were written by me. So all we need to do is get a word count on those posts and we’ll have what we need, right?

Well…


Not all word counts are created equal. In fact, as far as I can tell none of them are. If we look at the text[1] of the very first entry of my Final Fantasy X series, we’ll see that WordPress reports it as being 1,930 words long. If I take that exact same text and paste it into Google Docs[2] it gives me a word count of 2,149. That is not a small difference! If I paste the same text into this word counter it tells me the text is 2,085 words. And if you copy & paste that post somewhere else for a count, you’ll probably get another answer entirely.

I’m assuming the difference comes down to HTML markup. For example, in the paragraph above I’ve got the sentence:

In fact, as far as I can tell <em>none of them are</em>.

One word counter probably counts special characters like < and / as word breaks, and another only counts whitespace. So one will see “<em>none” as the word “em” followed by “none”, and the other will see it as one big word. From experimenting, it looks like the WordPress counter is actually smart enough to pull out HTML, so “em” won’t get counted at all. Which means the WordPress count is probably the number we’re interested in.
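To make the guess above concrete, here is a minimal Python sketch of three counting strategies. It is not how WordPress, Google Docs, or any particular word counter actually works; the function names and the second sample string (including its `page.html` link) are made up for illustration.

```python
import re

def count_whitespace(text: str) -> int:
    """Split on whitespace only, so a tag glued to a word counts as part of it."""
    return len(text.split())

def count_punctuation_breaks(text: str) -> int:
    """Treat every character that is not a letter, digit, or apostrophe as a break."""
    return len(re.findall(r"[A-Za-z0-9']+", text))

def count_without_tags(text: str) -> int:
    """Drop anything that looks like an HTML tag, then count word-like tokens."""
    stripped = re.sub(r"<[^>]+>", " ", text)
    return len(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", stripped))

# The sentence quoted above, plus a hypothetical sentence containing a link.
sample = "In fact, as far as I can tell <em>none of them are</em>."
linked = 'See <a href="page.html">this word counter</a> now.'

for text in (sample, linked):
    print(count_whitespace(text),
          count_punctuation_breaks(text),
          count_without_tags(text))
```

On the quoted sentence the three counters report 12, 14, and 12 words. On the sentence with a link they report 6, 10, and 5, which is the kind of spread described above.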

This is still not perfect. I think the WordPress counter gets confused by the shortcode markup I use for images, footnotes[3], and YouTube embeds. But this is basically close enough for our purposes.

The more serious problem is that the word count isn’t stored in the database. If I want to know the word count of a post I have to open up the post in the editor and look at it. Not to sound lazy, but I don’t actually want to spend two full work days opening up 4,704 individual posts in the WordPress editor. The editor is not snappy and it does not open quickly. It takes several seconds to open a post[4] and there’s no good way to navigate between published posts chronologically.

About the only thing we have to work with on the database side is a brute-force character count of the text. That’s ugly. I think the best I can do is look at the character counts and compare them to the displayed word count. That should give me a ballpark “characters per word” that I can use to derive the numbers we need.

I gather up the last dozen posts and look up their word counts. I put them in a Google spreadsheet along with their character counts, and it tells me I average about 6.64 characters per word. So if the database tells me a post is 10,000 characters long, it means it was in the ballpark of 1,506 words.
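The arithmetic here is simple enough to sketch in a few lines of Python. The function names and the two (characters, words) pairs are placeholders, not the real spreadsheet data. The sketch also assumes the ratio is pooled (total characters over total words) rather than an average of per-post ratios.

```python
def chars_per_word(samples):
    """Pooled ratio: total characters divided by total words."""
    total_chars = sum(chars for chars, _ in samples)
    total_words = sum(words for _, words in samples)
    return total_chars / total_words

def estimate_words(char_count, ratio):
    """Turn a raw character count into a ballpark word count."""
    return round(char_count / ratio)

# Placeholder (characters, words) pairs, chosen so the ratio comes out at 6.64.
samples = [(6640, 1000), (3320, 500)]
ratio = chars_per_word(samples)

print(ratio)
print(estimate_words(10_000, 6.64))  # 1506, matching the figure above
```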

That sounds high. The average in standard English text is 5.1 characters per word. While I’d love to claim that I’ve got one o’ them fancy vocabularies that lets me use a lot of fifty cent words, I think this inflation of word length is more appropriately blamed on all the HTML and shortcode. Still, maybe the last couple of weeks have been atypical? Just for the sake of completeness, I do the same experiment again using every single post I wrote in June of last year. During that time I wrote 37 posts. I open each one in the editor to get the official HTML-free word count. Added together they come to 34,337 words. If I divide the number of characters by this number I get 6.63.

Wow. That’s amazingly consistent. I think I can proceed feeling pretty confident that I wrote a word for every 6.6 characters in a post. In any case, we finally have what we need to answer this stupid question: How many words do I write per year?

All I need to do is get the numbers out of the database. I’ll admit I’m pretty rubbish at SQL. I’m one of those people who knows juuust enough to be dangerous. I never have the guts to perform changes to the database via MySQL. My interactions are strictly read-only. Here is what I come up with:

SELECT SUM(CHAR_LENGTH(post_content)) FROM wp_posts WHERE post_author='1' AND 
post_status='publish' AND post_date >= '2017-01-01' AND post_date < '2018-01-01' LIMIT 10000;

(The “limit 10000” is because I’m feeding these queries into the database via phpMyAdmin, and if you don’t specify a limit it defaults to 10.)

If I do one of these for every year since the site’s inception, it should give me the character counts. Note that this count is based on the year starting Jan 1st, and not in September when the site was launched. This means the first “year” is just a few months long.
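For anyone who would rather not run one query per year, the same totals can be collected in a single grouped query. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with a tiny in-memory stand-in for wp_posts, not the real MySQL database, so it uses SQLite's LENGTH and strftime where MySQL would use CHAR_LENGTH and YEAR. The rows are invented filler text.

```python
import sqlite3

CHARS_PER_WORD = 6.6  # the ratio derived earlier in the post

# In-memory stand-in for the real table; every row here is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wp_posts "
    "(post_author TEXT, post_status TEXT, post_date TEXT, post_content TEXT)"
)
rows = [
    ("1", "publish", "2016-03-01 08:00:00", "x" * 660),
    ("1", "publish", "2016-09-15 08:00:00", "x" * 1320),
    ("1", "publish", "2017-01-10 08:00:00", "x" * 3300),
    ("1", "draft",   "2017-02-01 08:00:00", "x" * 9999),  # unpublished: skipped
    ("2", "publish", "2017-02-02 08:00:00", "x" * 9999),  # other author: skipped
]
conn.executemany("INSERT INTO wp_posts VALUES (?, ?, ?, ?)", rows)

# One pass over the table, grouped by the year part of post_date.
query = """
    SELECT strftime('%Y', post_date) AS year,
           SUM(LENGTH(post_content)) AS chars
    FROM wp_posts
    WHERE post_author = '1' AND post_status = 'publish'
    GROUP BY year
    ORDER BY year
"""
estimates = {year: round(chars / CHARS_PER_WORD)
             for year, chars in conn.execute(query)}
print(estimates)
```

Because the grouping happens inside the database, the partial first year still falls out as its own row without any special handling.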

Taking those results and assuming 6.6 characters = 1 word, I get:

Yes this should be counting image mouseover text like the stuff you’re reading now.

That dip in 2010 is when I was making content three times a week for The Escapist. Plus I still had a day job. The Patreon campaign began in 2014, and that’s when I really started treating the site like a full-time job. I wasn’t happy with my output at the end of year one, mostly due to problems in my personal life and some projects that hit a dead end without ever turning into blog posts. But I’m pretty happy with my output since then.

To put these numbers in context, your typical young adult novel is somewhere in the 50k to 90k words ballpark. I think the first Harry Potter book is probably around 75k or so. Hefty adult books are maybe double that. The Two Towers clocks in at 156k words. Which means last year I wrote 4.5 Harry Potters worth of content, or a little more than The Two Towers + Return of the King.

Well, it’s a bit more complicated than that. This is actually a count of what I posted and not what I wrote. For instance, last year I did re-posts of old Escapist content. But I also did some non-trivial edits to those things. I don’t want to haggle over where we draw the line between “writing”, “re-writing”, and “editing”, and so let’s just ignore this while I make dismissive hand-wavey motions.

I’m kind of surprised by my output in 2006. That’s a lot of words. On the other hand, I think those are pretty low-quality words. They’re mostly random dashed-off thoughts. They’re barely proofed, there aren’t any images, and almost no links. The stuff I’m writing now is more analysis. It’s researched, proofread, and annotated. It’s got lots of screenshots with captions. The transformation probably began when DMotR took off and I became aware I was writing for thousands and not just a small group of friends.

While we’re at it, let’s look at how many posts I’ve put up every year:

Posts per year.

Again, that is not at all what I would have expected. I don’t remember being nearly that busy in 2006. But I guess I can’t argue with the data. I do remember hearing the advice, “You should make sure to post once every day!” and taking it to heart. Like most blogging advice, this is misleading. It’s true that the most popular blogs have regular content. In the same way, many successful men wear suits every day. But wearing a suit every day will not make me successful. The actual advice you’re looking for is, “Write stuff that other people want to read.” But that’s sort of obvious and nobody knows how to teach other people to do it. So instead we get shallow advice like, “Post every day” and “Check your SEO performance”, because those are things you can quantify.

At any rate, the “Post every day” mindset resulted in me posting a lot of ephemeral dross in the early days of the site. That declining red bar graph is probably a good indicator of an overall rise in quality.

Since I’ve already got the data in a spreadsheet, I might as well look at how long-winded I’m becoming. Here is the average number of words per post:

Number of words per post.

Poor 2010. I guess I was just posting Spoiler Warning videos and links to my Escapist content.

Well, I don’t know if that was interesting, but it was a fun little project.

And because I know you’ll be curious at this point: According to WordPress, this post is 1,705 words long.


Footnotes:

[1] I’m talking about the raw text I see in the editor, which – due to markup – is different from the text you read on the site.

[2] I write most of my long-form stuff in Google Docs. The editor is more comfortable, the spelling and grammar checking is more robust, and I can’t accidentally publish a half-finished article when trying to save my work.

[3] Like this one.

[4] You can see why I prefer to write in Google Docs!


62 comments? This post wasn't even all that interesting.


  1. Jokerman says:

    Shamus… talking of words, where did “You ok, buddy?” come from (yes, i know that was shitty attempt to make this on topic) is it Arkham Asylum? Currently playing through the game, and after knocking one guy out, another said it in the middle of the brawl…

  2. CliveHowlitzer says:

    I appreciate all of your many words. Nowadays, most things are in videos and it is increasingly rare to find words, especially a lot of words.

    Continue to be overly verbose, good sir.

  3. Pete_Volmen says:

    I assume many here already know this video, but This (short) Tom Scott video explains many of the differences between word count beyond stuff like HTML tags. What counts as a word can be tricky.

  4. 4th Dimension says:

    BTW, for future reference Shamus, you didn’t have to do each year with a different query (also those should have returned only a single row, so no need for limit), in SQL you can aggregate/group rows using GROUP BY word. So you could have used:

    SELECT YEAR(post_date) AS Year, SUM(CHAR_LENGTH(post_content))/6.6 AS Word_count
    FROM wp_posts
    WHERE post_author='1' AND
    post_status='publish'
    GROUP BY YEAR(post_date)
    LIMIT 10000;

  5. MichaelGC says:

    I think it should be “o'” rather than “‘o” for when you say “o’,” there. Well, strictly you say “‘o” when I say “you say “o'”,” of course, but I didn’t want to write “‘o” whilst also suggesting it should be “o'” when you say “‘o.” I’m helping!

  6. David W says:

    I suspect the important bit of ‘post something every day’ is ‘get into a habit where you write a lot and get feedback’. Practice helps so much with quality!

    • Phill says:

      Pretty much. Regular new content is what keeps people checking in on a site daily, and if there are too many days with nothing new, they’ll drop it from their daily round (not everyone’s browsing routine is the same of course).

      But aiming to post every day trains you into the habit of doing it, and probably also helps train the important skill of finding stuff worth posting about so you don’t just run out of content in two weeks when your original bout of things to say has run dry.

      • Daimbert says:

        Yeah, on my blog when I was posting every day I got a significant increase in hits, and was able to get into a routine of writing posts and ensuring that I had enough. However, that wasn’t sustainable, and my hits dropped when I went down to 3 posts a week, and I noticed that if I ever stopped my regular posting it was really hard to get going again. So posting every day gets people to check in more often, read more posts one shot, and also gets you in the habit of ensuring that you are regularly posting, and ensuring that you have things to crank out relatively quickly if you end up being busy (I tended to use “Philosophy and Pop Culture” posts for that).

    • TMC_Sherpa says:

      While I don’t remember all the events that happened in 2006 or any of them if I’m being honest, I am reasonably certain it was shorter than 600 days

  7. Tuck says:

    A lot of that stuff in 2006 is the D&D campaign? Those posts are pretty long on average, and there’s quite a lot of them. Incidentally, I started running some friends through that campaign, with a few small modifications (e.g. using Norse gods, and with a different background since they didn’t have a previous campaign to build off), and we were having a lot of fun until the group couldn’t get the timing right to keep going!

  8. Bunkerfox says:

    Little short on the total word count in 2017 there buddy. You’re gonna have to step it up a gear

  9. MichaelG says:

    “Well, it’s a big more complicated than that” “bit”, not “big.”

    Thanks for the words!

  10. Duoae says:

    Interesting to do this little bit of self analysis. I’ve always been jealous of your long form essays and I’m guessing practice makes perfect. Are you finding it easier to structure and plan those sorts of content?

  11. NoneCallMeTim says:

    Not only that, but there are all the comments on the posts which may not be posts, but add value.

  12. Henson says:

    Well that’s interesting. You used ’em’ tags for slanting letters rather than ‘i’ tags. Is this inherent to WordPress, or your own preference?

    I wonder if I can also use the ’em’ tags

    EDIT: Yup.

    • Shamus says:

      If I’m typing them manually, I use /i by habit, but if you use the built-in WordPress slanting, it favors /em

    • Philadelphus says:

      “/em” tags are technically slightly better as they capture the essence of what they’re being used for without mandating a particular form—so for instance a browser for blind people can see the /em tag and interpret it in some way that makes sense, vs. an /i tag that makes no sense when you’re reading content aloud. (Well, in theory anyway, in practice I’m sure browsers for the blind and whatnot deal with /i tags just fine for pragmatic reasons.)

      I think most browsers are also smart enough to switch from italics back to normal text if they encounter an /em tag within an /em tag in accordance with the general style rules of emphasizing things within italics, but I don’t know that I’ve ever tried it, so let’s find out:

      This is normal text, this is emphatic text, this is really emphatic text, back to just normally emphatic (whatever that means), back to normal.

      Edit: Ah, no it doesn’t work in Chrome at least. In fact it seems kinda buggy given that the close tag after “really emphatic text” keeps the next eight words from being italicized as they should be within a single enclosing /em layer. Oh well.

      • Henson says:

        I assumed that the reason for using /em is that it is easier to register at a glance, for anyone looking at the html code. I think a whole bunch of /i tags could give a confusing and less visually clean look.

        • silver Harloe says:

          Interesting assumption, but Philadelphus is right. The w3c is open about the history of their decisions and their motivations to make all markup semantic, and let designers sort out placement and appearance in style-sheets without sullying the beauty of their pure markup.

  13. Son of Valhalla says:

    Yeah, since I binge read through the whole site last year, you ended up posting a lot of garble during 2006. Then you hit DMotR and there’s a strange turn in content and what you post. Like more analytical and proofed writing.

    It probably helped traffic to the site, though. I imagine that because of the regular output and communication with other bloggers during your first year, the site eventually gained enough traction to have a viral hit, which ended up being DMotR.

    • Mousazz says:

      Not to shame Shamus (heh.) or anything, but 2006 was host to posts such as Bridge bunnies.

      I realize that’s a completely extreme example, but if I had to wait half as long to get a post like that, I wouldn’t believe my eyes and past experience. Shamus’ standard of quality for his blog definitely increased over the years.

  14. Collin says:

    So whats the bell curve like then? First deviation between 200-250,000 words per year, with SDs of around 30,000?

  15. Philadelphus says:

    Slight typo in “All I need to do I get the numbers out of the database.”

    Interesting analysis! Last year when my own blog hit six years old I did a similar graph of posts over time, which followed a very similar trend of decreasing (mostly due to it being simply a side project rather than a source of income for me, and having started it in college when I apparently had a lot more free time than after having a full-time job).

  16. Syal says:

    That sounds high. The average in standard English text is 5.1 characters per word

    The answer lies in swears. The average swear typically uses the standard four-letter variations, while this site tends to use more flowery insults like ‘Solipsistic’, and ‘Enormodouche’.

    Dropping conjunctions will raise the letter-to-word ratio too. Instead of “I’m going to go to the store to buy some eggs”, you can use the more efficient “Store, travel, eggs, enormodouche!”

    • The Rocketeer says:

      Look, I admire your enthusiasm, but Twenty Sided ain’t amateur hour. If you aren’t ready to brand someone an absquatulant rantallion or a dasypygal batrachomyomachist, you don’t even step up to the plate.

      • Syal says:

        Inefficiency… hurting…

        “Enthusiasm, Twenty Sided amateur ain’t. Aren’t ‘absquatulant rantallion’, ‘dasypygal batrachomyomachist’, don’t plate step. Enormodouche!”

        There.

        • Decius says:

          Doubleplusungood, citizen.

          Enthusiasm plusgood, noted. Amateurism ungood. Twentysided insults absquatulant rantallion, dasypygal batrachomyomachist, else report for reassignment to position compatible with personal limitations.

          • LCF says:

            Ungood word choice, citizen.

            Joy plusgood, noted. Not-know ungood. Twentysided insults absquatulant rantallion, dasypygal batrachomyomachist, else ask for position change in line with self abilities.

            Need shorter words. Less Latin, more Anglo, less unthought.

  17. Ninety-Three says:

    Typo patrol (I’d post this on the article itself, but that article has comments disabled): Your about page says “an nontraditional”.

  18. Matt Downie says:

    The ‘Harry Potter’ is probably not the clearest measure of word output, since the later Harry Potter books were about two Harry Potters long.

  19. skeeto says:

    This morning after reading this article, I downloaded the text of every single Twenty Sided article (apologies to your server, Shamus) because I wanted to see how the Flesch-Kincaid score changed, or didn’t change, over time. I used BeautifulSoup to extract the authorship, date, and article body as plain text, HTML tags stripped. It took a bit of massaging because you’ve got a lot of comments that aren’t the valid UTF-8 your server claims it is. As an interesting note, the text analysis also choked on this article because of the super long name of that New Zealand hill, so I had to manually delete that word.

    I predicted the Flesch-Kincaid score would decrease (i.e. require a higher reading level), reflecting an increase in article quality. I was right, but only barely. It decreased very slightly over time, but staying within a 6th grade reading level.

    Shamus’s Flesch-Kincaid score per month

    Another interesting note: Rutskarn, the only other author with a significant number of posts, writes articles at a 7th grade reading level (average score of 78).

    I’d link the highest and lowest scoring articles, but it’s actually not very interesting. The extremes contain code and/or some kind of text sample not actually written by its author (i.e. the articles about spam), so the Flesch-Kincaid score is suspect.

    I was also able to directly measure the word count, and it looks exactly the same, so your estimate is fine: Word count per Year

    If anyone wants to poke at the data themselves, here it is: http://skeeto.s3.amazonaws.com/share/twenty-sided-posts.csv

    The columns are: id, author, year, month, day, unix-epoch, length, words, flesch-kincaid. I’m happy to run another analysis, or share more of the data, if anyone asks. I’m stumped about more ways to look at this data beyond what’s already been said.

    • Shamus says:

      Thanks for sharing. That was really interesting. I’m not surprised I’m writing at such a “low level”. I think those scores are calibrated for fiction, where long sentences and complex structure are part of the appeal. But I spend a lot of time trying to make convincing arguments or explaining technical things, and in those cases clarity trumps art.

      Although I dunno. Maybe me no write good.

      • djw says:

        I teach physics, and I have come to appreciate the value of short and simple sentences for communicating complicated ideas.

        Also, frequent paragraph breaks.

      • skeeto says:

        Your point about clarity is spot on. I don’t think you should necessarily try to write to a more difficult reading level, nor is using an easier reading level bad or stupid. Everyone should use the easiest reading difficulty that effectively communicates their thoughts. Higher difficulties than necessary shrink the size of the audience for little benefit.

        For comparison, my own blog (all programming articles) has an average reading score of 73 (7th grade) with little change over the years.

        • Echo Tango says:

          I would argue that keeping a low reading-level is nearly required, if you’re talking about complicated, jargon-dense subject-matter, like Shamus. There’s enough complicated things for the reader to keep track of, without making the structure of the English sentences themselves overly complex. :)

      • Syal says:

        When I see these grade-level ratings I always wonder whether high school textbooks actually get rated at the grade’s reading level, or if everything caps out lower nowadays.

        EDIT- oh, it was made for the military. Maybe it’s using military grades then; ‘sixth-grade level’ isn’t 12-year-olds, it’s First Class Petty Officers.

      • Son of Valhalla says:

        I don’t really see writing at a higher grade level (6th grade’s fairly high on the Flesch-Kincaid scale) as being anything bad, necessarily. I would even say that the big reason why people read this site is because of that smartness in your arguments/entertainment articles.

        Just ma two centz.

      • Miguk says:

        I definitely prefer your writing to someone like the Digital Antiquarian who uses big words for the sake of using big words.

    • Worthstream says:

      Would you mind making that scraped content available somewhere? Hoping it compresses to a manageable size, since it should be mostly plain text.

      I’m thinking of using it for a project, but don’t want to hammer the servers with another scraping.

  20. GM says:

    What browser do people use? I just suddenly went back to Firefox from Chrome after the changes of a couple of years, although I used Icefox or something last I remember.

  21. Daemian Lucifer says:

    And if we add images to this,counting every image as a thousand words,and if we add video,counting every video as 60 images per second(because 60 fps 4 life,yo!),and discount sound only because no one ever said anything witty about that,we get…
    .
    .
    .
    30 seconds of time wasted on writing this lame joke.

  22. Zantaros says:

    I think that the “post every day or other regular interval” advice is still valuable, but more from the standpoint of trying to start making content in the first place and avoid writer’s block than from the standpoint of actually gaining more views.

    It’s something that Alex Steacy talks about a lot: often people who want to create content are concerned about minute details of quality to the point that it is difficult for them to start, and he advises to just start making content regularly regardless of quality in order to get comfortable with the experience of doing so.

  23. Dev Null says:

    As a quick-and-dirty solution that isn’t _quite_ as blunt-force as counting characters and dividing by 6.6, you can count the word breaks. This solves the problem of _most_ HTML tags, because [em]thing stuff fish[/em] still counts as 3 words. It breaks on multiple consecutive spaces and weird punctuation like ” – “, but it’s still liable to be close.

    A quick google search found me this solution for multiple consecutive spaces:
    http://stackoverflow.com/questions/6940646/mysql-how-to-remove-double-or-more-spaces-from-a-string

    So using clean_spaces from that post (you might need to tweak it to remove double line breaks too), something like:

    SELECT YEAR(post_date) AS Year,
    SUM(CHAR_LENGTH(clean_spaces(post_content)) - CHAR_LENGTH(REPLACE(REPLACE(post_content, ' ', ''), '\n', '')) + 1) AS 'Word Count'
    FROM wp_posts
    WHERE post_author='1' AND
    post_status='publish'
    GROUP BY YEAR(post_date)
    LIMIT 10000;

  24. Mike C says:

    That sounds high. The average in standard English text is 5.1 characters per word

    Does that average include spaces? Your count of your own posts (dividing total characters by word count) is including spaces, so if that referenced standard doesn’t, that explains at least 1 character of the discrepancy right there.

    I just tried to find a source, and was led to Wolfram Alpha, which cites an average of 5.1 characters per English word. I read that to mean that whitespace and non-word punctuation are not included.

    Subtracting that one character for spaces, and, say, a half character for punctuation including HTML delimiters, and you’re a lot more in line with the 5.1 figure.

  25. Some nerd says:

    I feel like that first graph should be normalized by number of days in the year. Leap years won’t change much, but the first and last year are huge outliers and it might be good to get a sense of where you are this year compared to last year.
