Video Game Chart Party 2: Chart Harder

By Shamus Posted Sunday Feb 2, 2020

Filed under: Video Games 48 comments

Two things happened last week[1]. First, I published that chart dump of video game data. Second, I got another delivery. Late last night someone threw a brick through my window. Once I finished sweeping up the glass, I noticed it wasn’t actually a brick. It was a MySQL database, wrapped in a note that said, “Don Data sends his regards.”

This data is a lot more complete than the last batch. It has information on publishers / developers, which might be useful if I knew what to do with it. But whatever. Let’s make a new version of the charts from last week:

Here I’ve limited the chart to the twenty years from 2000 to 2019, since the data outside of that range is too sparse to use. Also, this chart just shows the range between 60 (worst game ever) and 80 (best game possible).

Last week some people expressed concern that my methodology was off. Specifically: what happens if Shoot Guy 4: Shoot ’em All gets 100% from one thousand critics, while the slightly obscure Punch Guy 3: Knuckles of Doom gets a single review of 0%? If those are the only two games that year, won’t that result in an average of 99.9% for the year?

Short answer: No. Long answer:

I don’t have individual votes. I only have the per-game aggregates, and in the yearly average each game counts once, no matter how many critics reviewed it. So in that scenario, the average would be 50%, not 99.9%. Having said that, I can’t guarantee that this information is correct or useful. Maybe it’s inaccurate. Maybe it’s accurate but I imported it wrong. Maybe I imported it right, but then I messed up the pivot table in the spreadsheet.
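To make that concrete, here’s a quick sketch in R. The numbers are made up for the two hypothetical games; it just shows the difference between averaging the per-game aggregates (which is what my charts do) and weighting each game by its number of critic reviews:

# Hypothetical per-game aggregate scores and critic counts
scores  <- c(shoot_guy_4 = 100, punch_guy_3 = 0)
critics <- c(shoot_guy_4 = 1000, punch_guy_3 = 1)

# What the charts do: every game counts once, regardless of how many critics reviewed it
mean(scores)                    # 50

# Weighting each game by its critic count would give the feared ~99.9
weighted.mean(scores, critics)  # 99.9001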

Moreover, I don’t have a background in statistics so I’m not an expert at handling data like this. Last week people asked for the mode and the mean of these scores. I can manually make the charts that people ask for, but in the long run it seems like it would be more efficient if I just pass along what I have and let everyone see the raw data for themselves.

I’ve spot-checked a few games in the list and they seem to match what I find online, but there are over 4,000 games in this list and there’s no way I could check them all.
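In the meantime, if you want the mean or the mode yourself: once a year’s worth of scores is in a vector, both are one-liners in R. (R doesn’t ship with a statistical mode function, so the helper below is the usual workaround. The numbers here are placeholders, not real data.)

year_scores <- c(81, 74, 74, 90, 62, 74, 55)  # placeholder scores for one year

mean(year_scores)    # arithmetic mean
median(year_scores)  # middle value

# Most frequent score; ties resolve to whichever value table() lists first
stat_mode <- function(x) as.numeric(names(which.max(table(x))))
stat_mode(year_scores)  # 74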

Let’s Talk About Metacritic

Every time Metacritic comes up in conversation, there’s always this back-and-forth where people speculate how it works and we try to find a signal within the critical noise. Also, stuff like this does not inspire confidence:

I can't give away our secret, but we came up with the idea of using MATH on DATA. Our engineers are very clever.

First off, they advertise the site as a service to “help you find stuff you’ll love”. That’s… that’s not how people use the site. I’ve been in this game critic business for a long time, and I’ve never heard of anyone scrolling through Metacritic to find games. The site doesn’t even contain information required to make that a worthwhile thing to do. Like, if I wanted to find top-down turn-based fantasy RPGs based on the Dungeons & Dragons license, then Metacritic can’t help me. Sure, it can help you look for “action games”, but that’s like saying you need a restaurant guide to help you find “food”. If you’re looking something up, then you probably have something specific in mind.

Sometimes people consult Metacritic to guide purchasing decisions for games they’ve already found elsewhere, but that’s it. For the most part, Metacritic is part of the post-launch conversation, where we try to figure out what the public thought of a given game.

But fine. Metacritic either doesn’t understand how people use the site, or the leadership is trying to re-position the site within the market. Whatever.

The more concerning thing is that the site brags about their “proprietary” system for calculating the metascore. That’s alarming. I always assumed the metascore was just an average of critical reviews. Which of these things is true:

  1. Metacritic is trying to sound smart by claiming that taking the average of a bunch of numbers is a “proprietary” technique.
  2. Metacritic isn’t actually taking an average. Instead they’re weighting certain scores (how?) by certain reviewers (who?) to make a more accurate (according to whom?) number. If the site is massaging the numbers according to unknown criteria, then how do we know they’re not outright manipulating scores to make money? Without transparency, there can’t be any trust.

If they’re doing #1, then they’re dishonest. If they’re doing #2, then they’re being REALLY dishonest.
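For what it’s worth, the difference between those two options is easy to illustrate. Here’s a toy sketch with made-up scores and weights (not Metacritic’s actual values, which nobody outside the company knows):

critic_scores <- c(outlet_a = 90, outlet_b = 60, outlet_c = 60)

# Option 1: the plain average
mean(critic_scores)  # 70

# Option 2: some unknown scheme decides outlet A "counts more"
weights <- c(outlet_a = 3, outlet_b = 1, outlet_c = 1)
weighted.mean(critic_scores, weights)  # 78 -- same reviews, different number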

I’ve always wondered why Metacritic seemed like such an awkward site. Then I glanced at the ToS page to find:

[Screenshot of the Terms of Use, naming the site’s Old Media parent company.]

That explains it. Metacritic is owned by an Old Media dinosaur, which is probably why they think having an opaque “proprietary” system with no accountability is a selling point. Thirty years ago, this would impress people. To tech-savvy gamers, it probably triggers instant suspicion and distrust.

An average of all critic scores is data. Metacritic’s number is that data, plus or minus an unknowable random number, and I really don’t see what value that extra randomness adds.

But whatever. Here’s one more chart I pulled out of the data:

This is the average number of ratings each game received for the year in question. It isn’t a measure of how much people liked a game; it’s just a measure of how often they voted.

It might be tempting to look at this as a measurement of the popularity of Metacritic itself. On the surface, this suggests that Metacritic peaked in 2011 and has been gradually declining in relevance since then. However, remember that my data is limited to PC releases. This downward trend from 2011 roughly coincides with the Great Indie Deluge on Steam. Let’s have another look at this chart from last week:

That’s the number of PC games on Metacritic. Instead of seeing the previous chart as a drop in the relevance of Metacritic, it might be more accurate to see it as an increase in the number of tiny games that get very little attention.

Maybe we can control for this by looking at other numbers, but I feel like I’m running into my limits as a statistician. I figure the best way to deal with this is to just crowdsource it. I know there are plenty of people with a lot more experience at this, and we’re likely to get better results if those folks can see the data for themselves. So here it is.

vg_data_02-02-2020.zip

That’s a simple comma-delimited text file[2] with all of the 4,331 games in the database. The fields are in this order:

  1. Title of game
  2. The platform, which is always “PC”.
  3. The release date.
  4. Publisher
  5. Developer
  6. Metacritic’s proprietary bastardization of the aggregate critic score. (Hopefully this is something close to the average.)
  7. The number of critics that submitted scores.
  8. The average[3] user rating.
  9. The number of user ratings.

Note that there are holes in the data. Sometimes publisher and developer fields are blank. Also, games like Total War: THREE KINGDOMS – Mandate of Heaven don’t have any user ratings. This shows up in the list as a -1.
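If you want to pull the file into R (any language with a CSV reader will do), a minimal loading sketch might look like this. The column names are mine, not anything official, and the -1 user scores get converted to missing values so they don’t drag down any averages:

games <- read.csv("vg_data_02-02-2020.csv", header = FALSE,
                  col.names = c("title", "platform", "release", "publisher",
                                "developer", "critic_score", "n_critics",
                                "user_score", "n_users"),
                  stringsAsFactors = FALSE)

# Treat the -1 sentinel as "no user score" rather than a real rating
games$user_score[games$user_score == -1] <- NA

# Example: average critic score per release year
# (assumes the dates come through year-first, e.g. 2019-05-23; adjust if not)
games$year <- as.integer(substr(games$release, 1, 4))
aggregate(critic_score ~ year, data = games, FUN = mean)

If titles with embedded commas trip up the import, the quoting in the file may need attention; see the comments below for the gory details.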

You could do a lot more with this data if we had numbers for Playstation / Xbox / Nintendo titles for comparison, but this is what we have for now. If you come up with some cool info, then leave a link to your results on your blog, Pastebin, Imgur, Instagram, GitHub, PornHub, or wherever people upload this sort of thing these days. Leave the link in the comments below[4] and we can all have a look. Maybe we’ll learn something new. Maybe we’ll just waste each other’s time and get lost in pedantic arguments about mean vs. mode. I don’t know. Assuming we learn something, I’ll create a follow-up post later.

Hopefully Don Data will have more data for us in the future.

EDIT: In the comments, Lino expressed an interest in having the genre info. So here is a dump that includes it:

vg_data_02-02-2020-wgenre.zip

I don’t know how accurate it is, and I’m willing to bet it isn’t complete, but there it is.

 

Footnotes:

[1] Correction: More than two things happened last week. However, I only care about two of them right now.

[2] I mean, the file is inside the zip. Just be glad I didn’t stick it inside a .7z file.

[3] We assume.

[4] I’m kidding about PornHub. I’m 99% sure my spam filter will eat any such links.




48 thoughts on “Video Game Chart Party 2: Chart Harder”

  1. Grimwear says:

    It’s finally my time to shine. It’s all on the front page boss.

  2. pseudonym says:

    One thing that will happen when going through a big dataset is that you will find patterns. That is a given. These patterns may or may not mean something. They might just as well be random. So going through a dataset looking for patterns is an incorrect approach. Wikipedia calls this data dredging: https://en.m.wikipedia.org/wiki/Data_dredging

    The correct approach is to think of a hypothesis. For example “user and critic scores have been diverging” or “the metacritic overall score is the average”. Then this hypothesis can be tested with a reasonable chance of not being a false positive if the data is sufficient.

    So are there any hypotheses that need to be tested?

    1. Decius says:

      One valid use of dredging data for patterns is to form hypotheses about future data. It’s dangerous, because about half of all binary propositions will be true in future data.

    2. kincajou says:

      I agree with this sentiment, although I would argue that the hypothesis doesn’t necessarily need to be hyper-precise (e.g. a simple “is there a correlation between user and critic scores?” is already a good starting point as to what specific data to look for).

      I would also like to add that, personally, I find there is very little advantage to presenting the data as bar charts rather than “scatter” or “line” plots. If you’re interested in trends, the line graph will give you the cleanest representation (especially if you want to look at more than one thing at a time). If you want to be more accurate, the scatter plot is probably the best; “scatter + line” would give you the advantages of both but starts being visually cluttered.

      Finally, a graph that would be interesting to plot is user review scores vs critic review scores (x vs y, make it a scatter) where you can use the standard deviations to create your error bars. If users and critics are consistent with one another, then you’d expect all your data points to lie on the diagonal; as soon as you diverge from that diagonal, either critics or users are being more generous. By looking at the size of the error bars you can also tell quite how varied the data is between people.

      This would answer the question “do users and critics score games equally?” It would also allow us to see if there are divergences at big/small scores (“do great/very poor games cause large differences between critics and users, whilst ‘average games’ are considered universally ‘average’?”)

      There is enough here that in an academic environment you’d probably get a solid 1-2 publications (maybe low-level, but nonetheless) out of this lot of data, with proper statistical analysis, method explanation, and a suggestion for the observed trends (if it were in a scientific field; I don’t know how much interest exists for video game sales data).
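      For reference, the basic version of that scatter is only a few lines in R, assuming the dump has been loaded into a data frame (hypothetical column names critic_score and user_score, both on the same 0-100 scale; the per-game standard deviations aren’t in the dump, so no error bars here):

      library(ggplot2)

      # games: data frame built from the article's dump; -1 user scores already set to NA
      ggplot(na.omit(games[, c("critic_score", "user_score")]),
             aes(x = critic_score, y = user_score)) +
        geom_point(alpha = 0.3) +
        geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # the "agreement" diagonal
        labs(x = "Metacritic critic score", y = "Mean user score")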

    3. Gethsemani says:

      It might be worth noting that in Academia Data Dredging is very much looked down upon, as it is generally considered a last ditch effort to get something, anything from your data after you’ve failed to prove/disprove your initial hypothesis. In that capacity, as an attempt to salvage a research project or thesis, it is often combined with unsavory tricks like the Texas Sharpshooter (cherry picking data clusters and/or patterns) to let you get something out of your data instead of having to discard it because it failed to prove or disprove your hypothesis.

      On the other hand, a hypothesis for data gathering purposes can be very, very broad. “Is there a correlation between the numbers of released games between 2010 and 2020 and the average user review score between those years?” is a pretty broad question and can lead to a lot of follow up studying. Like what is the correlation, is it universal or specific to some genres, developers or even game series? In that way, exploratory analysis is not necessarily bad, as long as you remain stringent in forming a hypothesis for each of your follow up questions instead of just randomly browsing the data waiting for a wild pattern to appear.

  3. hewhosaysfish says:

    I’m no statistician either, but the numbers in the infographic average to 58.13(3).

    But maybe somebody in marketing just made up numbers that they thought looked interesting, rather than actually using The Algorithm.

    1. sheer_falacy says:

      That’s actually hilarious. Hopefully someone was just making up numbers, but they sure did create a thing that makes Metacritic look like garbage: here are these review scores ranging from 21-85, with a heavy weighting on the mid-sixties! We decided this should become 82.

    2. Dev Null says:

      Beat me to it. Looking at their own brag-ad and seeing 30 numbers, 5 of which are _barely_ higher than 82 and 25 of which are _significantly_ lower, I’m thinking someone didn’t think this through very well. Unless the ad is to pitch their services to game developers, in which case I can see the appeal. “Turn those frowns upside-down!”

  4. ivan says:

    7. The number of critics that submitted scores.

    Is that how they labelled the column? I dunno how Metacritic works, but is that how it works? Critics have to submit their scores for something? Or, more probably, they have to apply to be allowed to submit their score. And do they have to do it manually, for every single game, or what?

    1. Shamus says:

      Internally? I have no idea how Metacritic labels their data. I’m just describing the text file. I’ve always assumed that you have to apply to become a recognized critic by the site and submit scores, but I don’t know how it really works.

      1. ivan says:

        I really can’t imagine that’d be how it works. What incentive is there for a reviewer to submit anything? It’s extra work, in order to give their audience an alternative to watching/reading their review directly. Unless Metacritic pays them for submitting their scores, that just seems like a lose-lose deal.

      2. Falcon02 says:

        I recall reading/hearing other critics (e.g. Jim Sterling) complaining Metacritic converted their score-less reviews into estimated scores without asking, so Metacritic could have more data.

        So my impression is Metacritic actively looks for reviews, converts them to their scale, and integrates them without requiring active critic participation/applications. Metacritic has every incentive to pull in as many critic reviews as they can to ensure continued relevance. The critics… not much reason to care about being in Metacritic… they already reach their audience, and Metacritic just dilutes their critique.

      3. I also doubt that Metacritic requires critics to submit their own scores. Reviewers that use more of the 0-10 score interval than others sometimes get a ton of angry fans leaving comments on their sites, because Metacritic uploaded their score onto the site, and it stuck out like a sore thumb compared to the other scores. I’m thinking of people like Tom Chick, who use more of the interval and sometimes have opinions out of step with the standard critic.

  5. Lame Duck says:

    “I’ve never heard of anyone scrolling through Metacritic to find games.”

    I occasionally do that…kind of. Specifically, I look through the list of all games for an older console to see if there’s anything I’m interested in that I missed and sorting by Metacritic rating at least pushes the real awful schlock like tie-in games and Sonic 2006 down to the bottom of the list. Beyond that, the ratings don’t influence my interest, though.

  6. John says:

    Ah, fresh meat.

    Thanks for the link, Shamus. It’s unfortunate that the data is aggregated in this way, as it limits the analysis we can do. It’s unlikely that we’ll be able to say anything conclusive, statistically speaking, but I have a few exercises in mind already. I was going to waste my afternoon playing video games, but this is much better.

    1. John says:

      Update:

      Writing the data-import code in Java is proving trickier than expected. This may be a comma-separated text file, but some of the fields sometimes contain commas as part of their data. I blame “Invisible, Inc.”, “Atari, Inc.”, and all those games that were so inconsiderate as to have separate publishers for separate regions.

      You know, when I started this morning I did not expect to have to write nearly so much string-parsing code.

      1. SupahEwok says:

        Could Shamus export as tab-delimited? I can see both commas and periods causing issues with game titles, but tab-delimited should get around that.

        Edit: Also, you may want to give R a try. Free open source stats program. RStudio is the IDE we’re using in my grad school statistics classes.

        I’d give it a shot at applying some stuff from class to this myself, but I’ve actually got to catch up on homework from said class for the last couple of weeks today…

      2. King Marth says:

        There should really be existing libraries to do this for you; CSV is an old format. The canonical approach is that fields containing commas are escaped by quotation marks, to distinguish the one-column “Invisible, Inc.”, from the two-column Invisible, Inc.,
        Literal quotation marks are then doubled.

        org.apache.commons.csv.CSVParser should cover you. Instead of the fiddly understandable problems of string parsing, try the fiddly opaque problems of integrating external libraries!

        Alternatively, open the CSV in Sheets and re-export as tab-delimited; tabs are less likely to be in these fields, even though the same escaping rules apply. Note you’ll still need to handle “” cases.

      3. John says:

        . . . and the text parsing is done. After correcting for fields with commas in them, I also had to account for records with missing publisher or developer data and perhaps a dozen lines with carriage returns in the middle. Now, for the analysis!

    2. John says:

      Some preliminary analysis for 2010. A histogram for the difference between the Metacritic reviewer score and the average Metacritic user score on a per-game basis (figures in parentheses are relative frequencies).

      -40 | 1 (0.01)
      -30 | 2 (0.01)
      -20 | 3 (0.02)
      -10 | 40 (0.26)
      0 | 72 (0.48)
      10 | 19 (0.13)
      20 | 11 (0.07)
      30 | 2 (0.01)
      40 | 1 (0.01)

      For example, there is one game where the user score exceeded the critic score by between 30 and 40 points and two games where the user score exceeded the critic score by between 20 and 30 points. At the other end of the scale we have two games where the critic score exceeded the user score by between 30 and 40 points and one game where the critic score exceeded the user score by at least 40 points. We can see that about 48% of the time, the critic score exceeds the user score by between 0 and 10 points and that the critic score is within ten points of the user score 74% of the time. (I’ve done histograms with finer detail, but I’m not sure that it would be useful to reproduce one in a comment thread.) I’m still considering the most useful and appropriate hypothesis tests given the nature of the data. Also, I should go back and compute the exact median.

      Here are some notable games from 2010, by which I mean games for which the review and user scores differed by more than 20 points. There are more than I thought there’d be. The figure in parentheses is the difference between the reviewer score and the user score. A positive number means that reviewers (on average, presumably) liked it more than users and a negative number means the opposite.

      Starpoint Gemini (-37) with 9 critic reviews and 57 user reviews
      World of Warcraft: Cataclysm (35) with 53 critic reviews and 1094 user reviews
      Tom Clancy’s HAWX 2 (29) with 12 critic reviews and 55 user reviews
      Call of Duty: Black Ops (27) with 29 critic reviews and 1839 user reviews
      Darkstar: The Interactive Movie (-27) with 7 critic reviews and 15 user reviews
      FIFA Manager 11 (21) with 9 critic reviews and 26 user reviews
      Arcania: Gothic 4 (23) with 25 critic reviews and 403 user reviews
      Blade Kitten (-23) with 10 critic reviews and 43 user reviews
      Cart Life (21) with 8 critic reviews and 21 user reviews
      Magic: The Gathering – Duels of the Planeswalkers (21) with 9 critic reviews and 50 user reviews
      Tom Clancy’s Splinter Cell: Conviction (29) with 20 critic reviews and 738 user reviews
      The Settlers 7: Paths to a Kingdom (28) with 33 critic reviews and 224 user reviews
      Command & Conquer 4: Tiberian Twilight (44) with 71 critic reviews and 680 user reviews
      Silent Hunter 5: Battle of the Atlantic (25) with 26 critic reviews and 109 user reviews
      Hotel Giant 2 (34) with 7 critic reviews and 21 user reviews
      Vancouver 2010 – The Official Video Game of the Olympic Winter Games (23) with 12 critic reviews and 16 user reviews

  7. Michael Anderson says:

    I agree that the ‘old media’ model of ‘super secret algorithm’ is troubling. I much prefer when I can read something like 538, where they get into the details of their models and algorithms in a way that is actually useful and leads to further discussion in the analysis community (and then further refinement and iterations).

    My understanding of how Metacritic works from many years ago when they were less opaque is like this:
    – Everything needs to be put on a 0-100 scale,
    – Everything needs to be weighted so that a 10 point difference in scores means as close to the same thing as possible across sites.
    – We weight things based on either ‘reputation of critical site’ or ‘money spent on advertising’, depending on who you ask.

    Standardizing and Normalizing scores is actually very complex – you have some sites that use a 0-100 scale, while others use a 0-10, others use ‘stars’ generally 4 or 5 as a top mark, and for a while they even counted a few ‘Buy it / Try it / Skip it’ sites (I did some reviews for one of those, they dropped the site ages ago).

    Suddenly you need to come up with a way of equating a 78% with an 8/10 (easy), 4/5 (still OK), and maybe 3/4 stars (getting tougher). But say that first site gives a game 70/100: then you get a 7/10, but 3/5 and 4/5 are equidistant and 3/4 stars remains the best guess. And also remember you need to invert these, so that when you get a 4/5 star you know it can map anywhere from 70 to 89 on your grand scale.

    Then you deal with the ‘scale usage’ – what used to be referred to on forums as the ‘7-9 scale’ … this means that for most sites, you will see nearly everything fall within a rating of 7-9, and even the most discerning sites rarely used below a 6 and the 10 was likewise incredibly rare. That further compresses the range to a 3 point differential.

    I recall Metacritic saying that they parsed the text of the review to help differentiate a score on the 100% scale from whatever small-scale score the site used. Sounds very tricky.

    And then you have to account for site bias – some sites have a mean of 4/5, others perhaps 3.5 or 4.5 based on how they distribute the stars. Metacritic used to say that they took a sample of common games used across most sites as a ‘leveling metric’.

    All of this stuff makes it incredibly complicated and also very tied to how you choose to design the weighting and conversion algorithms … and you are starting with something very subjective in nature that seldom scales with time! (I did a few hundred game reviews for a small site in the early 00s, and looking back through them I found that my scores on a 5-star scale would ebb and flow, and occasionally feel random, or not really be about the game in question at all.)
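    As a toy illustration of just the first step (a naive linear rescale onto 0-100, before any weighting or bias correction), something like the snippet below works; it also shows why going the other way is lossy. This is a made-up sketch, not Metacritic’s actual conversion:

    # Naive linear rescale: a raw score on a 0-to-max scale mapped onto 0-100.
    # This is an illustrative guess, NOT Metacritic's real conversion rule.
    to_100 <- function(raw, max_score) 100 * raw / max_score

    to_100(8, 10)  # 8/10      -> 80
    to_100(4, 5)   # 4/5 stars -> 80
    to_100(3, 4)   # 3/4 stars -> 75

    # The inverse is ambiguous: anything from roughly 70 to 89 could have been
    # the same 4/5-star review, which is where the guesswork (and any secret
    # weighting on top of it) creeps in.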

    Look forward to checking out the data! :)

    1. Decius says:

      Note that Metacritic appears to use a 40-point scale, from 60-100. The 6-9 point scale can easily map to that.

  8. LCF says:

    “Hopefully Don Data will have more data for us in the future.”
    Hopefully next batch won’t come attached to a freshly-severed horse head.

    1. Lino says:

      As long as the horse doesn’t come from Troy, I wouldn’t worry about it…

  9. Lino says:

    Thank you very much for sharing! From playing with the data, I found that out of the Top 100 titles with the highest difference in Critic and User Score, 81 of them had a publisher. Out of those titles:
    – 13 were published by EA
    – 9 were published by EA Sports
    – 6 were published by Activision
    – 6 were published by Blizzard
    – 5 by Ubisoft
    – 4 by Sega
    – 4 by Konami
    Although it should be noted that the title with the highest difference (Out of the Park Baseball 17, with a difference of 60) was self-published.
    In terms of the big picture, however, I don’t think having a publisher is a good predictor of difference in scores (if anyone’s interested, here are a couple of charts I drew up). If I have the time, when I get back home I’ll try to apply a predictive model to see if that’s the case.
    I’ll also try and determine the genre of the games. But I’ll need to find a public dataset of the Steam tags for each game (I don’t know if Steam DB has that info in an easy to access place). I’ve already matched 2163 titles with their genres, using the dataset evilmrhenry linked in last week’s Comment Section. Hopefully, I’ll manage to match the rest of them.

    1. Shamus says:

      My source data had genre info. I haven’t looked to see how much of it was filled in, but I know I saw some genre descriptions in there. If you think it might be useful, I could get you a dump of the table with genre included.

      I’ll also note that genre info is really random. For Far Cry 3:

      Wikipedia says: First-person shooter
      Steam says: Action, Adventure
      Metacritic says: Action, Shooter, Shooter, First-Person, Modern, Modern, Arcade

      So… that’s pretty random.

      1. Lino says:

        Yes, please! Although it could take some cleaning, it might be useful…

          1. Lino says:

            Thank you!

      2. Decius says:

        Arcade?

        I’m not particularly clear on what that tag means, but if Far Cry 3 matches it I have no idea what Metal Slug is.

  10. evilmrhenry says:

    I played with the data a little, and I think you might have included -1 user review scores in the average. Obviously this would be an issue.

    1. Part of the problem is that the data has -1 user review scores for a lot of games with a non-zero number of user reviews, so just filtering out games with no user reviews won’t remove them all, even though it should.

      1. Duoae says:

        Yeah, I was coming to leave a similar comment. Coffee Talk – the first entry (according to the way the file was parsed for me) has a user score of “-1” but no. of users as “3”… Of course, cross-checking that against what’s on the metacritic website also causes more confusion because there are “7” user ratings with an average score of “9”.

        So, there’s more going on here with regards to user reviews. I checked this against “All Walls Must Fall”, which was a 2018 release, in order to remove titles which are currently being updated (i.e. recent releases), and found that it has 2 reported user reviews but the webpage is reporting that 2 more user reviews are required before a user score is shown. I assume that each of these values is correct, meaning that the user score requires a minimum of 4 reviews in order to be calculated… which seems reasonable to me (though maybe the minimum required number could do with being higher).

        With regards to the way Metacritic scores reviews (https://www.metacritic.com/game/pc/all-walls-must-fall/critic-reviews), I’ve only checked out a couple of titles, but it appears to be a simple average (though rounded based on the “true” numbers, not the rounded individual scores) of the scores they present for the individual critic scores (for example, All Walls Must Fall averages to 71.27, which is rounded to 72). However, to get the individual scores, there is some sort of secret sauce being applied.

        E.g. GameStar reviews the game as 3/5, 3/5, 3/5, 3/5 & 4/5. I average this to 64%, but the score assigned to the review is 69.

        In contrast, IGN Spain reviews it as 6.8 which is translated to a score of 68 – which makes logical sense to me.

        Given that all the other reviews appear to be direct translations to the final individual scores, it’s weird that this one was not. I should note that outlets that did not assign any “rating” to a game were not counted in the average (i.e. there are 13 critic reviews of All Walls Must Fall, 11 rated; the database counts 11 and there’s no “interpreting” of the unrated reviews, but they are there for users to read).

        1. Paulo Marques says:

          > the first entry (according to the way the file was parsed for me) has a user score of “-1” but no. of users as “3”

          Well, that one is easy to explain: games need 4 scoring sources to show a score – which also implies that the data came from web scraping the page, not some leak.

  11. Adam says:

    I wonder if this can be matched up to, say, the Steam API for per-game data https://steamcommunity.com/dev

    Here’s a hypothesis – difference between critic and player reviews is related to game length i.e. longer games are less fully played by reviewers than by players. https://howlongtobeat.com/ or similar might be able to provide the data to test this somehow??

  12. GoStu says:

    Aggregating review scores like this seems like a pain.

    While simply presenting a straightforward average sounds tempting, I’m not sure it’s possible without some manipulation on the part of Metacritic. There are some outlets that review neatly in easily-converted numbers, but some that refuse to assign numerical values. I’ve heard of at least one reviewer who uses a scale of “buy it”, “buy it if it’s on sale”, or “don’t buy”.

    (As an aside, I’d love to see a review site that assigned ratings in the style of Michelin Stars: “no star” isn’t to be taken as negative, one star being a standout good example of the genre, two stars to be “get it even if you’re not normally into that genre”, and the fabled three stars being “I don’t care if you don’t even have the platform, get it just to play this game”)

    I can absolutely understand weighting some outlets differently too. If there’s a popular but sensationalist reviewer who’s apt to slap a hard zero onto something, maybe dampen them a little, etc. Of course, once you open Pandora’s Box of meddling with numbers, then you’re really slipping into the territory of assigning your own opinion…

    There’s no easy solution here to being a Metacritic.

  13. Looking year-on-year at the games that don’t have a -1 mean user rating, one thing I’m seeing is that the last five years or so show that the difference between the Metacritic score and the mean user score is starting to trend downwards, i.e. the users are tending to be more critical. This is true even if you look at the middle 50% of the differences, so you’re excluding the games getting bombed with a ton of negative user reviews for whatever reason.

    R script below, for people using it. The “-” symbol in the definition for y will need to be retyped if you paste it into RStudio; it’s being re-encoded when written here.

    library(ggplot2)
    library(lubridate)
    library(data.table)

    x <- fread("vg_data_02-02-2020.csv",
               col.names = c("title", "platform", "release", "publisher", "developer",
                             "metacritic.score", "n.critics", "mean.user", "n.users"),
               colClasses = c("character", "factor", "Date", "factor", "factor",
                              "integer", "integer", "integer", "integer"))
    x[n.users == -1]$mean.user <- NA_integer_
    x[n.users == -1]$n.users <- 0L

    y <- x[!is.na(mean.user) & mean.user != -1][, c(.SD, .(score.diff = mean.user - metacritic.score))]
    tidied <- melt(y[, .(release = as.factor(year(release)),
                         metacritic = metacritic.score, users = mean.user,
                         `user preference` = score.diff)],
                   id.vars = "release", variable.name = "scorer", value.name = "mean score")
    ggplot(tidied,
           aes(x = release, y = `mean score`, fill = scorer)) +
      geom_boxplot()

  14. evilmrhenry says:

    I feel Metacritic should just switch user reviews to “positive, neutral, negative” directly, instead of having a 0-100 scale where 0 and 100 are the most common values.

  15. Lino says:

    I’ve got some work to do today, but I had time to do some cleaning on the data. Here’s what I’ve got so far.

    Once I get some work done, I’ll play with the data some more.

    A quick run-down of the columns:

    – Columns 1-9 are just like in Shamus’ original file
    – Col 10 – date converted to year
    – Col 11 – does the game have a publisher
    – Col 12 – if there’s a user rating, what’s the difference between the Critic and User Rating
    – Col 13 – genre, based on the dataset shared by evilmrhenry in last week’s thread
    – Col 14 – the genres from Shamus’ original file. I’ve cleaned them up somewhat – removed some of the redundancies, and somewhat matched them to the file linked by evilmrhenry, where the genre data was better
    – Col 15 – joining Cols 13 and 14.

    Sheets 1 and 2 are used for manipulating the data. Out of them, Sheet2 is the only one that’s worthwhile – on the left, it’s got the cleaned genres from Shamus’ file, and on the right – the genres from evilmrhenry’s file.

    There are some disadvantages to my file. For one, Cols 13-15 are basically “WELCOME TO THE JUNGLE!”. I’d like to do some more work on it, but I don’t know if that’s gonna happen today…

    1. Lino says:

      Some disadvantages I didn’t have time to edit into my original comment are, among others, that my cleaning has very likely led to some non-trivial loss of nuance when it comes to genre. I’m also not crazy about every other game being classified as Action (WTF does that even mean!?!?).

      But it’s what I’ve got for now (some WIP graphs notwithstanding).

  16. Duoae says:

    The only really interesting thing I can see (without going into genre data) is that for the year ranges below, users rated games more favourably than the median user ratings per year across the whole range.

    What I mean is that the number of games with “higher or equal user ratings” to “critic ratings”, divided by “the total number of games included in the dataset for that year”, as a percentage, was higher than the median value for the entire data range (1996-2020). This doesn’t mean that a user rating was higher than a critic rating, but it does mean that the game had a better “perception” than the median for all games on a yearly basis. I suppose you can argue whether this really classifies as “perception” :).

    User perception per year >= median total user perception across all years
    1996-1999 – 2 years
    2000-2009 – 8 years
    2010-2019 – 2 years

    This might imply that in the decade of 2000-2009 users were happier with their gaming experiences on the whole and, conversely, were more dissatisfied the following decade. I don’t really have enough data for the decade prior, but from the limited data we have available, it seems users were about even (2 years out of 4 possible).

    Critic perception per year >= median total critic perception across all years
    1996-1999 – 2 years
    2000-2009 – 2 years
    2010-2019 – 7 years

    Conversely, critics had a better “perception” rating compared to the median of the percentage of titles rated more favourably by critics over the decade of 2010-2019 but not the prior two decades – with the same caveat applying to the 1996-1999 time period.

    If we take each year within the same year ranges used above as a function of percentage of games that were scored higher or equal to critics by users:

    Percentage games scored equal to or higher than critics per year
    1996-1999 – 1 year
    2000-2009 – 5 years
    2010-2019 – 0 years

    My totally unwarranted conclusion is that, while I think it’s safe to say the last 10 years haven’t been devoid of good games, there’s been a confluence of trends where users are rating games more poorly, reviewers are rating games more positively, and there are increasing amounts of games being released. I mean, almost 60% of games reviewed in this dataset were released in the period of 2010-2019.

    This might reflect the reduced ability of reviewers to competently or thoroughly review an increased number of titles, more focussed reviewing (i.e. speciality outlets focussing on titles in their chosen genre might review them more favourably than an outlet that is more generalist in coverage), and/or the increased ease of access and knowledge of the user reviewing site, Metacritic (i.e. more users are “online” than ever and can become involved in review campaigns from their usual internet haunts), and the increased ability of users to “out-review” the reviewers simply through their pure numbers…

    1. Duoae says:

      What does appear to be interesting is that user “use” (i.e. number of reviews per year) of Metacritic “peaked” in 2013, the year of release of both PS4 and Xbox One, with numbers normalising to pre-2011 levels in the 2016-2017 period. However, this trend doesn’t have a correlation with the “perception” of users, which sees a precipitous downward trend from 2015 onwards, with 2013 and 2014 being the last two “positively perceived” years.

      Could this uptick in users from 2010-2013 have been related to growing dissatisfaction with the quality of console gaming experiences available on the Xbox 360 and PS3, combined with better network access and reductions in computing hardware costs, resulting in users migrating to the PC platform?

      Shamus, do we have the data for the consoles included in the MySQL database? It might be interesting to compare each “class” of user and usage amount over the same time period.

      [edit] I just noticed that this “peak” in the PC data also correlates loosely to the rise and fall of WoW concurrent subscriber numbers. WoW (apparently) hit its peak of 12 million in 2010 and then slowly reduced to 10 million in 2014 and by 2015 that sharply reduced to 5.5 million. (From wikipedia).

      According to this page, if it can be believed, that’s since fallen even more.
      https://www.statista.com/statistics/276601/number-of-world-of-warcraft-subscribers-by-quarter/

      It could reflect a general reduction in people playing, or caring about higher end gaming experiences. Sure, the PC market is the largest of all markets, but a large proportion of that is very low-end web browser stuff that doesn’t require graphics acceleration. Same as how the mobile market is “huge” but I bet there’s more people writing about each Call of Duty than there are about each Zen solitaire (or whatever a variant might be called).

      1. Shamus says:

        “Shamus, do we have the data for the consoles included in the MySQL database?”

        No. Just PC games.

        1. Duoae says:

          Ah, okay. Nevermind then. Thanks :)

  17. Ninety-Three says:

    Metacritic isn’t actually taking an average. Instead they’re weighting certain scores (how?) by certain reviewers (who?) to make a more accurate (according to whom?) number.

    Yes, this is what’s happening, and they don’t try to hide it (it’s at the top of their FAQ). You can prove that they’re doing something other than taking a raw average by looking at games with few review scores where it’s common to notice that the simple average does not equal the METASCORE™. Supposedly:

    This overall score, or METASCORE, is a weighted average of the individual critic scores. Why a weighted average? When selecting our source publications, we noticed that some critics consistently write better (more detailed, more insightful, more articulate) reviews than others. In addition, some critics and/or publications typically have more prestige and respect in their industry than others. To reflect these factors, we have assigned weights to each publication (and, in the case of movies and television, to individual critics as well), thus making some publications count more in the METASCORE calculations than others.

    In addition, for our movie and music sections, all of the weighted averages are normalized before generating the METASCORE.

    I mentioned this last week, but weirder than the fact that they do it is the fact that they keep their weightings secret. It’s not like this is the Google spam-detection algorithm where if you tell people what you’re doing the spammers will immediately know how to bypass it. You can’t game this system, and really, I doubt any review outlets care enough to try to manipulate what another site says about how good a videogame is. The natural suspicion is that they’re just creating a cover of vagueness that gives them room to manipulate scores, but how exactly do you profit off that? Take bribes? It’s a hell of a conspiracy if they’ve been taking bribes for a decade and not a word has leaked. Besides, if they really were in the business of taking bribes for score, you’d think that the publishers would know better than to write contracts tying massive studio bonuses to their game’s METASCORE.

    My preferred explanation is that the weightings exist for dumb bureaucratic reasons and they’re kept secret to make them seem fancier. Some Metacritic employee invented them to impress old media dinosaurs: either so they could throw phrases like “proprietary algorithm” into their sales pitch, or because he wanted a promotion, so he built a useless fiddly system to look like he was Getting Work Done.

  18. Ninety-Three says:

    Hopefully Don Data will have more data for us in the future.

    Could you be a little less cute about the provenance of this data? It sounds like you might be implying you’re going to put up an even more detailed article and dataset next week, in which case it wouldn’t be a good use of time to look too closely at this week’s incomplete data. Is this something you went out and gathered yourself, some curiosity you found on the internet, or do you actually know a guy who’s handing you this stuff?

  19. Metacritic must die says:

    It’s the second one. Metacritic weights the scores. Now, that’s not always statistically wrong, since some outlets are more influential and representative than others, who may have proven themselves to be outliers in the past. What IS statistically wrong is to not be transparent about that algorithm and to not provide weighting-free scores alongside it. This is poor data presentation, and the scores are entirely unreliable as a result. The weighting could be completely irrational, and we wouldn’t know, because it’s a black box. Metacritic is 100% garbage, from a statistics standpoint.

    And additionally, averaging subjective scores which are meant to reflect opinions is poor data reduction. If one guy says it’s an 8/10, and someone else says it’s a 2/10, the average score of 5/10 is not likely to be my opinion, and it completely obfuscates why those scores were given: one of these guys thought the game was the greatest, another thought it was garbage, and the site is telling me it is mediocre, which is unlikely to be the real answer.

    Add to that the way critics use only a fraction of their scale and the reticence towards giving perfect scores, and you have a system that is not set up on a linear value given according to a standard that can be averaged; you are putting in nonsense and getting nonsense out.

    I know it gets a lot of hate, but Rotten Tomatoes actually has the best system. There is no weighting; the Tomatometer score shows what fraction of polled reviews were OVERALL positive out of the whole. That’s the only way to handle messy data like that: be broad, and instead of speaking in qualitative terms, speak in quantitative terms about the population. So if something gets 80% on the Tomatometer, there’s an 80% chance that any one of the polled reviewers would like it, and that’s a proper sampling method; if you could get enough samples, you could confidently say what portion of the population will enjoy something.

    Does it tell you if it’s good, or how good? No. No aggregate number can do that. It can only tell you what proportion of people liked it, and measure popular appeal.

    Of course, it’s not a random sampling either. Their critic scores are based on what they regard as trusted publications, which limits it to critics, and the audience score is so easily gamed it ought to be removed. But at least it’s not based on a fundamentally bad understanding of maths. Averaging opinion scores is like someone saying “Let’s take the average of an apple and an orange”: completely meaningless gibberish.
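    For what it’s worth, a Tomatometer-style number is trivial to compute once each review has been reduced to a yes/no verdict. A throwaway sketch in R (the verdicts are made up, not real reviews):

    # Each review reduced to a single verdict: was it overall positive?
    verdicts <- c(TRUE, TRUE, FALSE, TRUE, TRUE)  # placeholder data

    mean(verdicts)  # 0.8, i.e. "80% of polled critics liked it"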

  20. Noah says:

    For what it’s worth, I spent a while staying one generation behind on consoles, picking up a new (to me) one after it was discontinued/superseded and buying cheap used games. When I got a PS2, GBA, or DS, one of the things I’d do is find sites that rated all the games for that platform and see what was at the top of the heap, as a guide for what games I should be looking for at The Exchange or wherever. When I finally get a Switch next month for Animal Crossing, I’ll probably do something similar on Metacritic, since outside of the big-name first-party games and some indies like Steamworld Quest, I don’t really know what the gems are that I should be putting on my wishlist. I recognize that this will probably happen very rarely with PC games, however – if someone gets a new gaming PC they probably want whatever the new hotness is to show off its capacity, and otherwise they’re probably just getting cheap games from Steam sales and bundles.
