Game Jam Rating Systems and Solutions

I was talking to a friend today about the problems that arise around game jam rating systems, specifically ways of encouraging developers to give meaningful ratings and feedback to a large number of other entries. I refer entirely to online game jams with many entries, where it is difficult or impossible for every developer to rate every other game.

Why does it matter if games don’t get enough ratings?

It can be discouraging to have spent a weekend developing a game you are really proud of, only to have a small number of people play it. In many game jams I’ve seen fantastic games with fewer than ten ratings, sometimes fewer than five, and this can really dampen the experience, since one of the most rewarding aspects of a game jam is having other people play and enjoy your game.

The other problem with a game jam where most games get very few ratings is that the fewer ratings there are, the more arbitrary the winners become. Different developers have very different rating methods. Some give out 5s sparingly, saving them for what they consider the best of the best. Others give out 3s, 4s and 5s in equal measure. This is fine when each game has been rated many times (twenty or more is a nice standard), but as the number of ratings drops you end up with more and more cases where a good game is given 5s by very lenient raters while a great game is given 4s by very strict raters. (I want to mention that I don’t think there is anything wrong with any method of rating, so long as the method is consistent.) This can lead to results which feel hollow or meaningless.

There is also a greater potential for malicious rating. Among other reasons, malicious ratings could occur due to personal reasons (Alice doesn’t like Bob and so gives Bob low ratings) or to game the system (Charlie gives everyone low ratings so that his game ranks higher). One rating out of many is drowned out, but one rating out of a few makes a great difference.

How can developers be encouraged to rate more games?

Different game jams have employed different techniques. Sometimes there is no encouragement. Sometimes there is a leaderboard of the top raters.

For Ludum Dare, a game’s visibility is boosted by the number of games the developer has rated (I’ll call this “involvement”) and dampened by the number of ratings it already has (I’ll call this “popularity”). This leads to very involved developers being very popular and very uninvolved developers being very unpopular. Rate thirty games and you can expect to be rated twenty times. Rate zero games and you can expect to be rated five times. The more you give to other developers, the more you get given. This system also partially solves the issue of web games getting a great deal more attention than downloadable games.
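Ludum Dare’s actual formula isn’t given here, but purely as an illustration of the boost-and-dampen idea, a “games to rate” queue could be sorted by a score like the following. The function name and formula are my own invention for this sketch, not Ludum Dare’s real algorithm:

```python
def queue_score(ratings_given, ratings_received):
    """Hypothetical sort key for a 'games to rate' queue: boost
    entries whose developer has rated many games (involvement) and
    dampen entries that already have many ratings (popularity).
    The +1 terms avoid division by zero; the formula itself is
    illustrative, not Ludum Dare's actual algorithm."""
    involvement = ratings_given + 1
    popularity = ratings_received + 1
    return involvement / popularity

# An involved developer's little-rated game surfaces ahead of an
# uninvolved developer's game with the same number of ratings:
entries = [("inactive_dev", 0, 5), ("active_dev", 30, 5)]
entries.sort(key=lambda e: queue_score(e[1], e[2]), reverse=True)
```

Any monotonic combination of the two quantities would produce the same “the more you give, the more you get” behaviour; the ratio is just the simplest choice.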

Personally, I think Ludum Dare’s rating system is fantastic and certainly the best rating system that I’ve seen but I think there may be additional things that can be done. I certainly don’t claim that the suggestions in the following sections are necessary or even good ideas. I would like to think of these suggestions as things to consider and discuss.

Additionally, I don’t want to sound like I don’t think developers can be trusted. On the contrary, I think that for the most part game jams produce outstandingly genuine and honest feedback and results. I simply like the idea of making such a system tamper proof. As a result, the discussion mostly focuses on the small number of developers who may not give meaningful ratings.

How do you encourage meaningful ratings?

By meaningful ratings I mean those where the rater played the entry for a reasonable length of time and then rated it thoughtfully. Opening a game and playing it for ten seconds before rating it is not a meaningful rating. Giving every game 3s is not a meaningful rating. Unfortunately, these sorts of meaningless ratings are likely to arise whenever there is an incentive to rate other games. The majority of developers will rate meaningfully, of course, but there will likely be a few developers who aren’t interested in playing other games or are too busy to do so, and as soon as they are required to rate a certain number of games in order for their game to be visible or eligible for rankings, meaningless ratings are introduced.

An idea I had to encourage meaningful ratings is to measure how close a developer’s ratings are to the average ratings. The following method applies not only to game jams, but to any system based on mass peer reviews. To better illustrate this, consider the following ratings:

Game Name        Mean Rating  Alice  Bob  Charlie  David
Rockety Rockets      2.9        3     4      3       5
Lazy Lasers          3.2        4     3      3       2
Heavy Hammer         4.6        5     4      3       1
Sharp Sword          1.3        1     3      3       5

Alice’s ratings are very close to the average, Bob’s ratings are reasonably close, Charlie’s ratings are all the same and David’s ratings are either random or purposefully inverted. This closeness can be measured by the correlation coefficient, ρ. Here are the correlation coefficient values for the previous example:

Name        ρ
Alice     0.9802
Bob       0.6396
Charlie   0
David    -0.8359

The closer ρ is to +1, the more accurately the developer’s ratings match the average ratings. A value of 0 means there is no correlation at all, meaning the ratings are effectively random (strictly speaking, Charlie’s constant ratings have zero variance, so ρ is undefined there; I treat that case as 0). A value of -1 means the developer’s ratings are perfectly opposite the average ratings. Looking at the first table, I would consider only Alice’s and Bob’s ratings to be meaningful, and the large, positive ρ values reflect this. Any meaningful rating should achieve a relatively high correlation coefficient, and the more games each person rates, the more accurately the ρ value measures meaningfulness.
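The ρ values in the table are ordinary Pearson correlation coefficients between a rater’s scores and the mean scores. A minimal sketch of the calculation, which reproduces the table (the zero-variance fallback for a constant rater like Charlie is a convention, as noted above):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two rating lists.
    Returns 0.0 when either list has zero variance (e.g. a rater
    who gives every game the same score), matching the convention
    used for Charlie in the table."""
    mx, my = mean(xs), mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / (sxx ** 0.5 * syy ** 0.5)

# Mean ratings for Rockety Rockets, Lazy Lasers, Heavy Hammer, Sharp Sword
mean_ratings = [2.9, 3.2, 4.6, 1.3]
raters = {
    "Alice":   [3, 4, 5, 1],
    "Bob":     [4, 3, 4, 3],
    "Charlie": [3, 3, 3, 3],
    "David":   [5, 2, 1, 5],
}

for name, ratings in raters.items():
    print(f"{name}: {pearson(mean_ratings, ratings):+.4f}")
```

In a real jam each developer rates a different subset of games, so ρ would be computed over just the games that developer rated, ideally against means that exclude that developer’s own ratings.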

I think this calculation could be used effectively to measure the quality of a developer’s ratings, or to check which ratings are meaningful and which are not. Of course, any such checks would have to be very generous and give each developer a large margin to account for differences in opinion and taste. The system certainly wouldn’t want to discourage people from rating games in a way that reflects how they truly feel.

There are different ways this information could be used, and I think they fall into two distinct methods: either reduce the influence that meaningless ratings have, or penalize games made by those developers who rate meaninglessly. Unfortunately both methods come across as very negative, and either one would have to be finely tuned so as to not negatively affect any developers who have genuinely rated games, while still having some of its intended effect. It would depend on a few factors, such as the number of games rated, but as a rough benchmark I believe that most meaningful ratings will have ρ values above 0.6, with a few having ρ values between 0.5 and 0.6. If these values are accurate, then a ρ value of less than 0.4 could be considered likely meaningless.
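To make the first method (reducing influence) concrete, here is one illustrative way it might work: map each rater’s ρ to a weight and average each game’s ratings with those weights. The 0.4 and 0.6 thresholds are the rough benchmarks above, and the linear ramp between them is my own choice, not a tuned design:

```python
def rating_weight(rho, low=0.4, high=0.6):
    """Map a rater's correlation coefficient to a weight in [0, 1]:
    ignore ratings from raters below `low`, count raters above
    `high` fully, and ramp up linearly in between. The thresholds
    are illustrative and would need tuning in practice."""
    if rho <= low:
        return 0.0
    if rho >= high:
        return 1.0
    return (rho - low) / (high - low)

def weighted_mean_rating(ratings, rater_rhos):
    """Average one game's ratings, weighting each rating by its
    rater's correlation coefficient. Falls back to a plain mean if
    every weight is zero, so no game is left unrated."""
    weights = [rating_weight(r) for r in rater_rhos]
    total = sum(weights)
    if total == 0:
        return sum(ratings) / len(ratings)
    return sum(w * x for w, x in zip(weights, ratings)) / total

# A 5 from a likely-meaningless rater is ignored, so this game's
# score is driven by the two consistent raters' 4s:
print(weighted_mean_rating([5, 4, 4], [0.2, 0.8, 0.7]))
```

The smooth ramp is what gives genuine raters a generous margin: a developer with slightly unusual taste loses a little influence rather than being cut off at a hard threshold.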

More thoughts to come.

There is a lot more I want to say on this topic but for the time being I will leave it here. I will add more details as I find the time to do so. Also note that there is a good chance I have articulated some points poorly and that I may go back and revise things at a later date. In the meantime, if you have any thoughts or questions on the topic please join in the discussion by leaving a comment below.

1 thought on “Game Jam Rating Systems and Solutions”

  1. I have thought extensively about the LD ratings system as well. I find it to be adequate at providing an approximation of a game’s quality, and that’s all I expect it to provide.

    Rating things that are inherently artistic creations is extremely difficult, and it’s a bit laughable that people think that these things can be quantified and ranked.

    Of course, it is easy to compare two creations and state a preference of one over the other. The “problem” is that getting people to agree on rankings is intractable. There will *always* be differences of opinion. People can argue endlessly trying to persuade one another that their opinion is the most factual, or that they are the most persuasive or have the best taste or insight into what makes a creation “good”.

    Trying to get everyone to agree is not really the point, though. It should be obvious to anyone that that’s never going to happen. (And even good judges can disagree, and both have legitimate reasons for rating a game as they do.) Indeed, one may argue that it is precisely the reason why each individual gets to rate games — so that they may have *their* ratings for the games.

    Imagine if the ratings were private, rather than aggregated and tallied — it wouldn’t matter then, at all, whether judges agreed, or if judges were good or bad judges. The ratings would serve the individual rater, and nothing more.

    And to some extent, the ratings serve (or could serve) that purpose. If I can go back to previous LD events and search on the games based on my subjective rankings that I have given to them, it helps me to organize the pool of games and find the ones I enjoyed playing if I ever want to go back and play one of them again. I don’t think the current web site makes this particularly easy, but it’s possible in principle to use the individual ratings this way.

    It seems to me that getting judges to be good/better judges is closer to the point. This can be achieved by having academic discussions about judging, about games, about what makes a game good, and getting people to think deeply about these things and to care. Having thoughts and discussions like this is essentially an attempt to achieve that.

    While elevating the culture of game criticism to a high level of academic rigor is achievable, and desirable, cultures will always have fringes and frontiers, and cultures are always composed in part of incoming neophytes who have to learn, as well as old-guard sticks-in-the-mud who are prone to being closed-minded about the radically new.

    But I think the real point is to get good [enough] ratings. The ratings don’t have to be perfect, just good enough. Good enough doesn’t necessarily mean that the individual judges need to do a good job; it just means that overall the judges need to do a good job on average. And by “good job” the judges don’t need to agree in terms of raw score; they just need to apply their own subjective criteria for judging games evenly across all the games they rate. As long as there are enough judges for the averages to even out, and as long as the judges each rate enough games, and as long as they apply their internal decision making consistently for those games they judge, the outcomes should be reasonable. Not perfect, but reasonable. And I don’t think perfect is practically achievable.
