I like the discussion; thanks for getting the ball rolling here, John. There are a lot of good ideas here. I'd like to comment on a few of them, but I'll split this up into several posts to avoid a ridiculously long post. I'll start with the issue where the solution seems most clear-cut to me: the rating behavior that Liam described.

In Liam's comments on ratings, he pointed out that scoring by game produces very different results from scoring by round. He is absolutely right, but I think this is relatively easy to remedy. Our rating system is an Elo rating system. An Elo system works by forming an expected outcome for a contest based on the ratings (often based on just the difference between the competitors' ratings, as in our system - I think) and then adjusting ratings afterward based on the difference between the actual outcome and the expected outcome: if a player does better than expected, his rating rises; if he does worse, it falls. The expected outcome should be the average outcome if the players (or, rather, players with the given ratings) were to play many times. Because the Elo system corrects itself in this way, ratings tend to move toward a sort of equilibrium where the expected outcome from the rating formula roughly equals the hypothetical average outcome (of course, ratings will always bounce around somewhat rather than ever really reaching equilibrium). Since the hypothetical average outcome differs under different scoring methods, our rating formula should compute a different expected outcome depending on the particular scoring method.
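To make the expect-then-correct loop concrete, here is a minimal sketch of a standard Elo update. The constants (the 400 scale and K = 32) are the classic chess values, not necessarily what our system uses, and the function names are my own:

```python
def expected_score(rating_a, rating_b):
    """Expected outcome for player A, from the rating difference only."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def updated_rating(rating, expected, actual, k=32):
    """Move the rating toward the actual result; K controls the step size."""
    return rating + k * (actual - expected)

# A 100-point favorite is expected to score about 0.64.
e = expected_score(1600, 1500)
print(round(e, 2))

# Doing better than expected raises the rating; doing worse lowers it.
print(updated_rating(1600, e, 1.0) > 1600)  # after a win
print(updated_rating(1600, e, 0.0) < 1600)  # after a loss
```

The self-correction the paragraph describes falls out of the update rule: if the expected outcome baked into the formula is too low for a player, his actual results will keep exceeding it and his rating will keep climbing until expectation and reality agree.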

Here's an entirely made-up example, and I hope Alex doesn't mind that I'm using him in it. Let's say that when I play against Alex M. I have a 25% chance of getting 1 draw and 1 loss, while I have a 75% chance of losing both games. If the event is scored by game, this means I have a 25% chance of getting 1 point (yay!). My hypothetical average outcome would be 0.25 pts for each round that I played against Alex. If we were to play many, many times with game scoring, I would expect the difference in our ratings to end up close to the amount that would yield an expected outcome (for me) of 0.25 pts per round. On the other hand, assuming the same outcome probabilities as before, I would always end up with zero points in contests that were scored by round. My hypothetical average outcome would be zero, and, if we played many times, our ratings would diverge until the rating formula gave me an expected outcome of zero.
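The arithmetic in the made-up example can be checked in a few lines. The probabilities are invented for illustration, and I'm assuming the 2/1/0 (win/draw/loss) per-game convention implied by "1 draw and 1 loss = 1 point", with a lost round worth 0 under round scoring:

```python
# Each possible round outcome against Alex: (probability, my game points, my round points)
outcomes = [
    (0.25, 1, 0),  # one draw + one loss: 1 game point, but the round is lost
    (0.75, 0, 0),  # two losses: nothing either way
]

avg_game_points = sum(p * game_pts for p, game_pts, _ in outcomes)
avg_round_points = sum(p * round_pts for p, _, round_pts in outcomes)

print(avg_game_points)   # 0.25 per round under game scoring
print(avg_round_points)  # 0.0 under round scoring
```

Same games, same probabilities, but the hypothetical average outcome the rating formula needs to target is 0.25 in one case and exactly zero in the other.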

If we ignore the scoring system when computing the expected outcome in our rating formula, we will see the behavior that Liam described: the range of ratings will be compressed by events that are scored by game and expanded by events that are scored by round.

The solution to this undesirable rating behavior is to consider the scoring system when computing the expected outcome. In fact, other factors could be considered as well: for instance, whether the event is GAYP, 3-move, or 11-man ballot, and whether the "tough deck" is used.

We have round-by-round tournament results going back ages (yes?), so I think we have plenty of data to come up with a reasonable estimate of the impact of game scoring vs. round scoring in 3-move, as well as the impact of GAYP vs. 3-move. I'm sure the 11-man ballot data set is much smaller, so we may not be able to meaningfully estimate how the expected outcome for 11-man ballot should differ from the expected outcome for 3-move.