The follow-up to the previous post would naturally be how to adjust it for 2.5 tournaments, but in starting to put that together, I went off and checked some historical data for 2.5 and 2.0 tournaments, and the results surprised me enough that they're worth a separate post exploring the details.
A Background on Swiss Formats
There's a bit of background knowledge required to have this conversation: what Swiss tournaments are, and how and why they work the way they do. In an ideal world, tournaments would not be Swiss at all; they would instead use another format, Round Robin. The Round Robin process is very simple: every player plays every other player in the event. If you have 4 players, it takes a total of 3 rounds to accomplish this. For 8 players, it takes 7 rounds. For 128 players, 127 rounds… and you can see how this becomes infeasible very quickly. So, what's the best alternative?
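To make that round math concrete, here's a minimal Python sketch of a Round Robin scheduler using the classic circle method. The function name and player labels are just for illustration, not anything from an actual tournament tool:

```python
from typing import List, Tuple

def round_robin_schedule(players: List[str]) -> List[List[Tuple[str, str]]]:
    """Build a full Round Robin schedule with the classic circle method."""
    field = list(players)
    if len(field) % 2 == 1:
        field.append("BYE")  # an odd field needs a bye each round
    n = len(field)
    rounds = []
    for _ in range(n - 1):
        # pair the first seat with the last, second with second-to-last, etc.
        pairings = [(field[i], field[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairings if "BYE" not in p])
        # hold the first player in place and rotate everyone else one seat
        field = [field[0], field[-1]] + field[1:-1]
    return rounds

# An even field of N players needs exactly N - 1 rounds:
print(len(round_robin_schedule([f"P{i}" for i in range(4)])))    # 3
print(len(round_robin_schedule([f"P{i}" for i in range(8)])))    # 7
print(len(round_robin_schedule([f"P{i}" for i in range(128)])))  # 127
```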
What problem are we trying to solve?
There are a few basic goals of tournaments and formats which we can and will expand on in future posts, but for this discussion we're primarily concerned with the purely competitive sense. The goal of the tournament is to give you a ranking of the players in the field, from best to worst. What does that mean exactly? Let's say you had one hundred players at an event, and an omniscient entity descended and gave you a perfect list of those players ranked from best to worst. The goal of our tournament is to create a list that matches that known correct list. But really, there are two problems to solve here:
Rank players in order of best to worst
In the smallest amount of time possible
Naturally, with unlimited time, Round Robin trivially solves problem 1. Yes, there are some cases where Player A beat Player B, Player B beat Player C, who in turn beat Player A, but these are only relevant when their records are also tied, which in games of skill tends to be relatively rare.
Enter Swiss: Tournament Standings on a time budget
Naturally, we don't have time to play 100 rounds of X-Wing at a tournament! Instead we use Swiss to quickly determine player ranks. Swiss tournaments are also simple and straightforward. Every player starts with no points, and players are randomly paired against each other. Players are awarded a certain number of points for wins and draws (let's just use 3 and 1, as that's what X-Wing uses), and 0 for a loss. The next round, players are paired against other players with the same number of points, and the process repeats each round. How many rounds you can or should play is generally a function of how much time you have; the more rounds you play, the more accurate your results, until you've effectively just run a Round Robin tournament. The tiebreaker between players is the Buchholz system, which in X-Wing is what people mean when they refer to Strength of Schedule. Effectively, it is a measure of how strong each of a player's opponents was over the course of the event.
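As a rough sketch of how that scoring and tiebreak works, here's some illustrative Python. The data format is invented for this example, and the Strength of Schedule shown is a simple Buchholz-style sum of opponents' points rather than X-Wing's exact formula:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WIN, DRAW = 3, 1  # X-Wing style scoring; a loss is worth 0

def swiss_standings(results: List[Tuple[str, str, str]]) -> List[Tuple[str, int, int]]:
    """Rank players by Swiss points, breaking ties with Strength of Schedule.

    `results` is a list of (player_a, player_b, winner) tuples, where
    winner is player_a, player_b, or "draw". The SoS here is a
    Buchholz-style sum of each opponent's final points; X-Wing's actual
    tiebreaker differs in detail, so treat this as a sketch.
    """
    points: Dict[str, int] = defaultdict(int)
    opponents: Dict[str, List[str]] = defaultdict(list)

    for a, b, winner in results:
        opponents[a].append(b)
        opponents[b].append(a)
        if winner == "draw":
            points[a] += DRAW
            points[b] += DRAW
        else:
            points[winner] += WIN

    def sos(player: str) -> int:
        return sum(points[o] for o in opponents[player])

    table = [(p, points[p], sos(p)) for p in points]
    # sort by points first, then break ties with Strength of Schedule
    return sorted(table, key=lambda row: (row[1], row[2]), reverse=True)
```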
A common adage you'll hear from X-Wing players (and I'm sure I've said it plenty of times as well) is that Swiss is not a good tool for ranking players in general; it's a tool for finding the single top player in a pool in the minimum amount of time. It's understandable how this has spread, because every round of Swiss roughly halves the number of undefeated players in the tournament, which quickly leaves you with one undefeated player. But in reality, despite only having Strength of Schedule as a tiebreaker, skill-based games can accurately rank players by skill in a small number of rounds. The Efficacy of Tournament Designs handles this in much more detail than I'm able or willing to here, but a very quick summary is that the more the outcome of an individual game depends on skill, the fewer Swiss rounds you need and the more accurate those standings will be. For example, a chess tournament will approach the true skill standings significantly faster than a poker tournament.
How does all of this relate to X-Wing?
The next problem to solve, after receiving your god-given list of players from best to worst, is how to choose a winner of the event. That generally requires some sort of elimination bracket, which leads to questions about how to seed that bracket. One major goal of elimination tournaments is to try to have the highest-skill players meet in the finals, which is a surprisingly difficult problem to solve, especially when multiple qualifiers are involved. That is the subject of the next post!
But let's consider the basic problem: say you were given that list of players, ranked by skill, by some all-knowing deity, and wanted to predict who would win games in the cut in order to generate the best tournament bracket. How accurately are you able to predict the winner of a game? That's really asking: how confident are you that your list of players is actually ranked best to worst?
With some help, I went and dug through all of the recorded cut games of 1.0, 2.0, and 2.5 (post-May update) from ListJuggler and ListFortress. This is on the scale of a few thousand cut games for 1.0 and 2.0, and roughly 500 for 2.5.
Two Simple Questions
In cut games, how often does a player with more points from Swiss win against a player with fewer points in Swiss?
In cut games where each player has the same number of Swiss wins, how often will the player with better Strength of Schedule win?
The first question is effectively asking how well a Swiss tournament functions at a base level. The more skill affects an individual game, the higher the percentage of the time we'd expect the player with more Swiss wins to win a game in the cut. As an example, in an 8-person cut where the top player was undefeated, how often does that top undefeated player win their first game against the 8th place seed?
The second question is really dependent on the first, and is asking whether enough Swiss rounds are played for any tiebreaker to be meaningful. An example here would be the same 8-person cut, where the 4th and 5th seeds are tied with the same number of Swiss points: how often does the 4 seed beat the 5 seed?
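For anyone who wants to check these numbers themselves, the two questions reduce to a couple of filters and counts over the cut-game records. Here's a hedged sketch in Python, with an invented record format rather than the actual ListFortress/ListJuggler export schema:

```python
from typing import List, NamedTuple

class CutGame(NamedTuple):
    """One elimination-round game, with each player's Swiss results attached."""
    winner_swiss_points: int
    winner_sos: float
    loser_swiss_points: int
    loser_sos: float

def question_one(games: List[CutGame]) -> float:
    """How often does the player with more Swiss points win their cut game?"""
    decided = [g for g in games if g.winner_swiss_points != g.loser_swiss_points]
    higher_won = sum(g.winner_swiss_points > g.loser_swiss_points for g in decided)
    return higher_won / len(decided)

def question_two(games: List[CutGame]) -> float:
    """Among cut games between players tied on Swiss points, how often does
    the player with the better Strength of Schedule win?"""
    tied = [g for g in games
            if g.winner_swiss_points == g.loser_swiss_points
            and g.winner_sos != g.loser_sos]
    better_sos_won = sum(g.winner_sos > g.loser_sos for g in tied)
    return better_sos_won / len(tied)
```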
First Edition
In cut games, the player with the higher Swiss score wins their game against the lower-scoring player 50.6% of the time. Or to rephrase, the winner of a cut game effectively could not be predicted in any way by prior Swiss results.
In cut games where each player had the same Swiss score, the player with the higher Strength of Schedule won 47.7% of the time. It's not surprising that this is effectively a coinflip as well: Strength of Schedule requires accurate Swiss standings to be meaningful, and since Swiss score itself isn't a useful predictor of success here, we would not expect any further tiebreaker to be anything more than noise.
These results surprised me. What this is effectively saying is that skill played only a minor part in which player won a cut game of 1.0 X-Wing. That was the general attitude at the time, but I would still have expected some correlation between Swiss standing and cut success. In First Edition, we often heard statements along the lines of "players just need to make the cut consistently to be entered into the matchup lottery that decides the event," but just how true that mentality was is surprising. Of course there is skill in bringing the right list to get those matchups, but that is not skill that can be measured by a tournament system like Swiss.
Second Edition
In cut games, the player with the higher Swiss score wins their game against the lower-scoring player 58% of the time.
In cut games where each player had the same Swiss score, the player with the higher Strength of Schedule won 51% of the time. Again, as the tiebreaker, this is capped by the accuracy of the Swiss scores, so a coinflip here is not surprising.
These results surprised me much more than the 1.0 results. Yes, bids and matchup variance were also huge in 2.0, but with the much more level playing field I would have expected skill to be a larger component in determining the winner. It's important to state that this is not saying that skill is not a large component in determining the winner of a given game of X-Wing; in these cases we are only looking at players who made the cut. If we had data to answer questions about what would happen when the absolute top seed of a tournament plays the absolute bottom seed, I'd love to poke through it as well, but unfortunately those games do not happen outside of Swiss, and we can't contaminate our dataset by trying to make predictions about games that are already in it.
Rather, I think the message here is that 2.0 X-Wing largely accomplished its goal of reducing matchup variance without changing the game at a fundamental level. Yes, you still can't predict the winner of individual games from past results with particularly high accuracy, but it's still a notable improvement over 1.0. Yes, matchups still existed, but to use an extreme example, there wasn't a Kylo Ren crew on a Decimator automatically winning the game against lists with only two ships.
2.69420 (2.5, post-May Updates)
In cut games, the player with the higher Swiss score wins their game against the lower-scoring player 72% of the time.
In cut games where each player had the same Swiss score, the player with the higher Strength of Schedule won 58% of the time.
One of the common complaints about 2.5 from some tournament players is that it is effectively a different game, and these results confirm there is some truth to that. It makes intuitive sense that removing bids and restructuring the game around scenarios would increase the impact of player skill on individual game outcomes, but that didn't guarantee it would be successful. It could also have been the case that matchup and scenario variance would dominate player skill, and again you would not be able to make useful predictions from Swiss standings. But since we can now predict the winner of a given cut game from Swiss results with reasonable accuracy, this indicates that the correlation between player skill and individual game outcome is high.
This is great news for tournament design, as it makes it much more likely that you're able to create a bracket where the best players meet at the end. I'm also curious what a good elimination format would have looked like for 1.0 and 2.0, given that we now know seeded brackets were effectively meaningless there. But that is spoilers for the next update!
Further Research and Questions
That's as much time as I'd like to put into exploring these questions, as we have the answers we need for bracket generation, but I am curious what the correlations look like for 2.5 cut games between players with tied scores. Is it the case that the larger the difference in Strength of Schedule between two players, the more likely the player with the higher Strength of Schedule wins?
Also, I am more than happy to provide any of the datasets used here for validation or for other questions about this data. As of the time of posting, all of the data in ListFortress is most easily accessed via the API explained here, though ListJuggler is permanently down. Please reach out if you're interested in the ListJuggler data and I will find a way to anonymize it and get it to you.
Credits
Alex Raubach, for providing the ListJuggler database to check 1.0 stats
Travis Johnson, for handling the 1.0 math as I truly despise Excel
More anonymous proofreaders
Thanks again!