One of the breakout stars of the 2010 men’s football World Cup was undoubtedly Paul the Octopus. From his base in a Sea Life Centre in Germany, this smarter-than-average cephalopod correctly predicted the results of all seven of Germany’s matches, and correctly chose Spain as the overall winner.
Unfortunately Paul is no longer with us, but his legacy lives on in people’s attempts to predict the results of football tournaments. At the Turing, with the 2022 World Cup rapidly approaching, we decided to find out whether an algorithm can predict this year’s winner.
How does our algorithm work?
Our algorithm (or, more specifically, statistical model) is based on AIrsenal, which we developed in 2018 for playing Fantasy Premier League. This in turn is based on a 1997 model by Dixon and Coles (something of a classic in football prediction circles) that has parameters for a team’s attack strength, defence strength and home advantage, and uses Bayesian statistics to calculate the most likely scoreline for a given match.
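To make the idea of attack, defence and home advantage parameters concrete, here is a minimal sketch of how such parameters can be turned into scoreline probabilities. It assumes independent Poisson goal counts with log-linear rates, and it leaves out both the low-score correction from the original Dixon and Coles paper and the Bayesian fitting step. The function names and all parameter values are hypothetical and are not taken from AIrsenal.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals when goals follow a Poisson distribution."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def scoreline_probs(attack_home, defence_home, attack_away, defence_away,
                    home_advantage, max_goals=10):
    """Return {(home_goals, away_goals): probability} for a single match.

    Assumed convention: a higher defence value means a stronger defence,
    so it lowers the opponent's expected goals.
    """
    rate_home = math.exp(attack_home - defence_away + home_advantage)
    rate_away = math.exp(attack_away - defence_home)
    return {
        (h, a): poisson_pmf(h, rate_home) * poisson_pmf(a, rate_away)
        for h in range(max_goals + 1)
        for a in range(max_goals + 1)
    }


# Hypothetical parameter values, purely for illustration.
probs = scoreline_probs(attack_home=0.3, defence_home=0.1,
                        attack_away=0.0, defence_away=-0.1,
                        home_advantage=0.2)
most_likely = max(probs, key=probs.get)
print(most_likely, round(probs[most_likely], 3))
```

In a fitted model the parameter values would be inferred from past results rather than typed in by hand.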
We modified AIrsenal to make it more suited to predicting international results. For example, international teams most often play teams from the same continent (Brazil hasn’t played any European teams since 2019!), which can result in biases when trying to predict results between teams from different continents. To address this, we introduced model parameters for the relative strengths of the different continental federations. We also needed to tweak our model to account for the fact that home advantage doesn’t apply in international tournaments, unless the host nation is playing.
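One way to picture these two modifications is as extra terms in a team's expected goals. The sketch below is an assumption about how such terms could enter a log-linear rate, not a description of the actual model code: `expected_goals`, `conf_strength` and the other names are hypothetical.

```python
import math


def expected_goals(attack, opp_defence, conf_strength, opp_conf_strength,
                   is_host, home_advantage):
    """Expected goals for one team in one match (illustrative only).

    The confederation terms shift the rate according to the relative
    strength of the two teams' continental federations. Home advantage
    is added only when the team is the tournament host.
    """
    log_rate = attack - opp_defence + (conf_strength - opp_conf_strength)
    if is_host:
        log_rate += home_advantage
    return math.exp(log_rate)


# Hypothetical values: the same team as a neutral side and as the host.
neutral = expected_goals(0.2, 0.1, 0.3, 0.1, is_host=False, home_advantage=0.25)
hosting = expected_goals(0.2, 0.1, 0.3, 0.1, is_host=True, home_advantage=0.25)
print(round(neutral, 3), round(hosting, 3))
```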
What training data did we use?
To predict the winner of the 2022 World Cup, we first needed to train our model with past data. Fortunately, GitHub user martj42 has compiled a comprehensive database of every international football match since 1872! After experimenting with our model’s success at predicting the results of the 2014 and 2018 World Cups, we decided to use all international results from the 2002 World Cup onwards.
In our training data, we give the most weight to World Cup matches, followed in decreasing order by continental tournaments, qualifiers and friendlies. We also give more weight to recent matches, and we feed the official FIFA rankings into our model to provide an up-to-date estimate of team performance.
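A weighting scheme of this kind can be sketched as a competition weight multiplied by a recency factor. The specific numbers, the exponential decay and the four-year half-life below are all assumptions chosen for illustration. They are not the values used in our model.

```python
from datetime import date

# Hypothetical competition weights, ordered as described in the text.
COMPETITION_WEIGHT = {
    "world_cup": 1.0,
    "continental": 0.75,
    "qualifier": 0.5,
    "friendly": 0.25,
}


def match_weight(competition, match_date, reference_date, half_life_years=4.0):
    """Weight for one training match: competition weight times recency.

    Recency halves every `half_life_years` (an assumed decay shape).
    """
    age_years = (reference_date - match_date).days / 365.25
    recency = 0.5 ** (age_years / half_life_years)
    return COMPETITION_WEIGHT[competition] * recency


today = date(2022, 11, 1)
print(match_weight("world_cup", date(2018, 7, 15), today))
print(match_weight("friendly", date(2022, 9, 27), today))
```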
What data aren’t we using?
Unfortunately, we can’t take account of every factor that might determine this year’s winner.
- Players: many World Cups are remembered for outstanding performances from individual players – think Maradona in 1986, Zidane in 1998, Ronaldo (the Brazilian one) in 2002. This year’s tournament is sure to be set alight by a footballing superstar. However, we’re not looking at any of that. Predicting the line-ups of international teams, who play together a few times a year, is far more challenging than doing so for Premier League sides that play week in, week out.
- Penalties: England fans will be painfully aware that not all teams seem to be equally adept at penalty shootouts! However, rather than gather historical data on penalty shootout success, we have taken the simpler approach of giving each team a 50% chance of progressing whenever a knockout match ends in a draw.
- Location / weather / anything else: although the past four tournaments have been won by European teams, up until 2010 all World Cups held outside Europe had been won by South American sides. So, will the heat of Qatar favour Brazil and Argentina, or will the Europeans take advantage of the relatively small time zone difference? Or will a combination of these factors help bring about the first African winner? All these questions are hard to answer, and even harder to model.
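The penalty shootout simplification described above is easy to express in code. This is a minimal sketch with a hypothetical function name. It takes a simulated scoreline and decides who goes through, falling back to a coin flip when the scores are level.

```python
import random


def knockout_winner(team_a, team_b, goals_a, goals_b, rng):
    """Return the team that progresses from a knockout match.

    A drawn match is settled by a fair coin flip, standing in for
    extra time and penalties.
    """
    if goals_a > goals_b:
        return team_a
    if goals_b > goals_a:
        return team_b
    return team_a if rng.random() < 0.5 else team_b


rng = random.Random(2022)
print(knockout_winner("England", "Wales", 1, 1, rng))
```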
What does our model predict?
If we run through the 2022 World Cup’s entire fixture list 100,000 times, we can see which teams win the tournament most often:
Clearly, Brazil are heavy favourites, with around a 25% chance of winning, while Belgium and Argentina are also highly rated. England are the fifth most likely team to win.
We can also see how far England and Wales are likely to progress:
The model gives England around an 80% chance of getting out of the group stage, just under a 60% chance of getting to the quarter-finals or further, but only around a 7% chance of winning the whole thing. Wales, on the other hand, have around a 50% chance of progressing beyond the group stage, a 2% chance of getting to the final, and just a 0.5% chance of winning the tournament.
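The probabilities above come from playing out the tournament many times and counting outcomes. The toy example below shows that idea on a four-team knockout rather than the real 32-team format. The team names, the single `STRENGTH` value per team and the Poisson goal rates are all hypothetical stand-ins for a fitted model.

```python
import math
import random
from collections import Counter

# Hypothetical team strengths on a log scale (not fitted values).
STRENGTH = {"A": 0.4, "B": 0.2, "C": 0.0, "D": -0.2}


def sample_poisson(rate, rng):
    """Draw one Poisson-distributed goal count using Knuth's method."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def play(team_a, team_b, rng):
    """Simulate one knockout match; a draw is settled by a coin flip."""
    goals_a = sample_poisson(math.exp(STRENGTH[team_a] - STRENGTH[team_b]), rng)
    goals_b = sample_poisson(math.exp(STRENGTH[team_b] - STRENGTH[team_a]), rng)
    if goals_a != goals_b:
        return team_a if goals_a > goals_b else team_b
    return team_a if rng.random() < 0.5 else team_b


def simulate_tournament(rng):
    """Two semi-finals and a final; returns the champion."""
    finalist_1 = play("A", "D", rng)
    finalist_2 = play("B", "C", rng)
    return play(finalist_1, finalist_2, rng)


def win_probabilities(n_sims, seed=0):
    """Fraction of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in STRENGTH}


print(win_probabilities(20_000))
```

Tracking how far each team gets in every simulated run, rather than only the champion, gives stage-by-stage probabilities like those quoted for England and Wales.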
Should you trust our model?
We certainly wouldn’t recommend betting on any of our predictions! Although the original Dixon and Coles model was partly motivated by a desire to beat bookmakers’ odds, that was 25 years ago, and bookies nowadays have teams of data scientists who work full-time on sophisticated models that take into account all the factors we are neglecting. For us, this is a fun side-project, and our goal was to create a simple, understandable, robust and open-source model, which anyone is free to use, improve and contribute to. We welcome any questions and feedback.
Ultimately, no matter how good your model, football is quite a random game. This is one reason why we love it, and it’s also why fans of England and Wales can continue to dream that their team will defy the odds. See you on the other side!
Top image: A. Ricardo / Shutterstock