With the Swedish leagues slowly coming to an end, I wanted to create a method for making a qualified, statistically based projection of how football leagues will end up. As the name of this blog now implies, I have approached the problem with machine learning, and the results look really promising!
Living in Östersund and following ÖFK, I find the question of projecting the final table positions extra interesting right now, with ÖFK in what seems to be a more or less secure spot for promotion. The only twist is that ÖFK still has to play both J-Södra, who lead the table, and third-placed Sirius.
The first thing I did to get a view of the probability of promotion to Allsvenskan was to simply look at past Superettan tables after 20 rounds and count how often a team with the same number of points as ÖFK went on to win promotion. The answer was 14 of 15 seasons. Still, this is of course an extremely primitive statistical approach, and it doesn’t consider the remaining opponents or any other basic statistics at all.
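The base-rate check described above can be sketched in a few lines of Python. This is only an illustration: the `promotion_rate` helper and the small `history` list are hypothetical (made-up numbers, not real Superettan tables), and it assumes that "the same number of points" is read as "at least as many points".

```python
def promotion_rate(history, points):
    """Share of past teams with at least `points` at the snapshot that were promoted.

    `history` is a list of (points_at_snapshot, was_promoted) pairs.
    Returns None when no past team reached that many points.
    """
    comparable = [promoted for pts, promoted in history if pts >= points]
    if not comparable:
        return None
    # True counts as 1 and False as 0, so the sum is the number of promotions.
    return sum(comparable) / len(comparable)


# Made-up example data, NOT real Superettan tables.
history = [(41, True), (39, True), (38, False), (30, False), (42, True)]
rate = promotion_rate(history, 38)
print(rate)
```

A count like this ignores everything except the points total, which is exactly the limitation noted above.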
I also looked for other work on this and didn’t find much that was useful. Most models I found are based on some kind of ranking of the teams, like the one Goal Impact uses, and that was not what I was looking for at all. Isn’t that something the table already implies? I wanted to look at table movements based on remaining opponents and team stats. I know about Michael Caley’s model, but his is more or less entirely based on expected goals, and I don’t have enough expected goals data to use in any predictive machine learning model. I only have it for this year and last year, and only for Superettan and Allsvenskan.
So, I looked at what I could use instead of expected goals to create a predictive model. The first thing that came to mind was the Everysport API. It is a quite basic API, not giving access to anything that can’t be found elsewhere for free. But it has a very convenient method for accessing the table standings at a certain round, which is perfect for this project. So my idea basically became to train a machine learning model on all the data I could get from the Everysport API for every team at a certain round, and to use the points each team went on to take as the value to predict. I went back approximately 10 years for some leagues, looking at nearly 1000 teams and how they finished based on their stats at a certain round.
To be able to use data from different leagues, with different total numbers of rounds, in the same model, I created the concept of a delta: the number of rounds remaining before the last round of the league. In these tests I have used a delta of 10. Of course, the theory is that the smaller the delta, the better the prediction should become. But more on that in coming posts.
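To make the delta concept concrete, here is a minimal sketch. The function names are hypothetical helpers for illustration, not part of the Everysport API, and the 3 points per win is an assumption about the scoring system.

```python
def snapshot_round(total_rounds: int, delta: int) -> int:
    """Round at which the table is sampled: `delta` rounds before the last one."""
    if not 0 < delta < total_rounds:
        raise ValueError("delta must be between 1 and total_rounds - 1")
    return total_rounds - delta


def max_remaining_points(delta: int, points_per_win: int = 3) -> int:
    """Most points a team can still take with `delta` rounds left (assumes 3 per win)."""
    return delta * points_per_win


def remaining_points(final_points: int, points_at_snapshot: int) -> int:
    """Value to predict: points taken between the snapshot and the final table."""
    return final_points - points_at_snapshot


# A 30-round league and a 26-round league both map to "10 rounds left".
print(snapshot_round(30, 10), snapshot_round(26, 10))
print(max_remaining_points(10))
```

Sampling every league at the same number of remaining rounds is what lets leagues of different lengths share one model.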
To visualize the results I have plotted the predicted number of points on the x-axis and the actual resulting points on the y-axis. In the first iteration I looked only at the current league position of each team and what resulting points that gave. This is what happened when I plotted it:
Not the best fit ever seen. The spread of resulting points around the prediction was really large. Bear in mind that 30 points is the theoretical maximum (10 remaining rounds at 3 points per win), so the correlation between the league position and how many points a team will take really isn’t as strong as one might think. It is definitely not as strong as I thought it would be. This is a model based on more than 30 different leagues, so I’d say that it is robust and that this is an interesting point. Now, for the next iteration I added stats for the current team. The stats available from the Everysport API are current league position, wins, draws, losses, goals scored and goals conceded. I added all of them, and this is what happened then.
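As a rough sketch of what one training row could look like with those six stats: the `TeamSnapshot` class and `feature_vector` function below are hypothetical names chosen for illustration, not the structure returned by the Everysport API, and the example numbers are made up.

```python
from dataclasses import dataclass, astuple


@dataclass
class TeamSnapshot:
    """One team's table row at the snapshot round (hypothetical structure)."""
    position: int
    wins: int
    draws: int
    losses: int
    goals_scored: int
    goals_conceded: int

    @property
    def points(self) -> int:
        # Assumes 3 points for a win and 1 for a draw.
        return 3 * self.wins + self.draws


def feature_vector(team: TeamSnapshot) -> list:
    """Flatten the six stats into the numeric list a model would train on."""
    return list(astuple(team))


# Made-up example row, not real data.
team = TeamSnapshot(position=2, wins=12, draws=4, losses=4,
                    goals_scored=35, goals_conceded=20)
print(feature_vector(team), team.points)
```

The first iteration described above would correspond to using only the `position` field.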
The r² for a linear regression on the plot increased to 0.33, but what’s more interesting is the actual equation of the regression: it is more or less an ideal straight line. The problem, of course, is still the quite large spread of resulting points, which makes the model a bit hard to interpret. The last thing I was able to use the Everysport API for was adding the same stats for all of a team’s remaining opponents. The idea behind it is to see whether the strength of the remaining opponents affects the predicted points. That is interesting for me, since, as I wrote at the start, ÖFK still has to face two of the best teams in the league.
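For readers who want to reproduce the regression check on their own predictions, here is a minimal sketch of an ordinary least-squares line fit with its r², written in plain Python. The `fit_line` name and the example numbers are made up for illustration; this is not the code behind the plots. An "ideal" line here is taken to mean a slope near 1 and an intercept near 0, so that a prediction of 10 points corresponds to 10 actual points on average.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept, plus r squared.

    Requires at least two distinct x values and two distinct y values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, r squared equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up predicted vs. actual points, for illustration only.
predicted = [8, 10, 12, 14, 16]
actual = [6, 13, 10, 15, 14]
slope, intercept, r_squared = fit_line(predicted, actual)
print(slope, intercept, r_squared)
```

A low r² with a slope close to 1 matches the situation described above: the predictions are right on average, but individual teams scatter widely around the line.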
The r² increased even more! This post is long as it is, but to sum this first part up I can show a visualization of the model where I have decreased the delta to 5 and replaced the markers with a box plot for each predicted points value, summarizing the spread of the actual results. The plot is explained here. How to read it in comparison to a probability density function is explained on Wikipedia.
This is when it gets really interesting and actually interpretable, as I see it. Let’s say that a team with 5 rounds to go gets a prediction of taking 10 more points. Then the probability of them taking between 8 and 12 points (Q1 to Q3 in the plot) would be roughly 50%. It is also statistically more or less impossible that the team would take fewer than 4 points (the minimum). Knowing this makes it possible to project the resulting table, which I will do in the following posts. But first, in the next post, I will show what happens when I add more data to the model using my own football-data library.
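The numbers a box plot shows for one predicted value can be computed with Python's standard library. This is a small sketch under stated assumptions: `five_number_summary` is a hypothetical helper, and the `actual_points` list is made up to mirror the example above (it is not output from the model).

```python
import statistics


def five_number_summary(outcomes):
    """Min, quartiles and max of the actual points behind one box in the plot."""
    data = sorted(outcomes)
    # n=4 returns the three cut points Q1, median and Q3.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Made-up actual results for teams that were all predicted to take 10 points.
actual_points = [4, 6, 8, 8, 9, 10, 10, 11, 12, 12, 14]
summary = five_number_summary(actual_points)
print(summary)
```

By construction about half of the observed results fall between Q1 and Q3, which is where the 50% reading comes from; how close it is to exactly 50% depends on ties and on how many teams sit behind each box.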
Delivered by Everysport