So, I got some great feedback on my first blog post! Now I feel I have to explain the model in depth and run some validation. One thing first: I will try to never write long blog posts. Google Analytics tells me the average visitor reads my blog for 3 minutes, so every post I write should be readable in that time.
First, what is ExpG? I'll borrow a quote from 11tegen11 that sums it up:
ExpG stands for Expected Goals. It measures not how many goals a team has scored, but how many goals an average team would have scored with the amount and quality of shots created.
Each goal scoring attempt is assigned a number based on the chance that this attempt produces a goal. Typical parameters to use are shot location and shot type (shot vs header). Some models, including the one I use on 11tegen11, also use assist information to separate through-balls from crosses.
Teams that produce more ExpG than they concede have the best chances of winning football matches.
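Conceptually, that quote boils down to a sum of per-attempt probabilities. A minimal sketch, with probabilities invented purely for illustration:

```python
# ExpG is the sum of each attempt's estimated probability of producing a goal.
# These probabilities are made up for the example, not from any model.
shot_probabilities = [0.08, 0.31, 0.02, 0.78]  # long shot, close range, header, penalty

exp_g = sum(shot_probabilities)
print(round(exp_g, 2))  # 1.19
```

So a team can rack up a high ExpG either from many low-probability attempts or a few high-probability ones.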
Now, my approach to calculating ExpG was a pragmatic one, me being a Python programmer. I built a data set of every event recorded in Allsvenskan, Superettan and Div 1 over the last two seasons. The set contains:
- Coordinates – the position on the pitch from where the finish is made
- Is it a finish from a corner – yes/no
- Is it a finish from a free kick – yes/no
- Is it a penalty – yes/no
- Time of the event
- Current possession of the team creating the event (in 5-minute intervals)
That was all the data I could find!
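To make this concrete, the events can be laid out as one row per attempt. The column names and values below are my own invention, just to show the shape of the data:

```python
import numpy as np

# One row per goal-scoring attempt. Column order (my own naming):
# [xpos, ypos, from_corner, from_freekick, is_penalty, game_time, possession]
events = np.array([
    [88.0, 50.0, 0, 0, 0, 12.0, 0.55],  # open-play shot, central
    [94.5, 50.0, 0, 0, 1, 67.0, 0.48],  # penalty
    [80.0, 30.0, 1, 0, 0, 85.0, 0.40],  # finish after a corner
])
goals = np.array([0.0, 1.0, 0.0])  # 1 if the attempt became a goal

print(events.shape)  # (3, 7)
```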
Using the eminent Scikit-learn library I then trained a model on every event, excluding only the ones I wanted to calculate ExpG for, e.g. a team's ExpG this season.
And the first results were what you saw in my last blog post. As I suspected, I was not the first to use a machine learning approach to the ExpG problem. Probably among the first, but that doesn't really matter to me; I just wrote it to draw some attention. This blog post from Martin Eastwood was really interesting (written two days after mine 🙂 ) and made me want to run some verification right away, to see if I could come up with similar results using Superettan and Allsvenskan. My approach was somewhat similar, but I'm not sure I agree that classification, rather than regression, is the best approach. My philosophical view of a finish is that if you repeat the same finish ten times, or make just small modifications to it, you will not get the same result every time. I don't believe in the binary paradigm when it comes to football. That's why my first iteration used a linear Bayesian Ridge regression.
Said and done. I started off where I ended my last blog post, testing the same model, with the linear BayesianRidge regression, on some teams' entire 2015 seasons so far. I'll use Gefle as the example here (I'll use them a lot, they are a very average team 🙂 ). This is what the Bayesian Ridge looked like when plotted for the entire 2015 season so far:
Now, I have to be honest. Since I'm no mathematician, trial and error has always been the way I solve machine learning problems: testing different models until I find the one that fits best. And when I looked at the above result I instantly noticed that the BayesianRidge did not make a good approximation of the y-position. It seemed to always treat lower values as better. So I tried an ensemble method instead, a Gradient Boosting Regressor.
Now that made a lot more sense. The larger the circle, the larger the ExpG, i.e. the chance of a goal. And as Michael Caley has shown in this article, it is quite obvious that the location of the shot is of very large importance when it comes to converting a finish into a goal.
Still, one important thing was missing from the plot above: I had forgotten to indicate whether the finish came from a penalty or not. With that added, the plot looked like this:
A very large circle from the spot indicates the one penalty Gefle have had this season. I also added the ExpG to every circle on hover. The penalty had an ExpG of 0.78, which I think matches Martin Eastwood's model more or less spot on!
Now, you might wonder how many goals Gefle have scored so far this season? Well, it is 18.
One thing that is neat about Scikit-learn's Gradient Boosting Regressor is the feature_importances_ property. From the docs:
"""Return the feature importances (the higher, the more important the
So I just ran this on my model to see how the parameters I have are valued:
[ 0.34722513 0.26157164 0.02453219 0.0147424 0.05105901 0.19241527 …]
What does it mean?
- xpos is valued at 35%
- ypos at 26%
- finish from corner at 2%
- finish from free kick at 1%
- finish from penalty at 5%
- game time at 19%
- possession at 11%
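Pairing the names with the vector is a one-liner. The feature names are my own labels, and the numbers are the ones from the printed vector (the truncated possession value left out):

```python
# Feature names in the order I feed them to the model (my own labels).
names = ["xpos", "ypos", "from_corner", "from_freekick", "is_penalty", "game_time"]
# The importance vector as printed above; on a fitted model this is
# simply model.feature_importances_.
importances = [0.34722513, 0.26157164, 0.02453219, 0.0147424, 0.05105901, 0.19241527]

for name, value in sorted(zip(names, importances), key=lambda pair: -pair[1]):
    print(f"{name}: {value:.0%}")
# xpos: 35%
# ypos: 26%
# game_time: 19%
# is_penalty: 5%
# from_corner: 2%
# from_freekick: 1%
```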
So my conclusion is that the position of the finish is the most valuable factor for whether a finish will be converted into a goal or not. But the time of the game, as well as the possession of the team creating the chance, also affects the outcome. Probably penalties too, but that will have to be another blog post.
So, to sum this up, I made the same kind of plot that Martin Eastwood made, with the Expected Goals as y-values and the observed goals as x-values, to check the correlation for an entire season. I used this season, halfway through:
Now the r2 is lower, at 0.63, but the season is only halfway through, and I see one team, Örebro, obviously underachieving. I removed Örebro from the plot, giving the r2 a boost to 0.7. So, I believe in giving this model a chance. What do you say?
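That outlier effect is easy to reproduce with scikit-learn's r2_score. The goal counts below are invented to mimic one overachieving team; they are not the real Allsvenskan numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented (observed goals, ExpG) pairs; the last team is a deliberate
# outlier that scored far more than its chances suggest.
observed = np.array([18, 25, 12, 30, 21, 15, 9, 27, 40])
expected = np.array([17.2, 23.5, 14.0, 28.1, 19.8, 16.3, 11.0, 25.5, 22.0])

with_outlier = r2_score(observed, expected)
without_outlier = r2_score(observed[:-1], expected[:-1])
print(without_outlier > with_outlier)  # True
```

A single team that scores far above (or below) its chances drags r2 down for the whole league, which is exactly what Örebro did to my plot.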