So, I got some great feedback on my first blog post! Now I feel that I have to explain the model in depth and do some validation. One thing first: I will try never to write long blog posts. Google Analytics tells me that the average visitor reads my blog for 3 minutes, so every post I write should be readable in that time.

First, what is ExpG? I will borrow a quote from 11tegen11 that sums it up:

ExpG stands for Expected Goals. It measures not how many goals a team has scored, but how many goals an average team would have scored with the amount and quality of shots created.

Each goal scoring attempt is assigned a number based on the chance that this attempt produces a goal. Typical parameters to use are shot location and shot type (shot vs header). Some models, including the one I use on 11tegen11, also use assist information to separate through-balls from crosses.

Teams that produce more ExpG than they concede have the best chances of winning football matches.

Being a Python programmer, I took a pragmatic approach to calculating ExpG. I built a data set of every event recorded in Allsvenskan, Superettan and Div 1 over the last two seasons. The set contains:

- Coordinates – position on the pitch from where the finish is made
- Is it a finish from a Corner – yes/no
- Is it a finish from a Freekick – yes/no
- Is it a penalty – yes/no
- Time of the event
- Current possession of the team creating the event (in 5min intervals)

That was all the data I could find!

Using the excellent Scikit-learn library, I then trained a model on every event, excluding only the ones I wanted to calculate ExpG for, e.g. a team's ExpG this season.
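The hold-out step can be sketched roughly like this. Everything here is illustrative: the data is randomly generated, the column order, the `team` and `season` arrays and the held-out team id are assumptions, and summing the per-finish predictions is just the usual way of turning them into a team total.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the event data (not real Allsvenskan data).
# Assumed column order: x, y, corner, freekick, penalty, minute, possession.
X = np.column_stack([
    rng.uniform(0, 100, n),
    rng.uniform(0, 100, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.integers(0, 2, n),
    rng.uniform(0, 90, n),
    rng.uniform(0, 100, n),
])
goal = (rng.random(n) < 0.1).astype(float)   # 1.0 if the finish was a goal
team = rng.integers(0, 16, n)                # hypothetical team ids
season = rng.choice([2014, 2015], size=n)

# Hold out the events we want ExpG for and train on everything else.
held_out = (team == 3) & (season == 2015)
model = BayesianRidge().fit(X[~held_out], goal[~held_out])

# Sum the predicted values of the held-out finishes to get a team total.
team_expg = model.predict(X[held_out]).sum()
print(f"{held_out.sum()} held-out finishes, ExpG {team_expg:.2f}")
```

Keeping the evaluated finishes out of the training set means the model never sees the outcomes it is later asked to predict.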

And the first results were what you could see in my last blog post. As I suspected, I was not the first to use a machine learning approach to the ExpG problem. Probably among the first, but that really doesn't matter to me; I just wrote it to draw some attention. This blog post from Martin Eastwood was really interesting (written two days after mine 🙂 ) and made me want to do some verification right away, to see if I could get similar results using Superettan and Allsvenskan. My approach was somewhat similar, but I'm not sure I agree that classification rather than regression is the best approach. My philosophical view is that if you repeat the same finish 10 times, or make only small modifications to it, you will not get the same result every time. I don't believe in the binary paradigm when it comes to football. That's why I used a linear Bayesian Ridge regression in my first iteration.

Said and done. I started off where my last blog post ended, testing the same model, the linear BayesianRidge regression, on some teams' entire 2015 season so far. I'll use Gefle as the example here (I'll use them a lot, they are a very average team 🙂 ). This is what the Bayesian Ridge looked like when plotted for the entire season of 2015 so far:

Now, I have to be honest. Since I'm no mathematician, trial and error has always been the way I tend to solve machine learning problems: testing different models until I find the one that fits best. When I looked at the result above I instantly noticed that the BayesianRidge did not make a good approximation of the y-position. It seemed to always treat lower values as better. So I tried an ensemble method instead, a Gradient Boosting Regressor.
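One possible explanation, sketched on made-up data: a linear model can only give each feature a single slope, so if the scoring chance peaks in the middle of the y-range it can only say "lower is better" or "higher is better". The peak at y = 50 and its width are assumptions chosen for illustration, not taken from the real data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
n = 5000

# Assumption: y runs from 0 to 100 and the scoring chance peaks at y = 50.
y_pos = rng.uniform(0, 100, n)
p_goal = 0.4 * np.exp(-((y_pos - 50) / 15) ** 2)
goal = (rng.random(n) < p_goal).astype(float)
X = y_pos.reshape(-1, 1)

linear = BayesianRidge().fit(X, goal)
boosted = GradientBoostingRegressor(random_state=0).fit(X, goal)

# Compare predictions near one touchline, in the centre, and near the other.
probe = np.array([[5.0], [50.0], [95.0]])
linear_pred = linear.predict(probe)
boosted_pred = boosted.predict(probe)
print("linear :", linear_pred.round(3))
print("boosted:", boosted_pred.round(3))
```

The tree-based ensemble can split the y-range into regions and give the centre a higher value than either side, which a single linear coefficient cannot do.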

Now that made a lot more sense. The larger the circle, the larger the ExpG, i.e. the chance of a goal. And as Michael Caley has shown in this article, it is quite obvious that the location of the shot is very important when it comes to converting a finish into a goal.

Still, one important thing was missing from the plot above: I had forgotten to indicate whether the finish came from a penalty or not. With that added, the plot looked like this:

A very large circle on the spot marks the one penalty Gefle has had this season. I also added the ExpG to every circle on hover. The penalty had an ExpG of 0.78, which I think matches Martin Eastwood's model more or less spot on!

Now, you might wonder how many goals Gefle have scored so far this season? Well, it is **18**.

One neat thing about Scikit-learn and the Gradient Boosting Regressor is the `feature_importances_` property. From the docs:

`Return the feature importances (the higher, the more important the feature).`

So I just ran this on my model to see how the parameters I have are valued:

`print(clf.feature_importances_)`

`[0.34722513 0.26157164 0.02453219 0.0147424 0.05105901 0.19241527 0.10845434]`

What does it mean?

- xpos is valued at 35%
- ypos is valued at 26%
- finish from corner at 2%
- finish from freekick at 1%
- finish from penalty at 5%
- game time at 19%
- possession at 11%
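To keep the pairing readable, the names can be zipped with the array. A small sketch using the values printed above; the short feature names are labels invented here for the columns, in the order listed.

```python
# Labels are illustrative; the values are the ones printed by the model above.
feature_names = ["xpos", "ypos", "corner", "freekick",
                 "penalty", "game_time", "possession"]
importances = [0.34722513, 0.26157164, 0.02453219, 0.0147424,
               0.05105901, 0.19241527, 0.10845434]

# Sort from most to least important and print as percentages.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<12}{value:6.1%}")

# The importances are normalized, so they should add up to roughly 1.
total = sum(importances)
```

Note that an importance only says how much the model relies on a feature, not in which direction the feature pushes the prediction.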

So my conclusion is that the position of the finish is the most valuable factor in whether a finish is converted into a goal. But the game time and the possession of the team creating the chance also affect the outcome. Probably penalties too, but that will have to be another blog post.

So, to sum this up, I have made the same plot that Martin Eastwood made, with Expected Goals as y-values and observed goals as x-values, to check the correlation over an entire season. I used this season, halfway through:

Now the r² is lower, at 0.63, but the season is only halfway through and I see one team, Örebro, obviously underachieving. Removing Örebro from the plot boosts the r² to 0.7. So, I believe in giving this model a chance. What do you say?
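For anyone who wants to repeat the check, here is a minimal sketch, assuming r² means the squared Pearson correlation between ExpG and observed goals per team. The numbers are invented for illustration and are not the Allsvenskan values; the last team plays the role of an underachiever.

```python
import numpy as np

def r_squared(expected, observed):
    """Squared Pearson correlation between two equally long sequences."""
    return np.corrcoef(expected, observed)[0, 1] ** 2

# Hypothetical per-team totals; the last team scores far below its ExpG.
expg = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 28.0])
goals = np.array([19, 26, 29, 36, 41, 15])

r2_all = r_squared(expg, goals)
r2_without = r_squared(expg[:-1], goals[:-1])
print(f"all teams: {r2_all:.2f}, without the outlier: {r2_without:.2f}")
```

With only a handful of teams, a single outlier moves r² a lot, so a figure from half a season is best read with that in mind.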

Great approach and very well explained! I’m myself going to test a similar model soon for the Spanish league 🙂

I'm now getting comfortable with Python; I'm more of a JS, MySQL or R user.

Keep the posts flowing!


Thanks!


Hey, just stumbled upon this site… I've been hunting for Allsvenskan shot data and came up empty. Do you mind sharing where you get it?

Currently I have xG models for Argentina, Netherlands and a few other less followed leagues but looking to add! thanks


Hi!

I use svenskfotboll.se as source for the shot data. They have it under livescores. Where do you get shots data for Netherlands btw? Been looking for that 🙂


Looks nice and clean. Very good alternative method, but do you think that gradient boosting, or ML in general, is able to capture the exponential drop in ExpG by distance/angle when fed only the x/y position? What if the same data is fed with angle/distance instead? Another parameter that could easily be added is game state (-1/0/+1).

Great work


Thanks!

Game state has been added, https://blog.stryktipsetisistastund.se/2015/08/04/impact-of-game-state-to-my-expected-goalsexpg-model/ 🙂

Since this is an old post, I have had time to look into your question about angles. I have come to see this more as a classification problem, and with that approach feeding the algorithm pure coordinates makes more sense. With that approach, angles/distance or pure coordinates make no difference.
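For anyone who wants to run that comparison themselves, distance and angle can be derived from the coordinates. A minimal sketch, assuming the shot position has already been converted to metres relative to the centre of the goal; the function and parameter names are made up for illustration, and the angle used is the one between the lines from the shot to the two posts.

```python
import math

GOAL_WIDTH_M = 7.32  # standard distance between the posts

def distance_and_angle(dx_m, dy_m):
    """dx_m: metres from the goal line, dy_m: metres sideways from the goal centre."""
    distance = math.hypot(dx_m, dy_m)
    half = GOAL_WIDTH_M / 2
    # Angle (radians) between the lines from the shot to each post.
    angle = math.atan2(GOAL_WIDTH_M * dx_m, dx_m ** 2 + dy_m ** 2 - half ** 2)
    return distance, angle

# Penalty spot: 11 m straight in front of goal.
spot_distance, spot_angle = distance_and_angle(11.0, 0.0)
# Same depth, but 20 m out to the side: longer distance, narrower angle.
wide_distance, wide_angle = distance_and_angle(11.0, 20.0)
print(f"{spot_distance:.1f} m, {math.degrees(spot_angle):.1f} degrees")
print(f"{wide_distance:.1f} m, {math.degrees(wide_angle):.1f} degrees")
```

These two columns can then be fed to the model instead of, or alongside, the raw x/y values.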


You got me interested and I read the updated article. I've made some tests and indeed coordinates really give better results than distance/angle. Also, using only shots on target gives much better results for some reason. The game state impact of 27% in your article is, I think, way too high, maybe due to a small sample size. Maybe you should cross-validate and check.

While I don't have your data to test directly, I tested with shots from the Premier League, La Liga, Bundesliga and Serie A for 4 seasons (around 100000 shots), and these are the feature importances I get with a goals / expg r2 of 0.83:

xpos 0.408

ypos 0.382

header 0.058

corner 0.037

direct free kick 0.014

penalty 0.035

minute 0.019

game state 0.047

I couldn’t figure out how to get the possession in the last 5 minutes so that’s why I didn’t include it.

I went a little bit further and got the r2 to around 0.87 by adding the type of pass (cross, through ball), whether there was a dribble just before the shot, and whether the shot was the result of a fast break or was a big chance (1 vs 1, etc.).

xpos 0.273

ypos 0.230

minute 0.121

game_state 0.041

header 0.023

corner 0.018

direct_free_kick 0.014

penalty 0.028

fast_break 0.040

big_chance 0.079

dribble 0.047

cross 0.036

through_ball 0.050

I have to say that I'm no expert in any of this, so there could be errors in my code.

Another thing I was thinking about is that xpos / ypos in the data I have are normalized to a range from 0 to 100, so when pitches are of different sizes the coordinates are a little bit off and must be converted for even better results. I'll try to get the pitch sizes for all the teams I have and test.
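The conversion itself is a simple rescale. A sketch under the assumption that x runs along the length of the pitch towards the attacked goal at x = 100; the two pitch sizes are hypothetical examples, not measured stadium dimensions.

```python
def to_metres(x_norm, y_norm, pitch_length_m, pitch_width_m):
    """Scale 0-100 normalized coordinates to metres on a given pitch."""
    return x_norm / 100 * pitch_length_m, y_norm / 100 * pitch_width_m

# Two hypothetical pitch sizes (length, width) in metres.
pitches = {"pitch_a": (105.0, 68.0), "pitch_b": (100.0, 64.0)}

# The same normalized shot position lands at a different real distance.
for name, (length, width) in pitches.items():
    x_m, y_m = to_metres(90, 50, length, width)
    print(f"{name}: {length - x_m:.1f} m from the goal line")
```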


Very good!!

Yes, I also have a lower impact from GS in my model nowadays. The next level of xG, to improve the r2 even more, would be to add tracking of opponents' positions relative to the ball… I think that would have a huge impact!


Hi!

Is it people or machines that produce the data? I mean, does a person mark where a shot was taken from, or is it read automatically in some way?

How does it work with duels won or passes?

In other words, how is the hard data produced?

It would be interesting to know, with regard to the reliability of the data.

Kind regards,


So far it is manual work.. But we are working on automating it. It is precisely the reliability that isn't quite there yet when it comes to full automation..


Ok. Do you know the program instat?

Do you know whether their hard data works the same way?

It seems unreasonable to me that all the information about which player has been on for how long (futsal), who shot from where and so on has been entered manually.


Actually, I don't know how they work..
