Predictive strength of Expected Goals 2.0

By Matthias Kullowatz (@MattyAnselmo)

It is my opinion that a statistic capable of predicting itself---and perhaps more importantly predicting future success---is a superior statistic to one that only correlates to "simultaneous success." For example, a team's actual goal differential correlates strongly to its current position in the table, but does not predict the team's future goal differential or future points earned nearly as well. I created the expected goals metrics to be predictive at the team level, so without further ado, let's see how the 2.0 version did in 2013.

Mid-season Split

In predicting future goals scored and allowed, the baseline is to use past goals scored and allowed. In this case, expected goals beats actual goals in its predictive ability by quite a bit.*

Predictor Response R2 P-value
xGD (by gamestate) GD (last 17) 0.805 0.000
xGD (first 17) GD (last 17) 0.800 0.000
xGA (first 17) GA (last 17) 0.604 0.000
GD (first 17) GD (last 17) 0.487 0.000
xGF (first 17) GF (last 17) 0.409 0.004
GA (first 17) GA (last 17) 0.239 0.024
GF (first 17) GF (last 17) 0.155 0.099

Whether you're interested in offense, defense, or differential, Expected Goals 2.0 outperformed actual goals in its ability to predict the future (the future in terms of goal scoring, that is). That 0.800 R-squared figure for xGD 2.0 even beats xGD 1.0, calculated at 0.624 by one Steve Fenn. One interesting note is that by segregating expected goals into even gamestates and non-even gamestates, very little predictive ability was gained (R-squared = 0.805).
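For anyone who wants to replicate this kind of split-half check, here is a minimal sketch in Python. It is not the exact code used here: it assumes a hypothetical per-team data frame with made-up column names (xgd_first17 for first-half expected goal differential, gd_last17 for second-half goal differential, home_first17 for first-half home games played), and the control variable is entered first so the sequential (Type I) sums of squares behave as described in the footnote below. The same setup applies to the four-game split in the next section, just with a different window.

# A minimal sketch of the split-half regression, not the exact code used here.
# The file and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

teams = pd.read_csv("mls_2013_team_halves.csv")  # hypothetical data file

# Enter the home-games control first so the sequential (Type I) sum of squares
# credits it before first-half xGD gets evaluated.
model = smf.ols("gd_last17 ~ home_first17 + xgd_first17", data=teams).fit()

print(anova_lm(model, typ=1))  # sequential sums of squares and p-values
print(round(model.rsquared, 3))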

Early-season Split

Most of those statistics above showed some predictive ability over 17 games, but what about over fewer games? How early do these goal-scoring statistics become stable predictors of future goal scoring? I reduced the games played for my predictor variables down to four games---the point of the season most teams have reached right now---and here are those results.

Predictor Response R2 P-value
xGD (by gamestate) GD (last 30) 0.247 0.104**
xGA (first 4) GA (last 30) 0.236 0.033
xGD (first 4) GD (last 30) 0.227 0.028
xGF (first 4) GF (last 30) 0.140 0.093
GF (first 4) GF (last 30) 0.022 0.538
GD (first 4) GD (last 30) 0.015 0.616
GA (first 4) GA (last 30) 0.003 0.835

Some information early on is just noise, but we see statistically significant correlations from expected goals on defense (xGA) and in differential (xGD) after only four games! Again, we don't see much improvement, if any at all, in separating out xGD for even and non-even gamestates. If we were to look at points in the tables as a response variable, or perhaps include information on minutes spent in each gamestate, we might see something different there, but that's for another week!

Check out the updated 2014 Expected Goals 2.0 tables, which now just might be meaningful in predicting team success for the rest of the season.

*A "home-games-played" variable was used as a control to account for teams whose early schedules were weighted toward one extreme. R-squared values and p-values were derived from a sequential sum of squares, thus reducing the effects of home games played on the p-value.

**Though the R-squared value was higher, splitting up xGD into even and non-even game states seemed to muddle the p-values. Essentially, the regression was unsure where to apportion credit for the explanation.

The Predictive Power of Shot Locations Data

Two articles in particular inspired me this past week---one by Steve Fenn at the Shin Guardian, and the other by Mark Taylor at The Power of Goals. Steve showed us that, during the 2013 season, the expected goal differentials (xGD) derived from the shot locations data were better than any other statistics available at predicting outcomes in the second half of the season. It can be argued that statistics that are predictive are also stable, indicating underlying skill rather than luck or randomness. Mark came along and showed that the individual zones themselves behave differently. For example, Mark's analysis suggested that conversion rates (goal scoring rates) are more skill-driven in zones one, two, and three, but more luck-driven or random in zones four, five, and six. Piecing these fine analyses together, there is reason to believe that a partially regressed version of xGD may be the most predictive. The xGD currently presented on the site regresses all teams' finishing rates fully back to league average. However, one might guess that finishing rates in certain zones are more skill-driven, and thus predictive. Essentially, we may be losing important information by fully regressing finishing rates to league average within each zone.

I assessed the predictive power of finishing rates within each zone by splitting the season into two halves, and then looking at the correlation between finishing rates in each half for each team. The chart is below:

Zone Correlation P-value
1 0.11 65.6%
2 0.26 28.0%
3 -0.08 74.6%
4 -0.41 8.2%
5 -0.33 17.3%
6 -0.14 58.5%

Wow. This surprised me when I saw it. There are no statistically significant correlations---especially when the issue of multiple testing is considered---and some of the suggested correlations are actually negative. Without more seasons of data (they're coming, I promise), my best guess is that finishing rates within each zone are pretty much randomly driven in MLS over 17 games. Thus full regression might be the best way to go in the first half of the season. But just in case...

I grouped zones one, two, and three into the "close-to-the-goal" group, and zones four, five, and six into the "far-from-the-goal" group. The results:

Zone Correlation P-value
Close 0.23 34.5%
Far -0.47 4.1%

Okay, well this is interesting. Yes, the multiple testing problem still exists, but let's assume for a second there actually is a moderate negative correlation for finishing rates in the "far zone." Maybe the scouting report gets out by mid-season, and defenses close out faster on good shooters from distance? Or something else? Or this is all a type-I error---I'm still skeptical of that negative correlation.

Without doing that whole song and dance for finishing rates against, I will say that the results were similar. So full regression on finishing rates for now, more research with more data later!
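For concreteness, here is roughly how that within-zone split-half correlation could be computed. This is a sketch only: it assumes a hypothetical shot-level data frame with columns team, game_number, zone, and goal, and the file name is a placeholder rather than the actual dataset.

# A sketch of the split-half finishing-rate correlation by zone.
# All file and column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

shots = pd.read_csv("mls_2013_shots_by_zone.csv")  # hypothetical data file
shots["half"] = shots["game_number"].gt(17).map({False: "first", True: "second"})

# Finishing rate = goals / attempts for each team, zone, and half of the season
rates = (shots.groupby(["zone", "team", "half"])["goal"]
              .mean()
              .unstack("half"))

for zone, grp in rates.groupby(level="zone"):
    r, p = pearsonr(grp["first"], grp["second"])
    print(f"Zone {zone}: correlation = {r:.2f}, p-value = {p:.1%}")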

But now, piggybacking onto what Mark found, there do seem to be skill-based differences in how many total goals are scored by zone. In other words, some teams are designed to thrive off of a few chances from higher-scoring zones, while others perhaps are more willing to go for quantity over quality. The last thing I want to check is whether or not the expected goal differentials separated by zone contain more predictive information than when lumped together.

Like some of Mark's work implied, I found that our expected goal differentials inside the box are very predictive of a team's actual second-half goal differentials inside the box---the correlation coefficient was 0.672, better than simple goal differential, which registered a correlation of 0.546. This means that perhaps the expected goal differentials from zones one, two, and three should get more weight in a prediction formula. Additionally, having a better goal differential outside the box, specifically in zones five and six, is probably not a good thing. That would just mean that a team is taking too many shots from poor scoring zones. In the end, I went with a model that used attempt difference from each zone, and here's the best model I found.*

Zone Coefficient P-value
(Intercept) -0.61 0.98
Zones 1, 3, 4 1.66 0.29
Zone 2 6.35 0.01
Zones 5, 6 -1.11 0.41

*Extremely similar results to using expected goal differential, since xGD within each zone is a linear function of attempts.

The R-squared for this model was 0.708, beating out the model that just used overall expected goal differential (0.650). The zone that stabilized fastest was zone two, which makes sense since about a third of all attempts come from zone two. Bigger sample sizes help with stabilization. For those curious, the inputs here were attempt differences per game over the first seventeen games, and the response output is predicted total goal differential in the second half of the season.
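Here is a minimal sketch of how that zone-grouped model could be fit, assuming a hypothetical data frame in which the per-game attempt differences over the first seventeen games have already been aggregated (att_diff_z134, att_diff_z2, att_diff_z56) alongside second-half goal differential (gd_last17). All names are illustrative, not the actual dataset.

# A sketch of the attempt-difference model by zone group.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

zones = pd.read_csv("mls_2013_zone_attempt_diffs.csv")  # hypothetical data file

model = smf.ols("gd_last17 ~ att_diff_z134 + att_diff_z2 + att_diff_z56",
                data=zones).fit()

print(model.summary())  # coefficients, p-values, and R-squared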

This research hardly offers a closed-the-door conclusion, but I would suggest that each zone contains unique information, and that separating those zones out could strengthen predictions by a measurable amount. I would also suggest that breaking shots down by angle and distance, and then by kicked and headed, would be even better. We all have our fantasies.

Noisy Finishing Rates

As a supplement to the stabilization analysis I did last week, I wanted to add the self-predictive powers of finishing rates—basically soccer’s shooting percentage. Team finishing rates can be found both on our MLS Tables and in our Shot Locations analysis, so it would be nice to know if we can trust them. Last week I split the 2012 and 2013 seasons in half and assessed the simple linear relationships for various statistics between the two halves of each season across all 19 teams. Now I have 2011 data, and we can have even more fun. I included bivariate data from both 2011 and 2012 together, leaving out 2013 since it is not over yet. It is important to note that I am not looking across seasons, only within seasons. To the results!

Stat Correlation P-value
Points 0.438 0.7%
Total Attempts 0.397 1.5%
Blocked Shots 0.372 2.3%
Shots on Goal 0.297 7.4%
Goals 0.261 11.9%
Shots off Goal 0.144 39.5%
Finishing 0.109 52.1%
Surprisingly, to me at least, a team’s points earned has been the most stable statistic in MLS (by my linear definition of stability). Not so surprising to me was that total attempts is also one of the most stable. Look down at the very bottom, and you’ll find finishing rates. Check out the graph below:

[Graph: Finishing Rates Stabilization, 2011-2012]

Some teams finish really well early in the season, then flop. Others finish poorly, then turn it on. But there's no obvious pattern that would allow us to predict second-half finishing rates. In fact, the best prediction for any given team would be to suggest that they will regress to league average, which is exactly what our Luck Table does. It regresses all teams' finishing rates in each zone back to league averages, then calculates an expected goal differential.
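For anyone curious what that full regression looks like mechanically, here is a simplified sketch: each team's attempts in a zone are multiplied by the league-average finishing rate for that zone, and the results are summed into expected goals for and against. The file and column names are hypothetical, and the real Luck Table may handle details differently.

# A simplified sketch of a fully regressed expected goal differential.
# File and column names are hypothetical placeholders.
import pandas as pd

zone_totals = pd.read_csv("mls_team_zone_totals.csv")  # hypothetical data file

# League-average finishing rate in each zone
league_rate = (zone_totals.groupby("zone")["goals_for"].sum()
               / zone_totals.groupby("zone")["attempts_for"].sum())

# Replace each team's own finishing rate with the league average for the zone
zone_totals["xgf"] = zone_totals["attempts_for"] * zone_totals["zone"].map(league_rate)
zone_totals["xga"] = zone_totals["attempts_against"] * zone_totals["zone"].map(league_rate)

xgd = (zone_totals.groupby("team")[["xgf", "xga"]].sum()
       .assign(xgd=lambda d: d["xgf"] - d["xga"]))

print(xgd.sort_values("xgd", ascending=False))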

On a side note, you might be asking yourself why I don't just use points to predict points. Because this: while the correlation between first-half and second-half points is about 0.438, the correlation between first-half attempts ratios and second-half points is slightly stronger at 0.480. Also, in a multiple regression model where I let both first-half attempts ratio and first-half points duke it out, first-half attempts ratio edges out points for winner of the predictor trophy.

Estimate Std. Error T-stat P-value
Intercept 1.7019 5.97 0.285 77.7%
AttRatio 13.7067 6.32 2.17 3.7%
Points 0.3262 0.19 1.691 10.0%

And since this is a post about finishing rates...

Estimate Std. Error T-stat P-value
Intercept -2.243 7.75 -0.29 77.4%
AttRatio 18.570 5.71 3.26 0.3%
Finishing% 63.743 50.08 1.27 21.2%
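For completeness, a rough sketch of those head-to-head regressions is below, again with hypothetical file and column names (att_ratio and points_first for the first half, finishing_first for first-half finishing rate, points_second for second-half points).

# A sketch of the two multiple regressions above.
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

halves = pd.read_csv("mls_2011_2012_team_halves.csv")  # hypothetical data file

# First-half attempts ratio vs. first-half points as predictors of second-half points
duel_points = smf.ols("points_second ~ att_ratio + points_first", data=halves).fit()

# First-half attempts ratio vs. first-half finishing rate
duel_finishing = smf.ols("points_second ~ att_ratio + finishing_first", data=halves).fit()

print(duel_points.summary())
print(duel_finishing.summary())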

A good prediction model (on which we are working) will include more than just a team's attempts ratio, but for now, it is king of the team statistics.