Short passes dominate every soccer game. They are the most abundant on-the-ball action. But the variation in short pass accuracy is small; the difference in short pass success rates between the best and the worst team in MLS is 13%. For a typical game with about 400 short passes, the difference represents 52 more successful attempts, or one extra pass every two minutes. How much impact can these extra passes have?

Atlanta United is especially dependent on short passes that lead to shots. What would a few more short passes mean for their offense? Yankee Stadium is a tough place for any visiting team. Critics say that it is too small, and only New York City FC play well there. How exactly do they take advantage of the home turf?

The best way to approach these questions, other than watching thousands of clips, is to build a model from data and use it to examine, or even predict, what a team excels at or struggles with. There isn’t one... yet. Can we make one?

Incremental Success Makes a Big Impact

Atlanta’s short pass accuracy has a significant impact on their offensive performance: the correlation coefficient between their short pass success rate and the number of shots they create (excluding shots from corners) is close to 0.71. A correlation coefficient (the Pearson correlation coefficient, r, to be precise) measures the strength and direction of the linear relationship between two variables (in our case, the short pass success rate and the number of shots created). It ranges from -1 to +1. A value of +1 means a perfect positive correlation (as one value goes up, the other goes up). A value of -1 means a perfect negative correlation (as one value goes up, the other goes down). Zero means that there is no linear correlation between the two variables. Bear in mind that even a perfect correlation doesn’t imply a cause-and-effect relationship between two variables, and even if there is one, its direction isn’t clear. I am assuming that the short pass success rate affects the number of shots created because a shot is the end result of a possession. After the shot, possession changes hands (unless the shot results in a corner, but we don’t consider corners in our analysis), so most passes should precede most shots (a pass can occasionally happen after a blocked shot).
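For readers who want to see the calculation itself, here is a minimal sketch of the Pearson coefficient. The function name `pearson_r` and the five per-game values are hypothetical and chosen only for illustration; they are not Atlanta's real data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-game values, NOT taken from the article's data set.
short_pass_rate = [0.80, 0.82, 0.84, 0.86, 0.88]
shots_created = [8, 11, 10, 14, 15]

r = pearson_r(short_pass_rate, shots_created)
print(f"r = {r:.2f}")
```

A perfectly linear pair such as `[1, 2, 3]` and `[2, 4, 6]` gives +1, and reversing one of the lists gives -1.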

With such a high correlation between them, we can use linear regression to describe the linear relationship between the short pass success rate and the number of shots created:

A scatterplot of Atlanta United’s short pass success rate and shots they created, per game in 2018. The two variables have a correlation coefficient of r = 0.71, so the more accurate Atlanta United’s short passes are, the more shots they create. The red line depicts a trendline that describes the relationship between the two variables with an equation: Number of shots = 107.4795 x (Short pass success rate) – 78.05535.

Number of shots = 107.4795 x (Short pass success rate) – 78.05535
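As a quick illustration, the trendline can be evaluated directly. This sketch only plugs the two printed coefficients into a hypothetical helper, `predicted_shots`; the rates 0.84 and 0.86 are example inputs.

```python
# Coefficients copied from the trendline equation above.
SLOPE = 107.4795
INTERCEPT = -78.05535

def predicted_shots(short_pass_rate):
    """Shots per game predicted by the trendline for a rate given as a fraction."""
    return SLOPE * short_pass_rate + INTERCEPT

at_84 = predicted_shots(0.84)
at_86 = predicted_shots(0.86)
extra = at_86 - at_84  # equals SLOPE * 0.02

print(f"84%: {at_84:.1f} shots, 86%: {at_86:.1f} shots, extra: {extra:.2f}")
```

With the coefficients as printed, a two-point rise in the success rate works out to a little over two extra shots per game. The trendline is only meaningful near the range of success rates Atlanta actually recorded.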

The linear regression model is predictive for Atlanta because of the high correlation between their short pass accuracy and chance creation. It doesn’t work as well for other teams because all of them have a correlation coefficient below 0.6.

This model has an interesting implication: given that Atlanta United’s average short pass success rate is 84%, an increase of as little as two percentage points would help them generate 2.4 more shots. How can such a tiny difference create so many extra shots? Atlanta United play about 422 short passes per game, so a two-point increase means about nine more completed short passes, or one more completed short pass every five minutes (assuming they hold 50% of the possession). How can this small change impact shot creation at all?

The answer lies in the way we approach the data. A team doesn’t try to make 400 individual passes. It tries to string together several consecutive passes to advance the ball. We should not treat any pass as an individual action. Passes are part of a group of actions (the possession, or a pass sequence/chain) that allows a team to reach a shooting position. A failure in any individual action of the possession leads to a breakdown. For example, Atlanta United average 3.4 short passes per possession, meaning that for them to complete the possession, all of those passes need to succeed together. An 84% short pass success rate becomes a possession completion rate of 0.84^3.4 = 55% if we ignore long passes.
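The compounding step above is a one-line calculation. The sketch below assumes, as the text does, that every pass in a possession succeeds independently with the same probability; the function name is hypothetical.

```python
def possession_completion_rate(pass_success_rate, passes_per_possession):
    """Chance that every pass in a possession succeeds, assuming independence."""
    return pass_success_rate ** passes_per_possession

# Atlanta's figures from the text: 84% success, 3.4 short passes per possession.
rate = possession_completion_rate(0.84, 3.4)
print(f"{rate:.2f}")  # roughly 0.55
```

The exponent is an average, so it does not have to be a whole number.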

We can design a model to test how these changes of pass success or possession completion affect the generation of shots. Consider this simple model with three components:

Number of shots = Number of possessions x Possession completion rate x Possession-to-shot generation rate

I am making three assumptions: 1) The successful outcome of any possession is that the ball reaches the shooting position. 2) The possession consists of only passes. 3) A team attempts to create a shot from every successful possession.

We know that on average, Atlanta United have 130 possessions per game, average 3.4 short passes and 0.1 long passes per possession, complete 55% of the long passes, and convert 22% of the possessions ending at the final third into shots. If we only manipulate the short pass success rate, our model becomes:

Number of shots = 130 (possessions per game) x 0.22 (final third possessions ending in shots) x (Long pass success rate)^(Number of long passes) x (Short pass success rate)^(Number of short passes)

Now let’s examine how many more shots they can generate when the short pass success rate increases from 84% to 86%:

Additional shots = 130 x 0.22 x 0.55^0.1 x [(0.86)^3.4 - (0.84)^3.4] = 1.24 shots

According to our model, Atlanta United will make 1.2 more shots when their short pass success rate increases by 2% from their average.
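The same arithmetic can be written as a small script, using only the averages quoted in the text. The function and variable names are illustrative.

```python
def model_shots(possessions, shot_rate, long_rate, long_n, short_rate, short_n):
    """Shots per game: possessions x completion rate x possession-to-shot rate."""
    completion = (long_rate ** long_n) * (short_rate ** short_n)
    return possessions * completion * shot_rate

# Atlanta United averages quoted in the text.
base = dict(possessions=130, shot_rate=0.22, long_rate=0.55, long_n=0.1, short_n=3.4)

shots_at_84 = model_shots(short_rate=0.84, **base)
shots_at_86 = model_shots(short_rate=0.86, **base)
extra_shots = shots_at_86 - shots_at_84

print(f"extra shots: {extra_shots:.2f}")  # about 1.24
```

Only the short pass success rate changes between the two calls, so the difference isolates its effect within the model.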

Compared with the estimate from the linear regression model (2.4 shots), our model explains about 50% of the effect that increased short pass accuracy has on Atlanta’s offensive performance. In a way, we are breaking down half of Atlanta’s offensive game into six variables (number of short passes, short pass success rate, number of long passes, long pass success rate, number of possessions, and the possession-to-shot conversion rate).

The remaining 50% of the effect can be explained by a myriad of factors. For example, the possession in our model consists only of passes, but making more passes may also help a team create more dribbling opportunities.

Atlanta’s high correlation between short pass accuracy and chance creation may reflect their reliance on the short pass not just to advance the ball but to penetrate the defense. But the same linear regression model, with the short pass success rate as the critical variable, won’t work as well for other teams. They may rely on other actions, such as the dribble or the transition, to advance the ball. I also assume that all short passes are homogeneous when I measure their success rate, meaning a pass in the initial third is treated the same as one in the middle third. That assumption doesn’t hold for a lot of passes. But the most important point of the model is that it explains how incremental changes in a team’s behavior make a significant impact on its shot creation. Actions need to be grouped into possessions, and small individual differences add up to affect the outcome of the possession.

Tiny Field Makes a Big Difference

From the start of a possession, we can summarize the offensive phase into a model:

Number of goals = Number of possessions x Possession completion rate x Possession-to-shot generation rate x Shot-to-goal conversion rate

The model helps us break down the offensive phase into multiple components that we can measure individually. Two teams that score the same number of goals can do so with different styles: one team may amass a considerable number of possessions but convert them into shots with low efficiency, while the other may have fewer possessions but turn a large share of them into goals. We can isolate the critical weakness or strength of any team.

Let’s take one example to see how such a model can be useful: New York City FC. The tiny Yankee Stadium not only affects how a visiting team like Atlanta plays its offensive game, but it also boosts New York City’s attacking performance; New York City FC create 1.1 more xG at home (2.3 xG and 1.2 xG at home and away, respectively). The xG difference is the second highest in MLS. With a similar xG/shot ratio (0.126 at home vs. 0.11 away), New York City raise their firepower by 6.6 more shots at home (a 67% increase over what they create in other stadiums, excluding corners), the highest in MLS.

With a smaller field compared to other stadiums, New York City’s players play more aggressive defense at home than away. Charles Boehm has written that Yankee Stadium’s small size makes the game frenetic and chaotic. He is right: New York City FC make 1.4 more tackles per 100 opponent passes at home, the second-largest increase between home and away games in MLS. But the increased defensive intensity doesn't help them create better chances at home; the correlation coefficient between their tackling intensity and xG is only 0.15, meaning that the two variables have little or no relationship. Increased defensive pressure cannot explain how New York City increase their offensive power at home.

So what factors are increasing New York City’s shot creation in the tiny Yankee Stadium? Let’s go back to our shot creation model:

Number of shots = Number of possessions x Possession completion rate x Possession-to-shot generation rate

If we can identify the critical changes in the above variables between home and away games, we may be able to explain how the small field size affects them. The table below lists all three variables for New York City’s home and away games.

| New York City FC 2018 | Home | Away | Difference (Home minus Away) | % change (Difference divided by Away) |
| --- | --- | --- | --- | --- |
| Number of possessions | 151 | 137 | 14 | 10.2 |
| Possession completion rate | 0.34 | 0.24 | 0.10 | 41.7 |
| Possession-to-shot generation rate | 0.30 | 0.29 | 0.01 | 3.4 |
| Number of shots | 16 | 9.4 | 6.6 | 70.2 |

Applying these values to our model, we will get an increase of 5.9 shots when New York City plays at home. The difference accounts for 90% of the actual variation (6.6 shots).
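Here is a short sketch of that calculation, using the home and away values listed for New York City FC in 2018. The helper name `shots` is illustrative.

```python
def shots(possessions, completion_rate, shot_rate):
    """Number of shots = possessions x completion rate x possession-to-shot rate."""
    return possessions * completion_rate * shot_rate

# New York City FC 2018 home and away values.
home_shots = shots(151, 0.34, 0.30)
away_shots = shots(137, 0.24, 0.29)

model_diff = home_shots - away_shots
actual_diff = 16 - 9.4  # observed home-minus-away difference in shots
explained = model_diff / actual_diff

print(f"model: +{model_diff:.1f} shots, share of actual gap: {explained:.0%}")
```

The model's home and away shot counts (about 15.4 and 9.5) sit close to the observed 16 and 9.4, which is why the predicted gap covers most of the real one.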

Their possession-to-shot generation rate gets only a marginal boost of about 3% when they play at home. The difference accounts for only 0.4 shots, or 7% of the model’s prediction. The number of possessions increases from 137 when they play away to 151 when they play at home, about a 10% increase. This difference accounts for 1.2 shots, or 20% of the model’s prediction. This result is consistent with Boehm’s description of the chaotic gameplay at Yankee Stadium: the more frantic a game becomes, the more possessions each team should hold.

But the most significant impact comes from the possession completion rate. It increases by 42% when New York City play at home and accounts for 73% of the model’s prediction. It is the critical factor that helps New York City increase their firepower at home. But how do New York City optimize their tactics to complete more possessions?
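Because the model is a product of three factors, one simple way to split the home-away gap among them is to compare the logarithms of the home/away ratios. Whether the percentages above were derived this way is an assumption on my part, but the sketch lands close to them (roughly 20%, 73% and 7%).

```python
from math import log

# New York City FC 2018 home and away values.
home = {"possessions": 151, "completion_rate": 0.34, "shot_rate": 0.30}
away = {"possessions": 137, "completion_rate": 0.24, "shot_rate": 0.29}

# In a multiplicative model, log(home/away) of the product is the sum of the
# per-factor logs, so each factor's log ratio is its additive contribution.
log_ratios = {name: log(home[name] / away[name]) for name in home}
total = sum(log_ratios.values())
shares = {name: value / total for name, value in log_ratios.items()}

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Multiplying each share by the model's predicted gap of about 5.9 shots gives the per-factor shot counts.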

Of all the team behaviors I have examined, two stand out when New York City play at home. They push their defensive line 7.8% higher at home, the second-largest increase in MLS. They also start their possessions 7.7% closer to the opponent’s goal at home, the largest increase in MLS. Both behaviors correlate with New York City’s chance creation:

A correlation plot of the average location of tackles and the xG a team creates. For example, Sporting Kansas City have a correlation coefficient close to 0.75 between the two variables, meaning the higher they hold their tackling line, the more chances they've created.

A correlation plot of the average start location of possessions and the xG a team creates. For every MLS team, starting the possessions higher helps it create better chances.

The small field can push New York City’s defensive line closer to the opponent’s goal because it leaves less space to cover. A team always needs to weigh the risk of leaving space behind its last defenders. Yankee Stadium is about 15% smaller than most other stadiums, meaning the players have 15% less space to cover. They take fewer risks when they press higher at home. When their defensive line is so high, they recover more possessions closer to the opponent’s goal. Yankee Stadium’s field is also at least 5% shorter than most others, so it should be easier for the ball to reach a shooting position.

Modeling a Soccer Game

Number of shots = Number of possessions x Possession completion rate x Possession-to-shot generation rate

Our model has two key features: it groups individual events into possessions, and it breaks the offensive phase down into a sequence of steps. Individual variations accumulate to affect the number of goals. Consider Atlanta United as an example: 400 passes become 140 possessions. An 84% pass success rate becomes a 55% possession completion rate. Instead of dealing with more than 300 successful passes, we are merely dealing with 84 successful possessions that arrive at shooting positions. 20 tackles can only affect 5% of passes, but they will hit 20% of possessions.

Even if we focus only on the possession completion, our model is vastly simplifying the actual game; Atlanta United do not play only 3.4 short passes and 0.1 long passes in each possession. Anyone familiar with their play will assume that they play a lot more than three short passes when they control the ball. But that feeling ignores all those back-and-forth possession exchanges with the opponents. Nevertheless, we have to include more events such as the dribble into the possession:

Possession completion rate = (Long pass success rate)^(Number of long passes) x (Short pass success rate)^(Number of short passes) x (Dribble success rate)^(Number of dribbles)

Again, a completed/successful possession is one that reaches the shooting position, or the final third. On average, this model estimates 19% more successful possessions than the actual data. That isn’t bad, considering that the model contains only six variables.
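The extended completion rate can be sketched as a product over action types. The long and short pass figures below are Atlanta's averages from earlier in the article. The dribble figures are hypothetical placeholders, since the article does not quote them.

```python
def completion_rate(actions):
    """Multiply (success rate) ** (actions per possession) over all action types."""
    rate = 1.0
    for success_rate, per_possession in actions:
        rate *= success_rate ** per_possession
    return rate

passes_only = [
    (0.55, 0.1),  # long passes: success rate, number per possession
    (0.84, 3.4),  # short passes
]
# HYPOTHETICAL dribble values, for illustration only.
with_dribbles = passes_only + [(0.60, 0.5)]

rate_passes = completion_rate(passes_only)
rate_all = completion_rate(with_dribbles)
print(f"passes only: {rate_passes:.2f}, with dribbles: {rate_all:.2f}")
```

Every added action type has a success rate below 1, so each one can only lower the predicted completion rate.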

Our model has a significant trend: it increases its over-estimation of the possession completion rate as a team controls fewer possessions.

A scatterplot of the accuracy of the model and its relation with the number of possessions a team completes. The Y-axis shows the difference between the predicted and the actual number of successful possessions. It is calculated as (model-predicted number of successful possessions – actual number of successful possessions) / actual number of successful possessions. For New York and New England, the prediction is almost exactly the same as the actual data. For San Jose and Chicago, the prediction is more than 30% over-estimated compared to the actual data.
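The Y-axis measure in the scatterplot is a relative error. Here is a minimal sketch with made-up inputs; the function name and the numbers are hypothetical.

```python
def relative_overestimate(predicted, actual):
    """(predicted - actual) / actual; positive means the model over-estimates."""
    return (predicted - actual) / actual

# Hypothetical team: the model predicts 59.5 successful possessions, 50 observed.
example = relative_overestimate(59.5, 50.0)
print(f"{example:.0%}")
```

A value near zero corresponds to teams whose predictions almost match the data. A value above 0.30 corresponds to the most over-estimated teams.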

The model shows a strong negative correlation (r = -0.86) between the predicted possession completion rate and the number of possessions. This relationship and the model’s general over-estimation have two implications.

First, we have not included enough variables in the model. Our model over-estimates the possession completion rate for every team. Every success rate is a fractional value (0.xx), so as we multiply in more variables, we bring down the final possession completion rate. One variable that we are certainly missing is a player merely carrying the ball forward. It isn’t annotated as a specific action, but we can probably derive it from the data (a topic for future discussion). Our model assumes that any action, such as the pass, aims to advance the ball. Merely carrying the ball forward does the same thing. We also need to consider the locations of the events; penetrating through the center is different from doing so on the flank. Controlling the ball in the initial third is also distinct from doing the same thing in the middle third. Taking just these considerations into account, we already create 12 different variables for the completion of a possession (dribble vs. long pass vs. short pass, center vs. flank, initial vs. middle third).

The second implication of the model is that teams with fewer possessions rely on actions other than the pass and the dribble to complete the possession. If you look at Fig 4 again, you will see that a lot of the over-estimation happens with under-performing teams like Montreal, San Jose, or Colorado. Many of these teams sit back, suffer, and rely on the counter-attack. Therefore, they don’t play with a lot of positional structure in the offensive phase. You don’t pass a lot when you don’t have teammates in proper positions. These teams prey on the open space the opponent leaves behind. You can imagine that during a counter-attack, you don’t really pass or dribble that much. You carry the ball and run for a long time because you just don’t have a lot of passing options. The model captures that aspect of the game.

If we approach the game as a string of successful events, we also have a chance to determine which factors are limiting the chain; for example, a team may attempt too many unsuccessful dribbles. You can spot the problem by merely measuring the dribble success rate, but putting that value into a model like ours helps us estimate how much damage it is doing to the team’s chance creation.

As we group more and more variables into the possession, we will also need to deal with a subtler difference in each measure. At some point, we will need to assess the distribution of possessions and the statistical significance of differences.

Grouping the action events into the possession also makes the soccer game more like a basketball match with a precise definition of the offensive and defensive phases. The two games have always been considered the opposite of each other: a soccer team averages one to two goals while a basketball team makes around 50 field goals. On the surface, they have a 50-fold difference. But if we consider the number of possessions, then the games look more like each other. They both have about 100 possessions per team. We may be able to apply all those well-defined basketball analytical tools in the soccer game.

We have an embryonic model, but hopefully, it represents something close to a starting point for more optimization in the future.

American Soccer Analysis

Tiny Differences: How Changing Small Things Can Have Big Consequences