For technical directors and front offices in professional soccer, finding the next great player to fit their team’s style of play is of utmost importance. Gone (we hope) are the days of picking up players just to fill a roster. Each move is calculated and aimed at fitting the “system” of play that the manager, general manager, technical director, and assistant coaches have designed to give their team the best chance at victory. This change from the old days of soccer puts immense pressure on scouting directors and general managers to fill their roster efficiently, gaining maximum talent for minimal pay, all while conforming to the manager’s desired style of play.

Efficiency becomes even more important for small-market teams that do not have the budget of the massive clubs they have to compete against. These demands force scouting directors and general managers to have transfer options ready to go at a moment’s notice. With thousands of professional soccer players across the globe, how can this list be constructed without missing anyone who could turn out to be the next big superstar for your team?

When a team goes to the transfer market in search of their next player, front office staff often use phrases like, “I want a player that plays similar to this player.” Coaches and general managers are usually looking for someone who plays like, and has similar attributes to, a player who has impressed across the league and passes the critical eye test. Those attributes are often easy to measure: involvement in the game, creativity, passing ability, speed, and more. In a traditional search for a player on the market, it takes hours upon hours of video to find someone who performs even somewhat similarly to the reference player. This inefficiency hurts teams and front offices in the long run, not only because of the time wasted, but also because video alone offers no objective way to compare players to one another.

Using data effectively helps teams rid themselves of inefficiency in player recruitment, speeds up the process of finding players, and improves accuracy when measuring players. Clustering, a family of techniques from machine learning, allows us to measure players against one another and, more importantly, group similar players together. This process saves time and money, and helps provide an objective view of not only the level of the player, but also the style of play of the player we are trying to find.

For the purposes of this article, we are going to cluster forwards in MLS to illustrate how to create both an initial cluster of players and a second, more targeted cluster that better matches what we are looking for.

What is Clustering?

Clustering allows us to group similar things together through a process of unsupervised machine learning. In short, the algorithm looks for groups and patterns in the data that are not always apparent to the naked eye (you can read more about the technicalities and math behind clustering here).

Clustering can be used for many purposes other than finding players who play similarly, including grouping together teams with the same tactics, passing styles, and more. The same approach applies wherever similar things need to be grouped.

To reiterate from above, for the purposes of this article, we are going to use clustering as a means to explore similarities in style of play and levels of performance in Major League Soccer forwards in 2018 and 2019.

Reviewing the Raw Information

Before starting the clustering process, or any statistical or machine learning analysis for that matter, it is vitally important to go through the raw data. Doing so shows which metrics are important and which might be missing, how granular the data is, and what you can and cannot accomplish with the data provided. This data will often be provided by a club or federation, but if it’s not, it can be found on websites like American Soccer Analysis that provide data to the public for free.

Because we are clustering around forwards in this article, reviewing the raw data will allow us to remove extraneous variables that we know the coach or front office staff we are trying to help will not care about. In this case, we would know to remove metrics such as Successful Header Percentage in The Defensive Third, Tackles Made in the Defensive Third, as well as a few others that are defensive in nature and are not pertinent in measuring a forward’s overall ability.
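As a rough illustration of the pruning step described above, here is a minimal pandas sketch. The column names and the tiny table are made up for the example; an actual export will label its metrics differently.

```python
import pandas as pd

# Hypothetical column names standing in for the defensive metrics named above.
DEFENSIVE_COLUMNS = (
    "SuccessfulHeaderPctDefThird",
    "TacklesMadeDefThird",
)

def drop_irrelevant_metrics(df, columns=DEFENSIVE_COLUMNS):
    """Return a copy of df without the listed columns; missing ones are ignored."""
    return df.drop(columns=list(columns), errors="ignore")

# Tiny placeholder table, for illustration only (not real player data).
raw = pd.DataFrame({
    "Player": ["player_a", "player_b"],
    "xG": [0.0, 0.0],
    "TacklesMadeDefThird": [0, 0],
})
forwards = drop_irrelevant_metrics(raw)
print(list(forwards.columns))
```

Keeping the list of dropped columns in one place makes it easy to review with the coaching staff before any modeling starts.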

In addition to removing extraneous variables, reviewing raw data before starting a complicated and time-consuming analysis not only saves time later in the process, but also provides room for basic exploratory analysis at the start. Exploratory analysis can give both insight into the problem you are trying to solve and a more holistic view of it.

Clustering Put Into Action: The First Round

Out of the metrics that ASA provides, it is necessary to choose those that are going to be most relevant to forwards. Referring to the discussion above, the following metrics appeared to be the most relevant to accurately measure the overall performance of forwards (check our glossary for definitions of each term):

Pass Percentage, Average Distance of Forward Passes, Average Percent of Passes That Are Forward, Touch Percentage, Total Passes Completed, Average Shot Distance, Percentage of Unassisted Shots, Average xG Per Shot, Average xA Per Pass, Goals Minus xG, Assists Minus xA, Shots On Target, Goals, xG, Key Passes, Assists, xG+xA, Passes Attempted, Number of Chains Involved In, Percentage of Overall Chains Involved In, Number of Chains Involved in Taking the Last Shot, xG Per Chain

After selecting these metrics, the next step is figuring out how many groups, or clusters, to separate the players into. According to the “elbow method,” a common way to choose the number of clusters, the most appropriate number for our data is four (you can read more about the elbow method here).
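For readers who want to see the elbow method in practice, the sketch below computes the within-cluster sum of squares (scikit-learn calls it `inertia_`) for a range of cluster counts. It assumes scikit-learn and standardized inputs, and it runs on synthetic data rather than the ASA metrics, so it does not reproduce the result of four clusters reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def inertia_by_k(X, k_values, seed=0):
    """Fit K-Means for each k and return {k: within-cluster sum of squares}."""
    # Standardize so that no metric dominates just because of its units.
    X_scaled = StandardScaler().fit_transform(X)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_scaled).inertia_
        for k in k_values
    }

# Synthetic placeholder data: four loose groups in five dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=6.0, size=(4, 5))
X = np.vstack([c + rng.normal(size=(30, 5)) for c in centers])

inertias = inertia_by_k(X, range(1, 9))
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against k and looking for the point where the curve flattens is the "elbow"; the choice is a judgment call rather than a hard rule.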

Next, we plug the above metrics into an algorithm that performs K-Means clustering, a type of clustering that will efficiently “group similar data points together and discover underlying patterns.” After running the first clustering algorithm, the cluster below was the one containing the most notable names and what would generally be considered the “best” forwards across MLS. Note that the data is split into seasons to provide further granularity and understanding of performance trends from year to year.

Adama Diomande-2019
Bradley Wright-Phillips-2018
Danny Hoesen-2019
David Villa-2018
Diego Rubio-2018
Gustavo Bou-2019
Heber-2019
Jozy Altidore-2019
Kacper Przybylko-2019
Luis Silva-2018
Mauro Manotas-2019
Sebastian Giovinco-2018
Sergio Santos-2019
Valentin Castellanos-2019
Wayne Rooney-2018
Wayne Rooney-2019
Yordy Reyna-2018
Zlatan Ibrahimovic-2019
Zlatan Ibrahimovic-2018

After the initial clustering analysis we see some unsurprising names such as the legendary Wayne Rooney and Zlatan Ibrahimovic, but also some smaller names around the league, including Kacper Przybylko, who had a phenomenal 2019 season. Furthermore, we see that a big name, Josef Martinez, is left out of the “top” cluster (we’ll discuss why he was left out below).
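The first round described above can be sketched roughly as follows. This is an assumed setup, not the exact pipeline behind the list: it uses scikit-learn's `KMeans` on standardized metrics, treats each player-season as its own row, and runs on random placeholder values with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_player_seasons(df, metric_cols, n_clusters=4, seed=0):
    """Assign each player-season row to one of n_clusters K-Means clusters."""
    X = StandardScaler().fit_transform(df[metric_cols].to_numpy(dtype=float))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    out = df.copy()
    # One row per player per season, labeled like "Name-2018".
    out["PlayerSeason"] = out["Player"] + "-" + out["Season"].astype(str)
    out["Cluster"] = labels
    return out

# Random placeholder values, for illustration only (not real player data).
rng = np.random.default_rng(0)
metric_cols = ["xG", "Goals", "KeyPasses"]
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=metric_cols)
df.insert(0, "Player", [f"player_{i % 20}" for i in range(40)])
df.insert(1, "Season", [2018] * 20 + [2019] * 20)

clustered = cluster_player_seasons(df, metric_cols)
print(clustered.groupby("Cluster")["PlayerSeason"].apply(list))
```

K-Means cluster numbers carry no ranking, so identifying the “top” cluster still requires looking at which players and metric averages land in each group.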

Once the first round of clustering is complete, it is important to perform a principal component analysis, which allows for a more accurate understanding of which variables most affected the cluster assignment of each player.

The following variables were the most important in affecting which cluster a player was assigned to:

xGPerGame, GoalsPerGame, ShotsOnTargetPerGame, Percent of Touches in Buildup, and Number of Buildup Chains Involved In Per Game
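One common way to read variable importance out of a principal component analysis is to rank variables by the absolute size of their loadings on the leading component. The sketch below does that with scikit-learn's `PCA`; whether this matches the exact procedure used for the list above is an assumption, and the data and metric names are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def rank_variables_by_loading(X, names, component=0):
    """Rank variables by absolute loading on one principal component."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=component + 1).fit(X_scaled)
    loadings = np.abs(pca.components_[component])
    order = np.argsort(loadings)[::-1]  # largest loading first
    return [(names[i], float(loadings[i])) for i in order]

# Synthetic placeholder data: metric_a and metric_b move together,
# metric_c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

ranking = rank_variables_by_loading(X, ["metric_a", "metric_b", "metric_c"])
for name, loading in ranking:
    print(name, round(loading, 3))
```

Loadings describe where the variance in the data lies, which is related to but not the same as what drives cluster assignment, so the ranking is best treated as a guide to discuss with the front office.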

Following the results from the first clustering and the first principal component analysis, we have a better understanding of which style-of-play attributes affect which cluster a player is assigned to. Combined with the expert opinions of the front office, the “important variables” listed above will help us cluster with greater accuracy in the second round, where we will search for a cluster of forwards who are the best finishers in the league.

Clustering with Greater Accuracy: The Second Round

As a basic rule of thumb, the second round of any clustering should try to attain greater accuracy than the first round. Clarity and communication between the manager, front office, and the data scientist become incredibly important, as the data scientist should try to incorporate both the results from the first analysis and the goals and strategy of the front office in the next clustering process.

For example, if the club is in the market for a forward who is an excellent finisher, the “top” cluster should reflect that. If the club is in the market for a forward who can hold the ball up and provide for wingers, that is what the clustering process should reflect. Like all machine learning and data analysis, clustering is dependent on the needs of your team. Goals need to be communicated, or else clustering can only maintain a generic, topline level of usefulness rather than being specific to the needs of the club.

For the purposes of the second round of clustering, we are going to cluster around excellent finishers. These players do not necessarily need to be involved at a high rate in buildup, but are expected to be the main source of goals for their team and finish at an elite level. In technical terms, more weight will be given to the variables that are related to, and representative of, finishing.

The variables to which we add weight are G-xG, G-xGPerShot (you can read my last article on the importance of a high xG per shot here), and Goals Per Game.
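One simple way to weight variables in K-Means is to multiply the standardized columns by a factor before clustering, since a column scaled by w contributes w squared times as much to the squared Euclidean distance. The sketch below assumes that approach; the weight of 2.0, the `TouchPct` column, and the data are placeholders, as the actual weights are not stated here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed weights for the finishing variables; every other metric keeps weight 1.0.
FINISHING_WEIGHTS = {"G-xG": 2.0, "G-xGPerShot": 2.0, "GoalsPerGame": 2.0}

def weighted_kmeans(X, names, weights, n_clusters=4, seed=0):
    """Standardize, scale selected columns by their weights, then run K-Means."""
    X_scaled = StandardScaler().fit_transform(X)
    w = np.array([weights.get(name, 1.0) for name in names])
    X_weighted = X_scaled * w  # broadcasts one weight per column
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_weighted)
    return X_weighted, labels

# Random placeholder values, for illustration only (not real player data).
rng = np.random.default_rng(0)
names = ["G-xG", "G-xGPerShot", "GoalsPerGame", "TouchPct"]
X = rng.normal(size=(60, 4))

X_weighted, labels = weighted_kmeans(X, names, FINISHING_WEIGHTS)
print(X_weighted.std(axis=0).round(2), np.bincount(labels))
```

Weighting after standardization keeps the weights comparable across metrics; the right values are a judgment call to settle with the technical staff.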

After reclustering and weighting variables to reflect the best finishing forwards, the “top” cluster includes the following players.

Adama Diomande-2018
Adama Diomande-2019
Bradley Wright-Phillips-2018
Brian Fernandez-2019
Chris Wondolowski-2019
David Villa-2018
Diego Rubio-2018
Heber-2019
Josef Martinez-2018
Josef Martinez-2019
Jozy Altidore-2019
Kacper Przybylko-2019
Kei Kamara-2018
Mauro Manotas-2018
Raul Ruidiaz-2018
Sebastian Giovinco-2018
Valentin Castellanos-2019
Wayne Rooney-2018
Zdenek Ondrasek-2019
Zlatan Ibrahimovic-2018
Zlatan Ibrahimovic-2019

The Josef Martinez Example

If you are reading this article, you are probably asking: “How in the world can Josef Martinez, arguably one of the best forwards in MLS, be left off of the top list in the first cluster?” After initially asking that question myself, looking back through his metrics made it clear why he was not in the “top” cluster in the first run-through.

While Josef had finishing ability on par with players such as Zlatan Ibrahimovic or Wayne Rooney, his involvement in buildup play was well below average. The unweighted clustering placed him in a cluster with other forwards such as Chris Wondolowski and Ola Kamara, whose style of play is less involved in buildup than that of players such as Rooney or Bradley Wright-Phillips. Once weights were added to the variables in order to better represent players who finish at a high level, regardless of their amount of time spent in buildup, Josef Martinez was in the “top” cluster, where he belongs.

Limitations of this Article

It is important to acknowledge the limitations of the clustering in this article in order to better understand its applications and future use. Currently, we are using event-based metrics rather than tracking-based metrics to measure players. This means the data is generated from events on the field and their respective locations rather than from the locations of the players themselves. This limits what we are actually able to measure. With tracking data we could build metrics that we currently cannot measure, including players in the line of the shot, players between the ball and the goal, line-breaking passes, and more. With tracking data, far more becomes measurable.

Furthermore, positionality also affects our clustering process. For this article, I imagined a clustering of forwards, meaning a traditional “Number 9.” In order to do this, I filtered for positionality in the American Soccer Analysis database. This left me with only the players playing in a traditional “Number 9” role, rather than all players who have played as a forward. While this aspect of positionality may seem to be a minor inconvenience, in the long run it will hurt our understanding of the game and player recruitment efforts. It does not account for formations where there is not a traditional forward (False 9), and it does not include players in teams or formations where it does not truly matter if their technical position is out wide or central due to their active style of play. In sports like basketball, a player’s position is often more concrete. You know who the center is and you know who the point guard is. In soccer, with so much overlap and the lack of concrete play calling, it becomes more muddled and fluid.

Without tracking-based data, the only position groups that remain realistic targets for cluster analysis are offensive ones, such as forwards, wingers, central attacking midfielders, and possibly central midfielders. The difficulty of measuring defender ability in different situations without further context makes clustering around their true style of play very complicated. So much of being a successful defender is positioning and communication between the center backs, in addition to the relative space between the goalkeeper and the defenders. These aspects of the game cannot be measured solely with event data.

Lastly, and most importantly, the players included in this article are only players who have played in MLS over the past two seasons. In order to be of use in a team’s player recruitment, it is necessary to have data from leagues across the world, including both major leagues and non-major leagues. This allows a team to continuously check and double-check how their metrics and clustering perform, and also gives them the ability to find players in lower leagues who have the same style of play as some of the big names in major leagues. This process gives teams the ability to find players who are going to outperform what they are being paid and set the team up for long-term success: the true goal of player recruitment.

Conclusion

Clustering, an unsupervised form of machine learning, is an excellent way to help find the next diamond in the rough for your team in the player recruitment process. It is fairly straightforward, can be easily explained, and can help find players with a similar style of play to other players across the globe. It is currently how the big-name soccer data companies create likeness scores and compare one player to another. However, this can be done in house for a smaller cost in the long run, and can also be tailored to each front office and technical staff’s needs. Using clustering and data science as a real means for player recruitment will aid in making objective decisions that will help your team minimize risk and maximize upside in the long run.

None of this would be possible without clubs taking an active jump into the field of analytics. The perfect combination of communication, the eye-test, and machine learning and data science will bring a club to the next level of player recruitment and game analysis.