View the Project on GitHub onthestairs/k-means-clustering-football
The k-means clustering algorithm attempts to partition a set of data into k clusters by choosing k 'centre points' in such a way that the sum of the squared distances from each point to its nearest centre point is minimised, and then assigning each point to a cluster based on which centre it is closest to. Stated more simply, the algorithm attempts to split the data set into k clusters such that the items in each cluster are similar to one another, but dissimilar to the items in every other cluster. It is similar to the algorithm Google uses to group news articles which are on the same story.
Upon learning about this algorithm, I thought about the Manchester City Analytics Dataset released a couple of months ago. This was perfect to analyse. I would use the k-means clustering algorithm to divide the players up into groups of 'similar' players.
The dataset is a CSV file which contains a list of player performances - one row per performance for each player. The first task was to parse the data. I created a Player object which holds all the stats for that player. I only captured the numerical stats of performance in the game and omitted the meta stats ("Team Id", "Venue", etc). Then, I summed these stats up and divided by the number of games played to produce an 'average number of x per game' style stat.
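A minimal sketch of this parsing and averaging step is below. It is not the project's actual code: the function name, the "Player" column and the tiny sample rows are all assumptions made for illustration, and the real dataset's headers may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; the real dataset's headers may differ.
META_COLUMNS = {"Player", "Team Id", "Venue"}


def per_game_averages(csv_text, player_column="Player", meta_columns=META_COLUMNS):
    """Return {player: {stat: average per game}} from one-row-per-performance CSV text."""
    totals = defaultdict(lambda: defaultdict(float))
    games = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        player = row[player_column]
        games[player] += 1
        for column, value in row.items():
            if column in meta_columns:
                continue  # skip the meta stats, keep only numerical performance stats
            totals[player][column] += float(value)
    return {
        player: {stat: total / games[player] for stat, total in stats.items()}
        for player, stats in totals.items()
    }


# Made-up rows, purely to show the shape of the input.
sample = """Player,Team Id,Venue,Goals,Tackles Won
A,1,Home,2,1
A,1,Away,0,3
B,2,Home,1,0
"""
averages = per_game_averages(sample)
```

Here player A has two performances, so each of A's stats is the total over both rows divided by two.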
The next step was to normalise this data so stats which were high (passes per game for example) weren't overweighted against stats which were low (tackles per game for example). I did this by replacing each stat with its linear position between the lowest and highest values of that stat amongst all players. This meant each stat was now a value between 0 and 1.
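This kind of rescaling is usually called min-max normalisation. A short sketch follows, assuming the per-player averages are held in a dictionary of dictionaries. The function name is hypothetical, and mapping a stat that is identical for every player to 0 is my own choice rather than something the write-up specifies.

```python
def min_max_normalise(players):
    """Rescale every stat to the range [0, 1] across all players."""
    stats = sorted({stat for values in players.values() for stat in values})
    lows = {s: min(p.get(s, 0.0) for p in players.values()) for s in stats}
    highs = {s: max(p.get(s, 0.0) for p in players.values()) for s in stats}
    normalised = {}
    for name, values in players.items():
        normalised[name] = {}
        for s in stats:
            spread = highs[s] - lows[s]
            # Assumption: a stat that is the same for everyone is mapped to 0.
            normalised[name][s] = (values.get(s, 0.0) - lows[s]) / spread if spread else 0.0
    return normalised


# Made-up per-game averages on very different scales.
players = {
    "A": {"passes": 60.0, "tackles": 1.0},
    "B": {"passes": 20.0, "tackles": 3.0},
    "C": {"passes": 40.0, "tackles": 2.0},
}
scaled = min_max_normalise(players)
```

After scaling, the player with the most passes has a passes value of 1, the player with the fewest has 0, and player C sits halfway on both stats.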
From here, we can use the k-means algorithm to divide the players into k groups, each with a centre, minimising the distance between each player and its closest centre. As the algorithm depends on random starting conditions, I ran the algorithm 5 times and chose the best clustering. Typically, the clusterings ended up very similar.
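The run-several-times-and-keep-the-best idea can be sketched as follows. This uses the standard Lloyd's iteration with squared Euclidean distance as the cost, which is an assumption about how 'best' was judged. The function names, the seed and the toy points are all hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Index of the nearest centre for each point."""
    return [
        min(range(len(centres)), key=lambda c: squared_distance(p, centres[c]))
        for p in points
    ]


def k_means(points, k, rng, max_iterations=100):
    """One run of Lloyd's algorithm from random starting centres.

    Returns (centres, labels, cost), where cost is the sum of squared
    distances from each point to its assigned centre.
    """
    centres = rng.sample(points, k)
    labels = assign(points, centres)
    for _ in range(max_iterations):
        new_centres = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(centres[c])  # leave an empty cluster's centre in place
        new_labels = assign(points, new_centres)
        centres = new_centres
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
    cost = sum(squared_distance(p, centres[label]) for p, label in zip(points, labels))
    return centres, labels, cost


def best_of(points, k, restarts=5, seed=0):
    """Run k-means several times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    return min((k_means(points, k, rng) for _ in range(restarts)), key=lambda run: run[2])


# Toy two-dimensional points standing in for normalised player vectors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
centres, labels, cost = best_of(points, k=2)
```

A single run can settle in a poor local optimum depending on where the centres start, which is why comparing the cost across restarts is worthwhile.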
I then looked to find out what the 'key' features of each cluster were. This was calculated by seeing how different each stat of the centre of a cluster was to the same stat of the centre of all the other clusters. The more different the stat was, the more 'salient' to the cluster I declared the stat. In other words, the salient features of a cluster are those which define them against the other clusters.
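One plausible reading of this calculation is sketched below: for each stat, take the cluster centre's value minus the mean of that stat over the other centres, then rank stats by the size of the difference. The exact measure used in the project is not stated, so the function name and the scoring rule are assumptions. The sign of the score corresponds to the green (higher) and red (lower) colouring used in the results.

```python
def salient_features(centres, stat_names, cluster, top=10):
    """Rank stats by how far one cluster's centre sits from the other centres' mean.

    Returns (stat name, signed difference) pairs, largest absolute difference first.
    A positive difference means the cluster is higher than the others for that stat.
    """
    others = [c for i, c in enumerate(centres) if i != cluster]
    scored = []
    for j, name in enumerate(stat_names):
        other_mean = sum(c[j] for c in others) / len(others)
        scored.append((name, centres[cluster][j] - other_mean))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top]


# Made-up centres for three clusters over three normalised stats.
stat_names = ["Goals", "Total Clearances", "Pass Left"]
centres = [(0.9, 0.1, 0.5), (0.2, 0.8, 0.5), (0.1, 0.6, 0.4)]
ranked = salient_features(centres, stat_names, cluster=0)
```

For the first cluster in this toy data, Goals ranks first with a positive score and Total Clearances second with a negative score, which matches the intuition of a striker-like group.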
Initially upon doing this, I found that certain clusters were being arranged due to certain players' propensity for taking free-kicks or corners. I decided that these stats weren't relevant to how a player played during open play, so made an option within the program to discard them.
I also made an option to filter the players by the meta stats. This allowed me to discard goalkeepers, discard players that came on as a substitute, limit the set to only midfielders, etc. Another tweak I made was to only allow players who had played more than 10 games.
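A filter of this kind might look like the sketch below. The PlayerSummary class, its fields and the position labels are hypothetical stand-ins for whatever meta stats the dataset actually provides.

```python
from dataclasses import dataclass


@dataclass
class PlayerSummary:
    name: str
    position: str  # hypothetical meta stat, e.g. "Goalkeeper" or "Midfielder"
    games: int


def select_players(players, positions=None, min_games=10):
    """Keep players in the allowed positions who played more than min_games games."""
    return [
        p for p in players
        if (positions is None or p.position in positions) and p.games > min_games
    ]


# Made-up players, purely for illustration.
squad = [
    PlayerSummary("A", "Goalkeeper", 38),
    PlayerSummary("B", "Midfielder", 25),
    PlayerSummary("C", "Midfielder", 10),
    PlayerSummary("D", "Defender", 30),
]
midfielders = select_players(squad, positions={"Midfielder"})
outfield = select_players(squad, positions={"Midfielder", "Defender"})
```

Player C is dropped in both cases because exactly 10 games does not satisfy 'more than 10'.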
I was initially curious to see how the players would be divided if we ran the algorithm on all outfield players with the number of clusters being 10 (i.e. the number of outfield players on the pitch). Here are the results. The salient features are ordered by how indicative they are - green means the players in that group typically have a higher stat for that feature than players in other clusters, red means the players in that group typically have a lower stat for that feature than players in other clusters.
Big Chances, Attempts Open Play on target, Goals from Inside Box, Shots On from Inside Box, Goals Open Play, Shots On Target inc goals, Goals, Shots Cleared off Line, Offsides, Shots Cleared off Line Inside Area
Wayne Rooney, Emmanuel Adebayor, Fernando Torres, Shane Long, Danny Welbeck, Scott Sinclair, Clint Dempsey, Theo Walcott, Bobby Zamora, Demba Ba, Danny Graham, Sergio Agüero, Darren Bent, Yakubu, Peter Odemwingie, Daniel Sturridge, Robin van Persie
This appears to be a group of strikers. It's notable that both Walcott and Sturridge, two wingers who would prefer to play as strikers, are in this group.
Headed Clearances, Total Clearances, Other Clearances, Blocks, Touches open play final third, Dispossessed, Unsuccessful Long Balls, Key Passes, Pass Backward, Unsuccessful Ball Touch
Leon Barnett, Gareth McAuley, Chris Smalling, Wes Brown, Scott Dann, Michael Turner, Daniel Gabbidon, Carlos Cuéllar, Zak Whitbread, James Collins, Michael Williamson, Gary Cahill, Christopher Samba, Gaël Givet, Sylvain Distin, Anton Ferdinand, Richard Stearman, Richard Dunne, David Wheater, Tim Ream, Grant Hanley, Zat Knight, Clint Hill, Steven Taylor, Johnny Heitinga, Christophe Berra, Roger Johnson, Jonas Olsson, Russell Martin, Phil Jagielka, Robert Huth, Ryan Shawcross, Elliott Ward
This appears to be the group of centre backs who play for teams who don't always pass it out from the back. It is interesting to find Chris Smalling in this group - all the other Manchester United centre backs are in another cluster seen below.
Unsuccessful open play crosses, Unsuccessful crosses in the air, Successful open play crosses, Successful crosses in the air, Key Passes, Touches open play final third, Assists, Blocked Shots from Inside Box, Pass Backward, Total Unsuccessful Passes All
Matthew Jarvis, Stewart Downing, James McClean, Jean Beausejour, Juan Mata, Antonio Valencia, Gareth Bale, Gylfi Sigurdsson, Martin Petrov, Damien Duff, Nani, Charles N'Zogbia
Traditional wingers. Put a lot of crosses in. It is surprising to see that Sigurdsson is in this group, having played for most of the season as a central player.
Unsuccessful crosses in the air, Unsuccessful open play crosses, Successful open play crosses, Key Throw In, Successful crosses in the air, Tackles Won, Total Unsuccessful Passes All, Touches open play opp six yards, Shots On from Inside Box, Blocked Shots from Inside Box
Chris Brunt, Bacary Sagna, José Enrique, Sebastian Larsson, Pablo Zabaleta, Elliott Bennett, Kyle Walker, José Bosingwa, Phil Jones, Morten Gamst Pedersen, Ryan Taylor, Gaël Clichy, Jordan Henderson, Glen Johnson, James Milner, Leighton Baines, Joey Barton, Ryan Shotton
This group seems to be mainly composed of full backs but with a couple of midfielders.
Pass Forward, Ground Duels lost, Foul Won Penalty, Unsuccessful Passes Middle third, Duels lost, Unsuccessful Long Passes, Total Unsuccessful Passes All, Successful Long Passes, Unsuccessful Ball Touch, Fouls Won in Danger Area inc pens
Mark Davies, Jonás Gutiérrez, Wes Hoolahan, Paul Scharner, Kevin Doyle, Leon Osman, Andrew Crofts, Micah Richards, Stéphane Sessegnon, Ramires, Tim Cahill, Gabriel Agbonlahor, Shaun Wright-Phillips, Jonathan Walters, Junior Hoilett, Mauro Formica, Jamie Mackie, Andrew Surman, Victor Moses
I find it hard to intuitively categorise this group. It appears to be mainly central midfielders and wingers. Micah Richards looks a bit out of place in this group.
Tackles Lost, Tackles Won, Recoveries, Successful Long Passes, Successful Passes Middle third, Interceptions, Total Successful Passes Excl Crosses Corners, Total Successful Passes All, Pass Left, Successful Short Passes
Patrice Evra, Youssuf Mulumbu, Moussa Dembélé, Scott Parker, Stilian Petrov, Stephen Ireland, Benoit Assou-Ekotto, Mohamed Diamé, Karl Henry, Lee Cattermole, Yohan Cabaye, Alexandre Song, Michael Carrick, Bradley Johnson, Joe Allen, Ashley Cole, Angel Rangel, Craig Gardner, David Edwards, Cheik Tioté, Steve Sidwell, Nigel Reo-Coker, Steven N'Zonzi, Danny Murphy, Leon Britton, James McArthur, James McCarthy, Glenn Whelan, Jay Spearing, Gareth Barry, Marouane Fellaini, David Fox, Alejandro Faurlin
A group of 'tackling' midfielders. Typically seem to be those who sit deep. Interesting to note again a couple of full backs in a group of midfielders.
Unsuccessful Long Balls, Unsuccessful Long Passes, Attempts Open Play off target, Attempts Open Play on target, Successful Lay-Offs, Dispossessed, Shots Off Target inc woodwork, Unsuccessful Passes Defensive third, Shots On Target inc goals, Key Passes
Andy Wilkinson, Steven Reid, Ronald Zubar, Alan Hutton, Armand Traore, Emmerson Boyce, Bradley Orr, Chris Baird, Grétar Steinsson, Dean Whitehead, Marc Wilson, Sam Ricketts, Neil Taylor, Taye Taiwo, Liam Ridgewell, Kyle Naughton, Billy Jones, Nicky Shorey, Philip Neville, Stephen Warnock, Jason Lowe, Tony Hibbert, Stephen Ward, Shaun Derry, Jack Colback, Danny Simpson, Davide Santon, Phillip Bardsley, Dedryck Boyata, Stephen Kelly, Luke Young, Martin Olsson, Nedum Onuoha, Maynor Figueroa, John O'Shea, John Arne Riise, Kieran Richardson, Marc Tierney, Paul Robinson
This cluster is made up of full backs who tend to play for teams who are lower in the table.
Through Ball, Successful Passes Opposition Half, Successful Passes Final third, Key Passes, Pass Left, Total Successful Passes All, Total Successful Passes Excl Crosses Corners, Successful Short Passes, Shots Off Target Outside Box, Blocked Shots Outside Box
Aaron Ramsey, Jamie O'Hara, Yaya Touré, Luka Modric, Steven Pienaar, Mikel Arteta, Samir Nasri, Frank Lampard, Tomas Rosicky, Raul Meireles, Ryan Giggs, Steven Gerrard, Adel Taarabt, James Morrison, Charlie Adam, David Silva
Creative centre midfielders.
Successful Passes Defensive third, Successful Passes Own Half, Touches open play final third, Dispossessed, Key Passes, Ground Duels lost, Headed Clearances, Unsuccessful Ball Touch, Attempts Open Play off target, Total Clearances
Gary Caldwell, Ashley Williams, Younes Kaboul, Steven Caulker, Laurent Koscielny, Jamie Carragher, Joleon Lescott, Daniel Agger, Brede Hangeland, Garry Monk, Branislav Ivanovic, John Terry, Jonny Evans, Per Mertesacker, Martin Skrtel, Vincent Kompany, William Gallas, Thomas Vermaelen, Antolin Alcaraz, Aaron Hughes, Rio Ferdinand, Ledley King, Fabricio Coloccini, David Luiz, Philippe Senderos
Ball playing centre backs. It is interesting to note Coloccini is the only player without a teammate in this group. Perhaps good evidence to say he would fit in to a team more focussed on ball retention?
Headed Shots On Target, Unsuccessful Flick-Ons, Successful Flick-Ons, Headed Goals, Aerial Duels lost, Touches open play opp six yards, Unsuccessful Lay-Offs, Offsides, Unsuccessful Ball Touch, Headed Shots Off Target
Steven Fletcher, Peter Crouch, Steve Morison, Grant Holt, Andy Carroll, Luis Suárez, Nicklas Bendtner
Target men. Curiously we also see Suárez, a player we would not intuitively expect to find in this group.
Next, I decided to see how the strikers in the league would be partitioned. I chose k to be 3. I included all players who had played more than five full 90 minutes. Here are the results.
Big Chances, Goals Open Play, Shots On Target inc goals, Shots On from Inside Box, Attempts Open Play on target, Goals from Inside Box, Goals, Goals from Outside Box, Shots Off Target inc woodwork, Shots Off from Inside Box
Robin van Persie, Jermain Defoe, Wayne Rooney, Danny Welbeck, Clint Dempsey, Demba Ba, Luis Suárez, Mario Balotelli, Sergio Agüero, Edin Dzeko, Emmanuel Adebayor
Group who get a lot of shots off.
Total Clearances, Take-Ons Overrun, Headed Clearances, Other Clearances, Blocked Shots, Headed Shots On Target, Successful Flick-Ons, Blocked Shots from Inside Box, Unsuccessful Flick-Ons, Dispossessed
Bobby Zamora, Grant Holt, Steven Fletcher, Shane Long, Didier Drogba, Peter Crouch, Darren Bent, Steve Morison, Tim Cahill, Andy Carroll, Yakubu, Gabriel Agbonlahor, Danny Graham, Kevin Davies, Jonathan Walters, Heidar Helguson, Nicklas Bendtner
Target men.
Total Clearances, Goals from Inside Box, Headed Clearances, Headed Shots On Target, Touches open play opp six yards, Headed Goals, Goals, Goals Open Play, Attempts from Corners on target, Big Chances
Stéphane Sessegnon, Moussa Dembélé, Kevin Doyle, Peter Odemwingie, Andrew Johnson, Junior Hoilett, Fernando Torres
Strikers who don't get many goals, whether due to poor form or design. Surely Torres of 09/10 wouldn't find himself in this group.
Next I shall look at central midfielders at teams which finished in the top 6 last season. I chose k=4.
Offsides, Goals from Corners, Attempts from Throws off target, Shots Off from Inside Box, Big Chances, Headed Blocked Shots, Unsuccessful Lay-Offs, Take-Ons Overrun, Touches open play opp box, Winning Goal
Sergio Agüero
Opta questionably defines Agüero as a midfielder (at the top of a 4-2-3-1). Clearly he is more of a goal scorer, and so very different from the other players. He gets his own cluster.
Headed Shots On Target, Assists, Through Ball, Successful Flick-Ons, Goals from Corners, Goals from Outside Box, Key Passes, Shots Cleared off Line, Shots Cleared off Line Inside Area, Successful crosses in the air
Rafael van der Vaart, Frank Lampard, Aaron Ramsey, Ryan Giggs, Tomas Rosicky
Attacking playmakers. Create a lot of chances.
Touches open play opp six yards, Shots On from Inside Box, Goals from Inside Box, Big Chances, Headed Blocked Shots, Dispossessed, Unsuccessful Lay-Offs, Goals Open Play, Winning Goal, Touches open play opp box
Gareth Barry, Michael Carrick, Darren Fletcher, Yohan Cabaye, Yaya Touré, Raul Meireles, Mikel Arteta, James Milner, Scott Parker, Nigel de Jong, Luka Modric, Alexandre Song, Danny Guthrie
Deep lying playmakers.
Headed Clearances, Total Clearances, Aerial Duels won, Goals Open Play, Blocked Shots from Inside Box, Touches open play final third, Headed Blocked Shots, Goals, Goals from Inside Box, Winning Goal
Michael Essien, Ramires, Cheik Tioté, Sandro, Oriol Romeu, John Obi Mikel
Defensive midfielders. Four are Chelsea players, which indicates that as a team they have a different approach to the other top 6 teams. It seems Tioté could be a good signing to replace Essien.
I then decided to look at teams instead of players. I aggregated all the stats from individual players into an overall team statistic. I decided to see if a split into two groups would align with the final league position. This yielded an illuminating partition.
Through Ball, Successful Short Passes, Successful Passes Middle third, Total Successful Passes Excl Crosses Corners, Total Successful Passes All, Pass Right, Big Chances, Pass Left, Attempts Open Play on target, Successful Passes Own Half
Liverpool, Fulham, Manchester City, Manchester United, Arsenal, Tottenham Hotspur, Chelsea, Swansea City
The top teams. Swansea, who finished 11th, are the lowest placed team in this group.
Through Ball, Successful Short Passes, Successful Passes Middle third, Total Successful Passes Excl Crosses Corners, Total Successful Passes All, Pass Right, Big Chances, Pass Left, Attempts Open Play on target, Successful Passes Own Half
Everton, Wolverhampton Wanderers, Norwich City, Bolton Wanderers, West Bromwich Albion, Queens Park Rangers, Wigan Athletic, Newcastle United, Aston Villa, Blackburn Rovers, Sunderland, Stoke City
Mainly composed of teams in the bottom half. It is interesting that 5th placed Newcastle are in this cluster. They are stylistically different from the other top teams.
As shown by the examples above, this kind of partitioning can give us an interesting insight into the types of players footballers are. Possible real world applications could include identifying players who would be better off playing in a different role, identifying players who don't fit in with a system and also identifying replacement signings.
The stats provided in the data set are all raw counts. There are no rate stats such as pass completion rate, only counts of successful passes and unsuccessful passes. Rates like these could be fairly easily calculated by the program and included as features of the player.
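As a rough sketch of how such a derived feature could be added, the snippet below computes a pass completion rate from the two pass totals that appear in the results above. The function names and the "Pass Completion Rate" key are hypothetical, and returning 0 for a player with no pass attempts is an assumption.

```python
def completion_rate(successful, unsuccessful):
    """Fraction of attempts that succeeded; 0.0 when there were no attempts."""
    attempts = successful + unsuccessful
    return successful / attempts if attempts else 0.0


def add_pass_completion(stats):
    """Return a copy of a player's stats with a derived pass completion rate added."""
    enriched = dict(stats)
    enriched["Pass Completion Rate"] = completion_rate(
        stats.get("Total Successful Passes All", 0.0),
        stats.get("Total Unsuccessful Passes All", 0.0),
    )
    return enriched


# Made-up per-game averages for one player.
player = {"Total Successful Passes All": 30.0, "Total Unsuccessful Passes All": 10.0}
enriched = add_pass_completion(player)
```

Because the rate already lies between 0 and 1, it sits naturally alongside the normalised count stats, although normalising it with the others would still spread it across the full range.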
Another weakness is that each stat is given an equal weight. This means that if there are ten stats regarding passing, and five regarding tackling, the algorithm treats the passing stats as twice as important. This could be solved by subjectively assigning a weight to each statistic.
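One way to apply per-stat weights without changing the clustering code is to scale each normalised stat by the square root of its weight before clustering, so that squared distances end up weighted by the chosen values. The sketch below shows this, together with one possible (not subjective) weighting that gives each group of related stats the same total weight. Both function names and the grouping labels are assumptions.

```python
import math


def apply_weights(vector, weights):
    """Scale each coordinate by sqrt(weight) so squared distances are weighted by `weights`."""
    return tuple(x * math.sqrt(w) for x, w in zip(vector, weights))


def group_balanced_weights(groups):
    """Give each stat weight 1 / (size of its group) so every group counts equally overall."""
    counts = {}
    for group in groups:
        counts[group] = counts.get(group, 0) + 1
    return [1.0 / counts[group] for group in groups]


# Hypothetical grouping: two passing stats and one tackling stat.
groups = ["passing", "passing", "tackling"]
weights = group_balanced_weights(groups)
weighted = apply_weights((1.0, 1.0, 1.0), weights)
```

With these weights the two passing stats together contribute as much to a squared distance as the single tackling stat does.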
I have tried to make the code easily adaptable so feel free to fork the code and run your own clusterings. If you do not have the technical skills but have a group of players you would like to partition (for example, substitute full backs away from home into 5 groups), email me and I would be happy to run it for you.