Monday, 29 January 2018

Data Revolution in Football: Some Thoughts

I did this interview with James Lorenzo a while back for a series that the video team over at Football Whispers have been working on called ‘The Science of Football’, which has featured prominent names in football data analytics like Omar Chaudhuri, David Sumpter and Chris Anderson.

I thought I’d use the video coming out as an opportunity to put in writing some thoughts and arguments which I would have wanted to make had I not gotten so flustered talking in front of a camera in a language that is not my first.

James asked me about passion in football, an element which in his words is many times “thrown at the stats community as something you can’t measure”, which forms the basis of many arguments to discredit the use of data analytics in football. My honest answer to the question of whether we can use data to measure passion on the pitch should have been “Uuuhhhm, maybe? Maybe if we clarify what exactly you mean by passion we can try to talk about it…”. Doesn’t make for a compelling viewing though. I can maybe indulge here though: maybe passion has something to do with trying hard, tracking back in defence even when you have over-committed in attack. If this is the case, then we potentially can: a passionate player will perform more of these actions, and if these actions somehow directly improve his team’s chances of winning matches (for example by defusing a dangerous counter-attack), then a performance model should pick up on this signal.

What about then if, instead of directly improving a team’s performance in such a tangible way (stopping a dangerous situation for the opposition), passion shines through in much more innocuous situations? Perhaps the passionate player hounds his mark into the touchline and makes him stumble over himself and give away a throw-in on the halfway line. This throw-in will hardly raise any direct signal on the outcome of the match, right? But perhaps these sorts of actions inspire the passionate player’s teammates and rile up their energy, which is why he is important to the team, a coach might argue. Well, if that’s the case, then an intelligent modeller might still be able to pick up on this signal: if the effect is real, then teams that have many of these actions (opposition player hounded out of play) should win more matches. Notice that the model doesn’t even have to know that it’s picking up on passion, but it would still rate these players higher.

I’m obviously being overly simplistic with the examples above to prove the point that everything that happens on a pitch can, in perhaps subtle but very real ways, creep into signals that are detectable in data. There’s also the very real chance that I’m wide of the mark here. Perhaps coaches pick up on the signal of what they call passion in the dressing room, when a player is an excellent motivator for his team-mates. I’m pretty sure Opta hasn’t started coding events in the dressing room yet. Or perhaps the experienced coach can pick up on subtle signals of a player’s body language and know when he’s trying hard in a way that the data we have available simply can’t. After all, there are so many fields in which the human brain is excellent at picking up signals, much better than data analytics. But as amazing as our brains are, their processing of information is also incredibly prone to mistakes and omissions that can throw us way off.

Take the example of the diagnosis of ‘female hysteria’ by British physicians of the 19th century. I’m using this overtly tongue-in-cheek example because doctors in 19th century society were surely by no standards ‘stupid’ people, but probably quite the opposite; they must have been the brightest people of their time. Nevertheless, no matter how intelligent these men must have been, the (in hindsight) obviously ridiculous diagnosis of female hysteria was freely dished out to any woman presenting a wide array of symptoms. The doctors were simply too accustomed to a term that had always been used in the medical circles of their time, and their minds couldn’t fathom challenging it. They had seen it many times before, knew what it was about and its characteristics, and they knew that a woman was suffering from hysteria when they examined her. These intelligent men of the 19th century weren’t actively being stubborn or resistant to change; it is simply human nature to act in this way, with so many biases that we wouldn’t even believe.

Now, the key point that I want you to understand is that, in my mind, this reflection in no way undermines or dismisses the knowledge of medicine of these 19th century physicians. Truth be told, if you were involved in a freak accident tomorrow and were bleeding badly and needed urgent medical attention from a passerby, who would you rather that passerby be: a 19th century physician, or me smugly reading off Wikipedia how wrong these doctors were in their diagnoses? Not me? Really???

Similarly, I in no way dismiss the organic knowledge about football of coaches and scouts. I love football, and I would go gaga if I ever got to meet Pep Guardiola or Arsene Wenger in person. I totally admire coaches and love to hear them speak about the game. So trust me, my point is nowhere near what some skeptics think people like me who do football analytics think: “these dumb coaches know nothing about football, they’re ignorant cavemen”. My point is simply that, if you were in that freak accident, what you really want for a passerby is a 21st century doctor who has gone to medical school and embraced the sophistication and advancement of his trade, and is open and willing to get medical knowledge from all sources which can provide it, be it from his super experienced mentor who was a doctor back in the 19th century, or from the latest cutting edge medical studies. You want a doctor who understands that medicine is the subject of study of a wide range of domains of knowledge, and who can pick and choose how to combine these different sources into making sure you survive.

I think the “numbers can’t understand passion” argument is a lazy one made by people who simply don’t have any sort of grasp of what data science actually is. I also think it is a dangerous one to make because for the younger generations of coaches coming through now, it will be easy to cling on to that meaningless phrase as an excuse to disregard and not make an effort to understand something which they perhaps find challenging, and they will inevitably be left behind as football undergoes this new wave of sophistication.

Thursday, 25 January 2018

LDA Model to Learn and Assign Team and Player Playing Style

I wrote this paper to be submitted for the 2018 MIT Sloan Sports Analytics Conference paper competition, featuring some interesting research that’s being done inside the Football Whispers Data Science team. Sadly, the paper didn’t make the final cut for the conference, but I figured I’d share it here for everyone to see as the methodology is pretty interesting and the applications have the great attribute of being both profound and accessible for everyone. Hopefully we’ll see some of these applications featuring in the Football Whispers site and editorial content soon.
If you can’t be bothered to read the whole paper (which can get unnecessarily technical to be fair), here’s a re-write of its main content written with a less ‘academic’ narrative.
We all know that football is a complex subject to study under the lens of data analysis and statistics. Football is a dynamic and fluid invasion sport, with highly interdependent events occurring simultaneously and continuously. As a consequence, a lot of applied research into football data falls into the trap of being non-descriptive; essentially a “black box” that seems to declare rigid answers that rarely trickle down in descriptive, applicable or accessible ways to short-term stakeholders like clubs and coaches interested in winning the next game or planning their tactical system. Perhaps this flaw goes a long way in explaining why football analytics meets such resistance from coaches, scouts and others when in reality, both lines of work should embrace each other as treasured allies working together towards the same goal.

What football analytics (or at least a reasonable portion of it) should really aim for is to balance out both objectives: it should leverage the insight to be extracted from huge amounts of complex data that the brains of coaches and scouts cannot process, but should also make sure its results are descriptive and accessible for the more organic domains of knowledge that coaches have about the game. If analytics knowledge and coaching knowledge are incompatible, it is ultimately football that is missing out.

The methodology that we used for this paper is inspired by an area of research that faces very similar circumstances: Natural Language Processing (NLP). More specifically, Topic Extraction, which is concerned with automatically sorting text documents into the different semantic topics that constitute them. In the information age, automatic methods for classifying large amounts of documents semantically are incredibly practical. Burgeoning fields such as digital marketing and sentiment analysis of social media content rely heavily on the scalability of text mining: most humans can classify tweets about a brand into different ‘sentiment categories’; but, just as in the case of our treasured coaches, the manpower needed to do this across the vast quantities of data available is completely unfeasible. Likewise, if the results from Topic Extraction don’t line up with the natural categories and sentiments that human readers would sort documents into, they become extremely hard to leverage for, say, a marketing agency.

Our research takes a page out of the Topic Extraction handbook and repurposes one of its foremost models called Latent Dirichlet Allocation (LDA) to be applied to football data. I won’t get into the exact details of our conceptualisation of styles in football within the framework of topic extraction, but the basic idea is that just as different words in documents determine the document’s topic, so too should different features in a football match determine the style of the teams/players involved. When an LDA model is fit to a set of documents, it will produce a set of ‘topics’ characterised by key words. The image shows the key words associated to each topic of an LDA model fit to a large set of news articles.

By looking at the key words we can naturally imagine what subject each ‘topic’ refers to. In the example, topic 1 corresponds to technology articles, topic 2 probably identifies sales ads while topic 3 includes news articles of Eastern Europe, etc. LDA’s major plaudits stem from the fact that it is unsupervised: the algorithm picks up on these reasonable and natural semantic topics that differentiate the different news articles with zero human intervention.

After a model is fit, it can be used to process new documents and will classify them as a mixture of the bespoke topics, which basically means it will say things like “this document is 30% technology, 40% sale ad, etc”. It does this by analysing the frequency of different key words appearing in the document. The thinking process behind our methodology should have become clear by now: an LDA model will analyse the frequency of features of a team in a match, such as ‘long ball into opposition half’, ‘touches’, ‘interception’, etc; and classify it into a mixture of the different styles it learned.
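To make the "mixture of styles" idea a bit more concrete, here is a minimal sketch of the second step only: given styles that have already been learned, estimate what share of a match belongs to each one from its feature counts. This is not the LDA fit described in the paper; it is a much simpler expectation-maximisation estimate of mixture weights against fixed styles, and the style names, feature names and probabilities are all invented for illustration.

```python
from collections import Counter

# Hypothetical styles ("topics"): probability of each feature under each style.
# Names and numbers are made up for illustration, not taken from the paper.
STYLES = {
    "direct": {"long_ball": 0.6, "touch": 0.2, "interception": 0.2},
    "possession": {"long_ball": 0.1, "touch": 0.8, "interception": 0.1},
}

def style_mixture(features, styles=STYLES, iterations=200):
    """Estimate the share of each fixed style in one match (simple EM, not full LDA)."""
    counts = Counter(features)
    total = sum(counts.values())
    names = list(styles)
    weights = {name: 1.0 / len(names) for name in names}
    for _ in range(iterations):
        updated = {name: 0.0 for name in names}
        for feature, n in counts.items():
            # E-step: how much of this feature each style is responsible for.
            joint = {s: weights[s] * styles[s].get(feature, 1e-12) for s in names}
            norm = sum(joint.values())
            for s in names:
                updated[s] += n * joint[s] / norm
        # M-step: the new weights are the normalised responsibilities.
        weights = {s: updated[s] / total for s in names}
    return weights

# One match treated as a "document" whose "words" are event features.
match = ["touch"] * 70 + ["long_ball"] * 20 + ["interception"] * 10
mixture = style_mixture(match)
print({style: round(share, 2) for style, share in mixture.items()})
```

The shares always add up to 1, which is exactly the "30% technology, 40% sale ad" kind of statement mentioned above.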

Our model was trained on data from Europe’s big 5 leagues for the 2016-17 and 2017-18 seasons. These are roughly the topics it learned:

With a trained model, we can think of a wide range of applications by using the model to classify matches from teams in different contexts. ‘Radar charts’ are a convenient way to superimpose the percentages corresponding to different styles.

REMARK: Radar charts are a common feature in football data analytics, but in a very different use case which might make our visualisations ‘miss the point’ if the main difference isn’t explained: in traditional uses of radar charts, a team/player’s chart can be indefinitely large in all axes of the chart. If an axis is ‘xG’, there’s no real limit on how large this value can be for a team. In our use of radar charts, the ‘percentage’ on each category isn’t related to total volumes but rather relative frequencies and as such they must add up to 100% between all of them, and so a team’s radar simply cannot be very large in all the axes. It is important that the reader keeps all this in mind while interpreting the radar charts below.

A snapshot impression of a team’s playing style can be obtained by plotting the average percentage in which their matches fall into each category.
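As a small illustration of that averaging, and of the remark that the radar percentages must add up to 100%, the sketch below averages per-match style shares into a single team profile. The style names and numbers are hypothetical.

```python
def average_style(match_mixtures):
    """Average per-match style shares (each match sums to 1) into percentages."""
    styles = sorted({style for mixture in match_mixtures for style in mixture})
    n_matches = len(match_mixtures)
    return {
        style: 100.0 * sum(m.get(style, 0.0) for m in match_mixtures) / n_matches
        for style in styles
    }

# Made-up mixtures for three matches of a hypothetical team.
matches = [
    {"direct": 0.5, "possession": 0.3, "pressing": 0.2},
    {"direct": 0.2, "possession": 0.6, "pressing": 0.2},
    {"direct": 0.2, "possession": 0.6, "pressing": 0.2},
]
profile = average_style(matches)
print(profile)
```

Because every match sums to 100%, the averaged profile does too, so a radar built from it cannot be large on every axis at once.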
Another interesting use is comparing the style of match a team plays with or without a player, which can have very real and straightforward impact on team selection and squad management. Paul Pogba, Manchester United’s record signing, is a prime example of a player whose absence is heavily felt in terms of the style.
In direct selection dilemmas, like Danny Rose versus Ben Davies at left-back for Spurs, this methodology can provide context for Pochettino to make his choice:

Notice how the profiles are practically mirrored from each other: when Rose is playing Davies is not and vice versa.

Similarly, we can compare the styles of teams before and after a managerial change to see how the manager has impacted the type of football they play. Allardyce taking over at Everton provides a prime example of a team whose style clearly changed with a managerial shake up.

Marcelino taking over Valencia provides another compelling example considering Valencia’s remarkable upturn in performance this season:


LDA Model for Players

This exact same methodology can be applied to the frequency of features a player performs as opposed to the whole team. An LDA model fitted to data from midfielders across Europe’s big 5 leagues found the following set of topics:

This result gives us a chance to further the point of ‘relative proportions’ rather than ‘absolute volumes’. It would be hard to argue that Mesut Ozil isn’t a proficient passer, or that somehow Fellaini is a more proficient passer than he is! When interpreting these radars the reader has to remember that the question is: which style categories do the player’s features seem to stem from the most, in proportionate terms? Ozil can have high volumes of the features related to ‘Proficient Passing’, and most probably much higher than Fellaini. However, proportionately, his features are more aligned with ‘Chance Creation’ than with ‘Proficient Passing’.

A similar model for defenders is presented below.

As an example, this methodology nicely highlights the impact that playing under Guardiola’s all conquering Manchester City side this season has had on ex-Spurs full-back Kyle Walker.

Finally, this methodology also provides a very clean and clear framework for ‘similar player’ suggestions. When combined with the performance ratings for players that we have also developed here at Football Whispers, we can imagine a simple and elegant ‘similarity’ concept along these two key axes: playing style and performance level.

This particular application faces stern competition from many previous methods, like clustering based methods or even what I presented at the Opta Pro Forum in 2017. However, this methodology has the upper hand in the descriptiveness of its suggestion: everyone who uses it can immediately digest exactly what it is saying, as opposed to other ‘similarity’ classifiers which can at times seem “black box-y”.

Lionel Messi Similarity Recommender

The appeal of this approach is how digestible it is. Lewandowski performs to a similar standard to Messi but is a different style of player. Dybala and Alexis are similar types of players but don’t perform to the same standard. Neymar and Insigne are similar in both style and quality.
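A rough sketch of the two-axis idea is below. The post does not say how style similarity is measured, so the Euclidean distance between style mixtures, the thresholds and every number here are assumptions, and the players are hypothetical placeholders.

```python
import math

# Hypothetical players: (style mixture over the same ordered styles, rating).
# All values are invented for illustration.
PLAYERS = {
    "target": ((0.6, 0.3, 0.1), 9.0),
    "same_style_same_level": ((0.55, 0.35, 0.10), 8.8),
    "same_style_lower_level": ((0.6, 0.3, 0.1), 6.5),
    "other_style_same_level": ((0.1, 0.2, 0.7), 9.0),
}

def compare(target, candidate, players=PLAYERS):
    """Return (style gap, rating gap) between two players."""
    (style_t, rating_t), (style_c, rating_c) = players[target], players[candidate]
    return math.dist(style_t, style_c), abs(rating_t - rating_c)

def similar_players(target, max_style_gap=0.2, max_rating_gap=0.5, players=PLAYERS):
    """Players close to the target on both axes (thresholds are arbitrary)."""
    matches = []
    for name in players:
        if name == target:
            continue
        style_gap, rating_gap = compare(target, name, players)
        if style_gap <= max_style_gap and rating_gap <= max_rating_gap:
            matches.append(name)
    return matches

print(similar_players("target"))
```

Keeping the two gaps separate is what makes the output digestible: a candidate can be reported as "same style, lower level" or "same level, different style" rather than as a single opaque score.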

Jorginho Similarity Recommender

This last application is kind of too simplistic to make any grandiose claims around it, but it has passed every single ‘eye-test’ I have submitted it to so far. Might be worth investigating it more thoroughly as this content evolves on the site. For now though, I’ll leave you with a link to our little taster video for this type of content.

If you like this sort of content then you’ll like what is to come from us over the next year. Make sure to follow the team on twitter: Martin, Bobby and David.

Sunday, 14 May 2017

Paper: 'Flow Network Motifs Applied to Football Passing Data'

I wrote this paper to be presented at the MathSport 2017 conference at Padua University in June. It's a bit heavy on the mathematics in chapter 2 but should be fairly readable from there on. Here's the abstract so you can tell whether you're interested in reading it before opening the whole document:

"Network Motifs are important local properties of networks, and have lately drawn increasing attention as promising concepts to unearth structural design characteristics of complex networks. In this document, we push the boundaries of the existing body of literature which has used this theory to study soccer passing networks by attempting to uncover unique team passing network structure, and make a rigorous attempt to formalise a theoretical framework in which to carry out and evaluate these analyses. We contribute to the existing body of knowledge by proposing a framework based on repeatability in which to establish the ideal length of flow motifs with which to study soccer passing networks, and also by considering spatial classifications of flow motifs to achieve greater precision in our claim to discover unique team passing network style."

Anyways, here's the link to the pdf on Google Drive:

Thursday, 13 April 2017

What's the ideal length of passing motifs?

In my last entry I pledged to answer this question using the 'repeatability' methodology I presented there. This will be a quick entry to confirm that luckily, we've been right all along and 3 is the ideal length to consider for passing motifs.

The number of passes considered in a passing motifs analysis is a clear instance of the trade-off between detail and comparable structure that we discussed in the previous entry. The Figure below shows the number of motif types that occurred in the 2015-16 season of the Premier League depending on the number 'k' of passes we choose as 'length' from 3 to 7:

When we choose to consider 3 passes, there are 5 motif types which I hope all my readers know by heart by now: ABAB, ABAC, ABCA, ABCB and ABCD. If we choose to go up to 4 and consider one extra pass, there are 15 different types: ABABA, ABABC, ABACA, ABACB, ABACD, ABCAB, etc.

For 5-passes long motifs there are 52 types (all of which occurred at some point in the 2015-16 season), and for 6-passes long motifs there are 203 (of which only 187 types occurred in the 2015-16 data). There were 759 different types of 7-pass motifs in the data. We can appreciate how the number of motif types grows quickly with the number of passes we are considering, which precisely lends itself to losing structure in the noisy haze of excessive detail.
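The counts of possible motif types quoted above (5, 15, 52, 203) can be reproduced by brute force. The sketch below relabels a sequence of passers as A, B, C... in order of first appearance and enumerates every possible type for a given number of passes, assuming only that a player cannot pass to himself. The function names are illustrative.

```python
def motif_type(passers):
    """Relabel a sequence of players as A, B, C... in order of first appearance."""
    labels = {}
    letters = []
    for player in passers:
        if player not in labels:
            labels[player] = chr(ord("A") + len(labels))
        letters.append(labels[player])
    return "".join(letters)

def all_motif_types(k):
    """Every possible motif type for k passes (k + 1 ball touches)."""
    motifs = ["A"]
    for _ in range(k):
        extended = []
        for motif in motifs:
            n_used = len(set(motif))
            # The receiver is any player already involved, or one new player,
            # but never the current ball carrier.
            for i in range(n_used + 1):
                letter = chr(ord("A") + i)
                if letter != motif[-1]:
                    extended.append(motif + letter)
        motifs = extended
    return motifs

print(motif_type([8, 6, 8, 10]))  # hypothetical shirt numbers
print([len(all_motif_types(k)) for k in range(3, 7)])
```

Note that this counts the types that are possible, not the ones that occurred: the 187 six-pass and 759 seven-pass figures above come from the data, not from the enumeration.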

The Figure below shows the number of motifs for the 2015-16 season for each number of passes 'k' considered:

There were 138,432 3-pass long sequences compared to 45,820 7-pass long sequences. While there are considerably fewer of the latter, the data set is still of a decent size to believe that we can extract meaningful conclusions.

Finally, the figure below shows the repeatability percentage as per the methodology of the previous entry for each choice of length from 3 to 7 (as before, we consider relative frequencies of the different categories rather than raw amounts):

3-pass long motifs have a repeatability of about 82.7%; while 6 and 7-pass long motifs have 57.8% and 52.3% respectively. Considering that as per our methodology random methods that carry no structure would have 50% repeatability (equivalent to randomly assigning style), these figures mean that by then we've lost almost all structure.

It's an interesting conclusion that 3 passes is the ideal length to consider, and that these short sequences carry unique team structure better than longer ones. Considering that passes constitute the vast majority of events on a football pitch, it's a far-reaching conclusion. It provides insight into breaking down the sequentiality of football matches into representative constituent blocks: looking at blocks of about the size corresponding to 3 passes should be best practice.

Wednesday, 5 April 2017

Passing Network Autographs and Overshooting Style

At the end of the last entry I touched on the trade-off between comparable structure and ‘granularity’ or ‘level of detail’ of football data. Imagine this: you have a player who has the ability to pick a certain type of between-lines pass that greatly increases your team’s chance of scoring in that play. With passing data, we could try and identify how this pass is represented in the data and then use the data to identify other players in the world that are also good at making this kind of pass. To do this, we will need to break the data down in a detailed way, and differentiate this type of pass from other vertical passes that perhaps aren’t as effective. We might want to consider where in the pitch this type of pass comes from or where it finishes, what action it is preceded by, where the defenders are, what happens after the pass is made, etc.; all of this in the hope that we can clearly identify the type of pass we’re looking for and tell it apart from other passes. However, what happens when we go too far and use too much detail? It is unlikely that each time our player performs this type of pass he does it in exactly the same way, in the same coordinates of the pitch and in the same conditions. If we start using too much detail, we might actually start classifying instances of this type of pass as different when in reality they correspond to the same sought-after ‘type’. Once this happens we are no longer capable of identifying the “type of pass” we were initially looking for, and now have hundreds of different passes that at this level of detail are all different from each other. We can no longer identify players who can play this pass because this “type” was obliterated in the detail.

I actually began thinking about this issue when reading Dustin Ward’s piece on clustering different types of passes. He decides to take 100 clusters or types of passes and see how often each team or player completes each of those types of passes. This is a good example of the trade-off mentioned above. 100 seems like a good number: it certainly reveals more info about a team than if we considered just 1 (which would basically be like looking at overall Pass Success percentages) or 2 types/clusters of passes.

Choosing 100 is also better than choosing 100,000. If we chose 100,000, then each player or team would perform at most 1 instance in a season of each highly detailed, highly differentiated type of pass. We wouldn’t be able to use this information to compare teams or players in any way. But is choosing 100 better than choosing 120? Or than choosing 80? How do we know when this trade-off is striking the right balance?

The key is having something against which to measure ‘balance’, something we want to optimise. In this entry I’ll show you an example of how this something could be ‘repeatability’:

For a while I’ve been wanting to push the Passing Motifs methodology a bit further and include some spatial information about the passes to see what else it can tell us about teams’ passing networks. Below is an example of two very different instances of ABAC.

The question I wanted to answer was this: will we gain any additional valuable information about teams by differentiating different ‘types’ of motifs according to their angles, distances and coordinates on the pitch? Crucially, I also wanted to know where the right balance would be when doing this differentiation in light of our structure-detail trade-off.

There are two ways of looking at spatial variables associated to motifs that I felt could be revealing:

x-y Coordinates of Passes: In Opta’s data files, each pass has a ‘Start x-y’ Coordinate and an ‘End x-y’ Coordinate, meaning each pass has four variables in terms of coordinates. A 3-pass long motif would therefore have a set of 12 variables representing where its passes began and ended.

Angles and Lengths: Another way of looking at it is by the ‘angles’ and ‘lengths’ of passes in a motif. The figure below illustrates how these are found.

With this idea we would have six variables associated with a motif: the angle of each of the 3 passes of the motif and the length of each of the 3 passes as well.
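Both sets of variables are easy to compute from the start and end coordinates of each pass. In the sketch below the angle is measured against the pitch's x-axis with `atan2`; the figure defining the angles is not reproduced here, so that convention is an assumption, as are the function names and the example coordinates.

```python
import math

def xy_features(passes):
    """12 variables for a 3-pass motif: start x, start y, end x, end y per pass."""
    return [value for start, end in passes for value in (*start, *end)]

def angle_length_features(passes):
    """6 variables for a 3-pass motif: the angle of each pass, then each length."""
    angles, lengths = [], []
    for (x0, y0), (x1, y1) in passes:
        dx, dy = x1 - x0, y1 - y0
        # Angle in degrees relative to the x-axis (assumed convention).
        angles.append(math.degrees(math.atan2(dy, dx)))
        lengths.append(math.hypot(dx, dy))
    return angles + lengths

# A made-up 3-pass motif as ((start_x, start_y), (end_x, end_y)) pairs.
motif = [((0, 0), (3, 4)), ((3, 4), (3, 10)), ((3, 10), (0, 10))]
print(xy_features(motif))
print(angle_length_features(motif))
```

Shifting every coordinate of the motif by the same amount leaves the angle and length features unchanged, which is the "doesn't care where on the pitch" property.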

NOTE: The thing I like about this ‘angles+lengths’ idea is that it doesn’t “care” where in the pitch a motif happened, only its geometric structure. I like this because if it has ‘structure’ or ‘insight’ into teams’ styles it will not be as heavily determined by whether the team dominates the opposition or not: if we only look at pitch coordinates of motifs then top teams like Chelsea or Manchester City will perform all of their motifs high up the pitch. Therefore, the method would be biased towards saying they perform the same ‘types’ of motifs, namely “high up the pitch” motifs. I’m not saying that this isn’t meaningful, but it is information we all know by simply looking at the league table and knowing these teams play deeper into their opposition’s half. However, if we discover structure that is independent of the league table from the geometric shape of motifs, it makes it interesting in the sense that perhaps it wasn’t correlated with "obvious" aspects.

Whichever of the two ideas we go for, there is going to be a set of variables associated to each motif, which we can then use k-means clustering to classify into a certain number of different types of motifs. Our intuition from the trade-off tells us that there is an intermediate number of categories that has the best representation of “style”. The problem is that to use a k-means clustering algorithm, we need to manually tell the algorithm how many different categories we want before knowing this optimum number.
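For readers who have not met k-means before, here is a bare-bones sketch of the algorithm, showing why the number of categories `k` has to be supplied up front. It is a plain-Python toy with a deliberately simplistic, deterministic starting point; a real analysis would more likely use a library implementation with random restarts, and the example points are invented.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    n = len(points)
    # Simplistic deterministic start: k evenly spaced input points (an assumption).
    centroids = [tuple(points[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each motif goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # nothing moved, so the clustering has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up two-variable "motif features" forming two obvious groups.
motif_features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(motif_features, k=2)
print(labels)
```

The same call with a different `k` will happily return that many categories, whether or not they are meaningful, which is exactly why a criterion such as repeatability is needed to choose it.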

Consider this: for each choice of number of categories, once we have determined the number of categories and classified the different motifs into the category they correspond to, we can use the best practice we know from the original passing motif methodology and look at what percentage of each motif category (in the ABAC sense) corresponds to which ‘type’ (in either an x-y coordinate or an angles+lengths sense). So as an example, if we had chosen to have 3 different types of motifs, then for each team we would have this set of numbers: what percentage of the team’s ABAB motifs are type 1, what percentage of the ABABs are type 2, what percentage of the ABABs are type 3, what percentage of the team’s ABAC motifs are type 1, what percentage of the ABACs are type 2, etc. What we’ll have is a vector representing each team.
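In code, building that vector could look something like the sketch below. The input format (a list of motif and type pairs per team) is an assumption, and types are numbered from 0 here rather than from 1 as in the text.

```python
from collections import Counter

MOTIFS = ("ABAB", "ABAC", "ABCA", "ABCB", "ABCD")

def team_vector(classified_motifs, n_types):
    """Share of each motif category falling into each spatial type.

    classified_motifs: list of (motif, type_index) pairs for one team.
    Returns a vector of length 5 * n_types.
    """
    pair_counts = Counter(classified_motifs)
    motif_counts = Counter(motif for motif, _ in classified_motifs)
    vector = []
    for motif in MOTIFS:
        total = motif_counts[motif]
        for type_index in range(n_types):
            share = pair_counts[(motif, type_index)] / total if total else 0.0
            vector.append(share)
    return vector

# A tiny made-up sample: three ABABs (two of type 0, one of type 1), one ABAC.
sample = [("ABAB", 0), ("ABAB", 0), ("ABAB", 1), ("ABAC", 2)]
vector = team_vector(sample, n_types=3)
print(vector)
```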

Now suppose we randomly divide each team’s motifs into two different sets, so now we have Arsenal’s A motifs and Arsenal’s B motifs as if we were artificially considering each as the motifs of different teams. If choosing this number of categories reveals teams’ structure or style, then the style attributed to Arsenal’s A vector should be very similar to the style attributed to Arsenal’s B vector. The more underlying structure we’re capturing, the more this effect should be obvious. If on the other hand we’ve gone too far and now the extreme detail is overshooting the underlying structure we want to discover, then Arsenal’s A vector will not necessarily be similar to Arsenal’s B vector because the extreme detail is damaging the comparability of styles. This is what I mean by “repeatability”.

The following graph reveals how “repeatable” each choice of number of categories is for both the ‘x-y coordinates’ idea and the ‘angles+lengths’ idea:

The methodology is as follows: for each number from 2 to 50, we create that number of motif categories using k-means clustering and assign each motif to a category. We then randomly divide each team’s motifs into two different sets to have a vector A and a vector B for each team. Then we check how “repeatable” the methodology is by checking on average how close teams’ A vector is to their B vector in comparison to the rest of the vectors representing other teams; and this process (since it involves both the randomness of a k-means algorithm and the division of a team’s motifs into two sets) is repeated a hundred times for each number. The graph shows as a percentage the average ‘relative closeness’ over the hundred trials as follows: I took each team’s two vectors and determined on a scale of 1 to 39 how close a team’s A vector was to its B vector. Since there are 20 teams and I divided each one into two vectors, there will be a total of 40 vectors representing 'styles'. Considering as a focal point a team’s A vector, its B vector could either be the closest of the other 39 vectors (1), the second closest (2), all the way up to the farthest away (39). I did this for every team and averaged these numbers, to finally compute the percentage that the outcome was of 39 (this was done using passing data for the 2015-16 Premier League season).
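The ranking step can be sketched as follows. Two things are assumptions here rather than details taken from the methodology: closeness is measured with Euclidean distance, and the average rank is converted to a percentage on a linear scale where rank 1 gives 100% and the worst rank gives 0% (so that a method with no structure lands around 50%). The team names and vectors are made up.

```python
import math
import random

def random_halves(motifs, rng):
    """Randomly split one team's motifs into an A set and a B set."""
    shuffled = list(motifs)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

def partner_rank(team, halves):
    """Rank (1 = closest) of a team's B vector among all other vectors, seen from A."""
    vec_a, vec_b = halves[team]
    partner_distance = math.dist(vec_a, vec_b)
    rank = 1
    for other, pair in halves.items():
        if other == team:
            continue
        rank += sum(1 for vec in pair if math.dist(vec_a, vec) < partner_distance)
    return rank

def repeatability(halves):
    """Average partner rank rescaled to a percentage (assumed linear scale)."""
    worst = 2 * len(halves) - 1  # 39 when there are 20 teams
    mean_rank = sum(partner_rank(team, halves) for team in halves) / len(halves)
    return 100.0 * (worst - mean_rank) / (worst - 1)

# Made-up (A vector, B vector) pairs for three hypothetical teams.
halves = {
    "Team X": ((0.9, 0.1), (0.88, 0.12)),
    "Team Y": ((0.5, 0.5), (0.52, 0.48)),
    "Team Z": ((0.1, 0.9), (0.12, 0.88)),
}
print(repeatability(halves))

set_a, set_b = random_halves(["ABAB", "ABAC", "ABCA", "ABCD"], random.Random(0))
```

In a full run, `random_halves` would be applied to each team's classified motifs, each half turned into a vector, and the whole procedure repeated and averaged over many trials.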

Right off the bat we find evidence of the balance we’ve been speaking of. When we start increasing the number of categories we start obtaining more repeatability, meaning we can more closely recognise two vectors as being the A and B vectors of the same team because they are similar (i.e. close) to each other. I like to interpret this as uncovering more underlying information that uniquely identifies a team’s passing network style: no matter how we randomly divide a team’s motifs into two sets, we roughly still know which sets correspond to the same team because we know the “style”. We then reach an optimum number of categories for which this repeatability is optimised: for the ‘x-y coordinates’ idea it’s at 9 and for the ‘angles+lengths’ idea it’s at 13. After this, the repeatability starts to decrease meaning that a team’s A and B vectors start to not be as similar to each other because they’re made up of highly detailed motifs that are overshooting the underlying “style” of what it actually is that a team inherently does with its passing networks.

We have answered our initial question: The original passing motif methodology (found in this entry) in which we took the 5 different motifs and compared teams according to how much they relatively used each motif had about 83.3% repeatability as per our methodology. By breaking motifs down into an optimum number of categories for the ‘x-y coordinates’ and ‘angles+lengths’ ideas (9 and 13 respectively), we were able to increase our repeatability to 94.3% and 84.4% respectively (evidently the 'x,y Coordinates' has better repeatability than 'Angles+Lengths', but as we said before any structure from a purely geometrical classification is interesting).

Below is a set of boxplots illustrating what the 9 categories represent in terms of the different 'x-y' coordinates:

As an illustration, if you look at categories 4 and 8, they both begin a bit past the halfway line really close to the left touchline, but while in Category 4 motifs the three-pass sequence ends a bit further up but still on the left hand touchline, the Category 8 motifs made their way across the pitch to finish closer to the right hand touchline.

The 94.3% repeatability of the 'x-y coordinates 9 category' vectorisation is incredibly high. In fact, if we remove Sunderland and West Bromwich which for some reason only have 80.2% and 83.5% repeatability respectively, the other teams have an average repeatability of 95.7%!

These results mean that we’ve managed to pin down an underlying structure in teams’ passing networks that allows us to identify unique team styles (let’s call them "Passing Network Autographs") with a high degree of confidence. We’re at the point where, given a set of motifs, we could make a robust educated guess at which team they correspond to and most likely be right (except perhaps if they belong to West Brom or Sunderland for some reason). As an example, below is a comparison of Arsenal's autograph versus Bournemouth's (the team whose 'autograph' most differs from Arsenal's):

Perhaps some readers might be unimpressed with this rather theoretical and un-applied result, but although I admit that in its raw form this seems a bit unmanageable, I would advise them to keep an open mind and think of the potential. For example, having such a reliable ‘passing network autograph’ for teams, we can look through players from outside the Premier League and find those whose current passing network best fits within a team’s autograph. We could also use our measure of team style to try and predict which styles are more effective against each other, or which defenses are the best at interrupting the attacking flow through a team’s passing network. These possibilities probably sound more appealing to most readers, but in order to do them in a meaningful way they must be underpinned by theoretical confidence that we are indeed identifying team styles. I will try and follow up this theoretical entry with a more applied one exploring some of these possibilities later this month. 

I want to finish this entry off by highlighting the important potential for generalisation these ideas have. I feel they’ve helped me establish best practice when it comes to breaking passing motifs into different categories according to their spatial properties (and by best practice I mean knowing how many categories I should break them up into); but the method can also be used to determine best practice in other ideas currently being explored by football analysts. For example, during my Opta Forum Presentation’s Q&A, Marek Kwiatkowski asked whether the passing motif methodology could be generalised to motifs of more than 3 passes. The answer is that it can, but we run the risk of going too far and overshooting the structure that the methodology helped us identify as team and player passing style: for 3-pass long motifs we had 5 motif types, while only going up to 5 or 6-pass long motifs we’re already at 52 and 203 types respectively with wacky things like ABCADBA. The ideas presented here can help us answer the question of whether it’s worth looking at longer motifs (another entry soon perhaps?). They can also help Dustin Ward to establish exactly how many types of passes he should consider. In general, they help us to establish standardised best practice that the whole of football analytics will benefit from and that it’s currently distinctly lacking. Echoing Marek’s piece on the state of analytics: “Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics”. We need common-ground theory in which our public work can be related and compared, and its worth truly understood. The lack of it is holding back all of us who have an active interest in the field really taking off.
I hope this approach to improve our understanding of our ideas and take steps towards enhancing them and establishing best practice can inspire other public (and even private) analysts to attempt similar things in their work and establish bridges through which we can compare and complement our work. Valuable applications will inevitably flow from robust, interconnected theory.
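
To make the growth in motif types mentioned above concrete, here is a minimal sketch that enumerates the patterns for a given number of passes. It assumes the usual convention that players are labelled A, B, C, ... in order of first appearance and that a player never passes to himself; the function name `motif_types` is made up for illustration.

```python
def motif_types(n_passes):
    """List every motif pattern for a sequence of n_passes passes.

    Players are labelled in order of first appearance, and consecutive
    letters must differ because a player cannot pass to himself.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    patterns = ["A"]
    for _ in range(n_passes):
        extended = []
        for p in patterns:
            n_used = len(set(p))
            # The receiver is any player seen so far (except the passer)
            # or exactly one previously unseen player.
            for c in letters[: n_used + 1]:
                if c != p[-1]:
                    extended.append(p + c)
        patterns = extended
    return patterns


for n in (3, 5, 6):
    print(n, "passes:", len(motif_types(n)), "motif types")
```

Under these assumptions the counts for 3, 5 and 6 passes come out as the 5, 52 and 203 quoted above, and ABCADBA shows up among the 6-pass patterns.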

MATHEMATICAL FOOT-NOTE:  Comparing the distance between A and B vectors as their position from 1 to 39 on the closest-farthest away scale may seem a bit unorthodox and one might consider simply using the z-score of the distance between teams’ A and B vectors in the context of all the distances between all 40 vectors. However, the reason I don’t do this is that for each different choice of the number of categories, the dimension in which these vectors are is different, and on a personal mathematical note I have deep mistrust in comparing distances between things that are in different dimensions.
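
The rank-based comparison in this footnote can be sketched as follows. This is only an illustration on made-up 2-dimensional vectors, assuming Euclidean distance; the helper `partner_rank` is hypothetical, and the step that turns these positions into a repeatability percentage is not reproduced here.

```python
import math


def partner_rank(vectors, i, j):
    """Position (1 = closest) of vector j among all other vectors,
    ordered by distance from vector i."""
    others = [k for k in range(len(vectors)) if k != i]
    others.sort(key=lambda k: math.dist(vectors[i], vectors[k]))
    return others.index(j) + 1


# Toy data: three "teams", each with an A vector and a B vector.
vectors = [
    (0.0, 0.0), (0.1, 0.0),   # team 1: A, B
    (1.0, 1.0), (1.0, 1.2),   # team 2: A, B
    (3.0, 0.0), (3.0, 0.5),   # team 3: A, B
]
pairs = [(0, 1), (2, 3), (4, 5)]

ranks = [partner_rank(vectors, a, b) for a, b in pairs]
print(ranks)
```

Because a rank only depends on the ordering of distances, it stays comparable when the number of categories (and hence the dimension of the vectors) changes, which is the point of the footnote.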

P.S.: I want to give a brief mention to Ben Torvaney who gave me a small but meaningful contribution which I feel greatly enhanced the results of this entry.

Tuesday, 14 March 2017

VIDEO: OptaPro Forum 2017 Talk on Passing Motifs

A talk on some of my passing motifs work was selected for the OptaPro Forum 2017 which took place in Central London on February 8th. You can see the full video (except the Q&A) below:

Most of the content I'd already written entries about: For the general overview on the passing motifs methodology for teams read this entry. For my applied results on teams from the Premier League (and a take on what the hell happened with Leicester) read this and this entry. Below are the images for the hierarchical clustering dendrograms and the principal component analysis graphs of these 5-dimensional representations of Premier League teams for the 2014-15 and 2015-16 seasons.

Moving on to a player level, you can read up on the general methodology in this entry, and then on the specific scoring system that I presented at the forum to create those lists of players in this entry. Below are some of the lists I had on display which are respectively: Premier League 15-16 using 'Key Passes' to award points, Bundesliga 15-16 using 'Key Passes' to award points and finally Premier League using 'Expected Assists (xA)' to award points. For that last one I used Opta's xA numbers which account for the probability of a pass turning into a shot with a certain xG value.

Finally, I also did a bit on using Topological Data Analysis (TDA) to explore the results for players, which I hadn't done before; to read up on the general methodology of TDA you can read this entry (wow how things have changed since that entry! I of course now know that Opta doesn't really log 'controls with left thigh'. Don't be fooled by how assuredly I wrote about the analytics industry back then, I honestly didn't know half of what I know now about that world just 10 months later... hopefully my future self in another 10 months will also look back with pity at my current self's ignorance).

Below is the image from the forum:

Finally, I want to use this (non) write-up on the presentation as a platform to discuss some more general reflections about analytics. 'Operationalising' is a hideous sounding word which was horribly difficult to repeatedly say in front of 300 people; but it actually is very important. There is so much complexity in raw football data that those of us who do analytics really need to broaden our scope when thinking how we will represent this raw information in numbers, vectors or variables that will help us uncover the rich underlying information that is there for the taking. The 'passing motif' operationalisation of raw passing network data is 'neat'; it seems harmless when you first see it and I wouldn't blame you for doubting that those 5/45 numbers attributed to teams/players will actually say much about them, but evidently they do. I think that what it's got going for it is that it helps to account for the sequentiality of the raw events, something which most of the work I encounter out there fails to do. As I said in the presentation, we're a bit too focused on events when it's the sequences of these events that actually matter.

There is a classic problem though (akin to the overfitting problem of modelling) when trying to account for larger and larger sequences: if we become too granular and, for example, apply this methodology to 10-pass long motifs rather than 3-pass long motifs, then the occurrences will become so specific that we actually lose comparability in the structure of our information. Alexis would have such a specific distribution of highly differentiated sequences that he would have no neighbours to reward him for their key passes! We need to strike a balance between sequentiality and, let's call it, "non-granularity" (this was actually one of the questions at the forum: can/should the methodology be generalised for more than 3 passes?).

Finding the correct concepts that strike this balance is the challenge of analytics. Passing motifs are "neat"; but even I recognise that they are nowhere near the ambitions of what I would hope to achieve in analytics. Exciting years to come!

Monday, 16 January 2017

Player Vectorised Representations: What player lists can we draw up with confidence?

I love drawing up lists and rankings of players (who doesn’t?) and giving myself a big “confirmation bias” pat on the back when I see players on the list whom I like, while casually either dismissing the players I don’t particularly rate as a mistake of the method or updating my bias about them. However, the very exercise of drawing up lists and rankings can be misleading for the probabilistically-illiterate because it seems to imply set-in-stone certainty about who the best player is, who the second best is, etc.; and this rigid numbering masks the underlying concepts of probability. And yet, drawing up player lists is key for the recruitment workflows of clubs, be it in drafts or transfer windows, or even just to set up a schedule for their scouts. You definitely don’t have to see the rankings as set in stone, but I can imagine clubs would definitely want to have things like 15-man shortlists with 2 or 3 ‘favourite choices’. In this entry I’m going to show you a couple of lists I drew up and how we can go about our list-making with confidence with vectorised representations of players.

I drew up lists for this entry using the player passing motif ideas from previous entries. The passing motif methodology produces a vectorised representation of players, which basically means that each player is represented by a vector of numbers. In the passing motif methodology I’ve used so far, the vector representing each player has 45 entries or numbers. The key conceptual bit is that when you have this type of vectorised representation, you can imagine each player as being in a “space” of some sort. To imagine it, suppose that instead of 45 you simply had 3 numbers representing each player, something like age, height and weight. If this was the case, you could imagine each player as being represented by a dot in a 3-dimensional space much like your living room. Some players would be closer to others, some would be farther away. Perhaps all the senior, tall and heavy centre-backs are located around your TV, while the shorter and lighter second strikers are hovering around your dining room table. This is just how I conceptualise the result of the 45-dimensional passing motif methodology. It makes it more abstract to picture, but just as in the 3-dimensional case, there are distances, certain dots closer to each other or concentrated around certain areas, etc.

The list I drew up basically took all the players who had at least 18 appearances in last year’s Premier League, and gave them “points” according to how many key passes they made AND how many key passes the players around them made. The closer to a player you are, the more “points” his amount of key passes awards you; the farther away the less. I tried this out in a few ways but that’s the basic idea. The idea is that if you happened to make few key passes in the season but all the players whose motif vector is close to yours made a lot, you should still have a high score. If the information contained in the motif vectorisation is at all useful to recognise players with creative potential, then the best scored players should in a way be the best creative passers in a more profound way than simply looking at the Key Passes Table. The question is precisely, how do we know the vectorisation’s layout of players has anything to do with their “key passing ability” (i.e., players with high ability cluster around certain areas of this “space” and are in general closer to each other)? Let’s look at the list before we begin to answer this question so everybody gets a bit excited before it dawns on them that I’m actually rambling on about some technical stuff.

Remark: Notice how this list isn’t strictly correlated with key passes. Drinkwater is better ranked than Eriksen even though the latter had many more key passes. This means that if the list is sound (big if), it’s picking up on information that wasn’t immediately and explicitly available in the key passes tables.
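
For readers who prefer code, here is a minimal sketch of the kind of scoring described above: each player collects points from his own key passes plus the key passes of everyone else, discounted by distance. The 1 / (1 + distance) weighting is an assumption chosen purely for illustration (several weightings were tried for the actual lists), the players and numbers are made up, and the vectors are 2-dimensional rather than 45-dimensional.

```python
import math


def neighbour_scores(vectors, key_passes):
    """Score each player by distance-weighted key passes of all players.

    Assumed weighting: a player at distance d contributes kp / (1 + d),
    so the player himself (d = 0) contributes his full key-pass count.
    """
    scores = []
    for vi in vectors:
        total = 0.0
        for vj, kp in zip(vectors, key_passes):
            total += kp / (1.0 + math.dist(vi, vj))
        scores.append(total)
    return scores


# Hypothetical players: P1 has few key passes but sits next to two
# prolific passers; P4 has more key passes but is isolated.
names = ["P1", "P2", "P3", "P4"]
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
key_passes = [20, 90, 85, 40]

scores = neighbour_scores(vectors, key_passes)
for name, score in sorted(zip(names, scores), key=lambda t: -t[1]):
    print(name, round(score, 1))
```

In this toy setup P1 ends up ranked above P4 despite having half as many key passes, which is the same kind of effect as the Drinkwater and Eriksen remark.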

My confirmation bias seems to like that list quite a lot; there are a lot of good names up there. Most readers probably follow the Premier League closely and know that those are all good attacking creative players, arguably the best in the league. Now imagine that instead of the Premier League, we drew up an equivalent list using data from leagues where we didn’t already know the players, and had confidence that just as in the case of the Premier League, we were definitely getting out a list of most of the best players. Should be useful, huh?

There are also some notable absentees. Coutinho comes to mind as a player who is widely agreed to be amongst the best in the league but isn’t on the list. Why should we trust a list that claims to rank the top 15 creative players in the league but leaves out Coutinho?

As I said before, I think of the vectorised representation as encoding the information regarding players’ key passing ability if players who tend to have a higher number of key passes are more or less clustered together as opposed to randomly located mixed with all the other players. If this is a general trend, then we know that there is a relationship between a player’s key passing ability and his location in the 45-dimensional space we are imagining. Even if a player happened to not have many key passes in a season (this can happen just as strikers have goal droughts or perhaps because a player’s teammates don’t make good runs), we should still pick up on this “ability”.

What we would need then to justify our faith in the list is some sort of indicator which specifies just how “clustered together” players with a higher number of key passes are. There are many ways to approach this problem in mathematics. For those readers who have mathematical backgrounds we could try to fit a model and assess the goodness of fit, or apply some sort of multi-dimensional Kolmogorov-Smirnov technique comparing the actual distribution of vectors and key passes with one where the key passes were distributed randomly. However, all these tests are a bit technical and hard to apply in high dimensions, and all in all we really want an indicator more than a model of “Expected Key Passes”. Here’s a simpler validating technique:

For each player, take his K (in my list K=10) closest neighbours and compute the standard deviation of their key passes. Once we’ve done this for every player, we can compute the mean of the standard deviation of key passes in each of these K-player “neighbourhoods” (let’s call it the ‘mean of neighbourhoods’ variation’ number, or MNV). If in each neighbourhood the players have a relatively similar number of key passes, then the MNV should be comparatively low. The important question is: what do we compare it to in order to know if it’s low or not?
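
A minimal sketch of this computation is below. A few details are assumptions rather than things stated in the text: distances are Euclidean, a player is not counted in his own neighbourhood, and the population standard deviation is used. The data is a made-up toy example with K=2.

```python
import math
import statistics


def mnv(vectors, key_passes, k=10):
    """Mean, over all players, of the standard deviation of key passes
    among each player's k nearest neighbours."""
    sds = []
    for i, vi in enumerate(vectors):
        # Every other player, ordered from closest to farthest.
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(vi, vectors[j]))
        neighbourhood = [key_passes[j] for j in others[:k]]
        sds.append(statistics.pstdev(neighbourhood))
    return statistics.mean(sds)


# Toy layout: two tight groups of three players each.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
           (5.0, 0.0), (5.1, 0.0), (5.2, 0.0)]
clustered = [10, 12, 11, 60, 62, 61]   # similar key passes sit together
mixed = [10, 60, 12, 62, 11, 61]       # same numbers, scattered

print("clustered:", mnv(vectors, clustered, k=2))
print("mixed:    ", mnv(vectors, mixed, k=2))
```

The same key-pass numbers give a much lower MNV when similar players sit close together, which is the behaviour the indicator is meant to capture.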

I feel that there are two important numbers to compare this number to. The first would come from simulating many (many) scenarios where the key passes are randomly permuted amongst the players and comparing the real MNV number to the average of these simulated cases. The second number would be the minimum MNV of any of the simulated scenarios. If the MNV of our actual vectorised representation is “low” in comparison to these simulated scenarios, then we know that the players’ layout in this imaginary 45-dimensional space clusters the key passers of the ball closer together than random distributions; which in turn would mean that the logic applied to obtain the list has a robust underlying reasoning because a player’s location in the 45-dimensional space should have something to do with his “key passing ability” (I fear I may have lost half the readers by this point…).
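
The comparison against random permutations can be sketched like this. It is a toy version under stated assumptions: Euclidean distance, population standard deviation, neighbourhoods that exclude the player himself, made-up 2-dimensional data and only 500 simulations; the function names are hypothetical.

```python
import math
import random
import statistics


def neighbour_indices(vectors, k):
    """For each player, the indices of his k nearest other players."""
    result = []
    for i, vi in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(vi, vectors[j]))
        result.append(others[:k])
    return result


def mnv(neighbours, values):
    """Mean of the per-neighbourhood standard deviations of values."""
    return statistics.mean(
        statistics.pstdev([values[j] for j in hood]) for hood in neighbours
    )


def permutation_baseline(neighbours, values, n_sims=500, seed=0):
    """Mean and minimum MNV after shuffling values amongst the players."""
    rng = random.Random(seed)
    shuffled = list(values)
    sims = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        sims.append(mnv(neighbours, shuffled))
    return statistics.mean(sims), min(sims)


# Toy data: two groups of ten players; the second group makes far
# more key passes than the first.
vectors = [(0.1 * i, 0.0) for i in range(10)]
vectors += [(10.0 + 0.1 * i, 0.0) for i in range(10)]
key_passes = [10 + i for i in range(10)] + [60 + i for i in range(10)]

neighbours = neighbour_indices(vectors, k=3)
actual = mnv(neighbours, key_passes)
sim_mean, sim_min = permutation_baseline(neighbours, key_passes)
print("actual MNV:", actual)
print("simulated mean:", sim_mean, "simulated min:", sim_min)
```

The neighbourhoods only depend on the vectors, so they are computed once and reused for every shuffle; that keeps a large number of simulations cheap.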

Here are some results:

Of the 100,000 simulations, the lowest MNV was 14.62 while the actual MNV is 11.86. This means that if we randomly assigned the players to a position in the 45-dimensional space 100,000 times, none of those simulations has the key passers clustered together better than our actual passing motif representation. This is quite promising, but even then, I suspected that maybe this is because the method clearly recognises the difference between defensive players and attacking players and attacking players are much more likely to get more key passes; so I repeated the validation using only attacking midfielders, wingers and strikers:

The results are less overwhelmingly positive, but even when just looking at attacking players, the actual layout surpasses any random distributions of the players after 100,000 simulations. To appreciate the value of this method and what information this is actually giving us, let’s compare with an equivalent list drawn up using ‘goals’ to award points rather than ‘key passes’ (using only attacking players again for the same reason as before).

The MNV numbers are naturally smaller because players score far fewer goals per season than they make key passes, so the overall scale of the problem is smaller. We can see that even though the real MNV is smaller than the average of the simulations, it’s actually relatively large when compared to the minimum MNV obtained through random simulations (notice how important it is to have a frame of reference to know when the number is small and when it is large in each specific context). This means that the position of players and goals in the 45-dimensional space can be clustered together through random simulations considerably better than using the passing motif vectorised representation. As opposed to the ‘key passes’ case, this vectorisation doesn’t encode much information pertaining to “goalscoring capability”. This actually makes sense though since the passing motif methodology is designed using only information from the passing network which doesn’t necessarily contain information regarding finishing. Therefore, the list made using ‘goals’ is much less reliable.

Coming back to Coutinho’s absence from the original list, it’s important to understand that I’m not claiming the list is a know-it-all oracle for creative talent or that this talent can be rigidly ordered. What this entry tried to show is that there is solid evidence that a player’s position in the 45-dimensional space determined by the passing motif methodology encodes a good amount of information which determines how many key passes he ought to have given a sort of “passing ability”. That doesn’t mean it encodes all the information. Perhaps the vectorised representation is missing out on what it is that makes Coutinho great. Nevertheless, once we’ve accepted and understood what the list will offer us, I doubt any club could claim that a list like this from different leagues from around the world is of no use to their organisation just because they might miss out on the Serbian League’s Coutinho (sadly, such is the ‘glass half empty’ prejudice that analytics face).

Finally, this way of looking at the problem of rating players opens the door to a host of possibilities. When I was doing my bachelor’s in pure mathematics I was actually more interested in differential geometry and topology courses than statistics courses, which is why I tend to think of data observations as vectors in high-dimensional spaces and think that their positions in those spaces encode valuable information. This entry began by taking a vectorised representation (passing motifs vectors) and established that if we look at the number of key passes each player made, the players’ vectors’ position in this space seemed to encode this info. On the other hand, it didn’t seem to encode the information pertaining to goalscoring. That isn’t to say it might not encode information regarding other metrics. Expected Assists maybe? It also doesn’t mean that other vector representations don’t encode some of this information better than my own passing motifs representation. It’s a bit of a 3 step thing really: 1. Find a vector representation, 2. Check what sort of information it seems to encode well (especially information that isn’t explicitly available elsewhere), and 3. Find a way to give players a rating using this fact.

I hope this way of thinking encourages other analysts out there to try their hand at this sort of work!