How a Former Minor League Baseball Player and a Neuroscience Student Are Redefining What Happens on the Soccer Field
Well, "some" of what happens. They still think Lionel Messi is way better than everyone else.
Black lives matter. This document has an exhaustive list of places you can donate. It’s also got an incredible library of black literature and anti-racist texts. Donate, read, call, and email your representatives. We’re all in this together.
Sam Goldberg had just been laid off by the Chicago Cubs due to the coronavirus pandemic. Michael Imburgio was working on his PhD in Cognition & Cognitive Neuroscience. They’d never met, never even heard of each other -- until Sam reached out after seeing something Michael wrote for the site American Soccer Analysis. A couple months later, they’d still never met and hadn’t even exchanged a phone call, but hundreds of emails back and forth produced a model that makes some bold claims about the soccer world: We can tell how valuable your players are; all you need is publicly available data. Oh, and you’re thinking about positions all wrong, too.
Meet “Determining Added Value of Individual Effectiveness including Style”, or rather: DAVIES. You already know about Goals Added -- and if you don’t, go read the interview I did with John Muller about the model built by American Soccer Analysis that figures out the value of every action a player takes, adds them up, and determines -- you guessed it -- how many goals a player adds to his team’s performance. Equipped with that info, Goldberg and Imburgio tried to create a model that closely mirrored the results of Goals Added but only used easily accessible data. They wanted their model to be useful to clubs that can’t afford more expensive data packages -- and they wanted people like you and me to be able to access it, too. Which we can; it’s all right here.
Given that it builds on the impressive achievements of Goals Added, DAVIES is the best publicly available player-value model I’ve seen. It also re-imagines the idea of positions on a soccer field. Based on their actions, the model sorts players into their roles within a team, rather than their positions. Central defenders are “Possession Oriented Defenders” or “Low-Block Defenders”. Deeper wide players are “Offensive Wide Progressors” or “Defensive Wide Progressors”. Central midfielders become “Offensive Central Progressors” and “Defensive Central Progressors”. And attackers get sorted into three buckets: “Finishers”, “Dribblers”, and “Playmakers”. The distinctions are useful for the model, and players get compared to other players with similar roles, but like any other notable analytical achievement, it can also slightly shift the way you look at the sport. It certainly did that for me.
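Sorting players into roles by what they do, rather than where they line up, follows the familiar pattern of clustering players on their action profiles. As a toy illustration of that idea only (not DAVIES’s actual method, features, or numbers), here’s a sketch that assigns a player to whichever hypothetical role centroid his standardized per-90 action profile sits closest to:

```python
import math

# Hypothetical role centroids over three standardized per-90 features:
# (shots, dribbles, key passes). All values are invented for illustration.
ROLE_CENTROIDS = {
    "Finisher":  (1.5, -0.2, -0.3),
    "Dribbler":  (0.1, 1.6, 0.2),
    "Playmaker": (-0.1, 0.3, 1.5),
}

def assign_role(player_profile):
    """Return the role whose centroid is nearest (Euclidean distance) to the
    player's action profile -- the spirit of role-based, rather than
    position-based, player groupings."""
    return min(
        ROLE_CENTROIDS,
        key=lambda role: math.dist(player_profile, ROLE_CENTROIDS[role]),
    )

# A shot-heavy profile lands in the Finisher bucket:
role = assign_role((1.4, 0.0, -0.2))
```

In a real pipeline the centroids would come from clustering thousands of player seasons; everything above is made up to show the mechanic.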
Goldberg is a former minor-league baseball player who spent time as an analyst with DC United before joining up with the Cubs, while Imburgio told me that, as a kid, he figured out a formula to convert college basketball stats into player ratings in the video game NBA Live so he could accurately recreate each year’s real-life draft prospects. The two of them hadn’t ever spoken on the phone until the three of us chatted recently. The convo below has been condensed and edited for clarity.
If you're talking to someone who is a big soccer fan who isn't a data person—clustering and regressions mean nothing to them—how would you describe the DAVIES project?
Sam Goldberg: Being able to communicate the results of a project to someone who has no math background whatsoever, which is 99.9 percent of soccer coaches, is way more important than the actual project, but I would explain it as: it's a player evaluation metric that accounts for a player's age and their style of play. It predicts a metric called Goals Added, which is an overall value of how many goals a player adds to their team over the course of a season, and then adjusts it based on similar players, grouped by play style and age. So you don't want to compare Lionel Messi to Edin Dzeko. Those are two different styles of player and they shouldn't be measured directly against each other.
Michael Imburgio: Yeah, I think that's pretty much the main point. A guiding principle of a lot of what we did was trying to build a way to estimate player value but also making it accessible. So we built on the really incredible work that a lot of the other people at American Soccer Analysis did in developing Goals Added, which is this all-encompassing measure of the number of goals a player adds. In order to actually calculate that, you need event-level data, and it's a pretty complicated calculation. So we wanted to do that, but simplified. In a way that you could just take publicly available data and do the same thing. Or as close as you could get.
How did you decide what sort of things to count, and how did you figure out how much all of these things counted for? Because what you're actually doing, right, is determining every action a player takes and how much it increases the likelihood of his team scoring, and then those things essentially get added together. How did you determine which things were worth what?
MI: First of all, a lot of this was built on the fact that we had an actual Goals Added number for MLS players because of what ASA did. So how we actually figured out what mattered was probably the most complicated part of all of this. We tried to figure out the best way to estimate the Goals Added number from more simple statistics. And then we looked at what actions it thought mattered, and we used those actions in a more simple way. And then trying to figure out how much those actions count for is really not something we have to do manually. It's more that we know what actions we want to use after that first part, and we let the computer weight them and figure out like, “Okay, number of touches in the box matters this much to actually estimate a real Goals Added,” and then we used those weights when we look at European players who do the same thing.
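What Imburgio describes, letting the computer find the weights that best reproduce Goals Added from simple counting stats, is at its core a regression. A minimal sketch of that idea using ordinary least squares; every stat, player, and number below is invented for illustration, and the real model is considerably more involved:

```python
import numpy as np

# Toy training data: rows are players, columns are simple per-90 stats
# (touches in box, key passes, progressive passes). All numbers invented.
X = np.array([
    [4.1, 1.8, 3.0],
    [1.2, 2.5, 6.1],
    [0.4, 0.9, 7.8],
    [5.0, 0.7, 1.5],
    [2.2, 1.1, 4.0],
])
# "True" Goals Added per 90 from the full event-data model (also invented).
y = np.array([0.21, 0.12, 0.05, 0.18, 0.09])

# Let the computer weight the actions: ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# Apply the learned weights to a player from a league with no Goals Added data.
new_player = np.array([3.5, 1.4, 2.8])
estimate = weights[0] + new_player @ weights[1:]
```

The key point matches the interview: the weights aren't set by hand, they fall out of fitting simple stats to the known Goals Added values.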
SG: On top of that, I think one thing that's common in competitions and published papers is that people always try to build the best model possible, but we published the results of a model that is not as good as it could have been, which I think is really important to note, because we needed it to be applicable to multiple data sources.
You guys include penalties attempted and penalties won in your model. But penalties attempted seems like something chosen by your manager. So what's the thinking behind that specific input and including that in the player-value model?
MI: Let's take Cristiano Ronaldo's last season. He had like 30 goals, but like 12 of them were from penalties or something like that. Those are rough estimates. By including the penalties attempted in the model, basically the model is able to tell, OK, a lot of his expected goal tally came from penalties, and so those expected goals should be weighted less.
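The adjustment Imburgio describes amounts to stripping the roughly fixed value of a penalty attempt out of a player's totals. A back-of-the-envelope sketch with invented numbers; the real model handles this inside the regression rather than by simple subtraction:

```python
# Rough, invented numbers in the spirit of the Ronaldo example.
penalties_attempted = 14
penalty_xg_per_attempt = 0.76   # assumed league-average conversion rate
raw_xg = 27.0                   # season xG including penalties (invented)

# Value attributable to being handed the spot kicks:
penalty_xg = penalties_attempted * penalty_xg_per_attempt

# What's left is closer to repeatable open-play output, so a model that
# sees penalties attempted can weight the penalty portion of xG less.
non_penalty_xg = raw_xg - penalty_xg
```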
SG: I think also, what made intuitive sense to me, is if you look at elite players in the world, elite forwards are always going to have the highest value. The way I explain it to American football fans is that your skill positions are always going to be paid more than your non-skill positions because they have to be more precise, so forwards are always going to score higher on these metrics because their precision is rewarded. So the way I thought about it is, for elite players on their teams, it's not that they're chosen by their manager, it's more so that they're the best option to take a penalty.
The same problem for me comes with how I always want to reward players for playing minutes. So I'm not a fan of per-90 stats at all. I think they're good when you're comparing players who have all cleared a minimum-minutes threshold, but a high-level MLS front-office employee once said to me, “If you play 2000 minutes in Brazil's Serie A, that probably means you're a good player and we don't really need to look at your other metrics”. And while I don't agree with that, the sentiment about playing minutes remains the same, and I think it's the same about taking penalties. If you take a lot of penalties, that should be included in the model because that's an important part of the game. How many penalties are there in a game? It must be—at least this rate—
Well now there's like 10 per game. You might need to change the model.
SG: Right, for obvious reasons I don't want to talk about it. But it's an important part of the game that does change the game, so I think it needed to be included in there. They're not weighted less than a normal goal, but the way that it's measured changes.
Did you guys consider doing anything like adjusting for possession and the values the players provide? Does a guy on, say, Watford, who had 40 percent possession or whatever, is he disadvantaged by this compared to a guy who's on a better team who has the ball more often and therefore provides more opportunities to do the things that go into the model?
MI: Keep in mind that not everything in the model helps people's ratings. There's some stuff where, yeah, they're doing a lot of it, but it might be hurting them rather than helping them. So for example, to bring up Ronaldo again, because we did look into him a lot: he has a lot of possession in the attacking third, but relative to that number of touches, he doesn't have a ton in the penalty area. And that actually causes his rating to go down a little bit, because he's doing more things but they're not necessarily valuable because they're farther away from goal. That applies to a lot of metrics in the model. So it's not necessarily just that more stuff means better. It's sort of a ratio like that in a lot of different ways. [Expected assists] and key passes, too. If your key passes are generally high in xA, your model rating is usually better. But if you're playing a lot of key passes where people aren't getting good shots off of them—they're just garbage shots, like at the top of the box with a bunch of people around or whatever—key passes might be hurting your model rating.
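The ratios Imburgio mentions are easy to picture as features: instead of raw counts, the model sees volume relative to context. A small sketch of two such ratio features; the function names and wording are mine, not the model's:

```python
def box_touch_ratio(attacking_third_touches: float, box_touches: float) -> float:
    """Share of a player's attacking-third touches that come in the penalty
    area. Lots of attacking-third touches with few box touches (the Ronaldo
    example) can drag a rating down."""
    if attacking_third_touches == 0:
        return 0.0
    return box_touches / attacking_third_touches

def xa_per_key_pass(xa: float, key_passes: float) -> float:
    """Average chance quality created per key pass; a low value means the
    key passes are mostly producing garbage shots."""
    if key_passes == 0:
        return 0.0
    return xa / key_passes
```

The point is that two players with identical raw key-pass counts can look very different once the quality per pass is part of the feature.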
SG: I have very strong feelings on possession adjusting, for whatever reason. I was all about it at first, but the more I thought about it, the less it held up; Mike and I have probably gone back and forth on this over 50-plus emails. For me, there's no guarantee that if you were on a team that has less possession and are now on a team that has more possession, you become a good player. Because there are teams where players fit the style precisely because they have less possession. Say you're a counterattacking team with really fast wingers; that team thrives on not having the ball. If you possession-adjust their stats, that's not a true reflection of how someone might play if they were possession-dominant, because then the defense has to drop a little bit deeper, and there's not as much space for wingers to run into. And so possession adjusting these stats is a detriment to the model itself because it's not a true reflection of what actually happened.
What surprised you most as you were kind of going through the process of building the model?
MI: Player value models often have a hard time dealing with defenders. Like very often they're biased toward attackers, and I think ours probably is a bit, too. And so is Goals Added, which ours is based on. And that's just a common thing for a lot of player value models. But what was cool was when we started making these tables of top 10 youth players, they weren't all attackers. Trent Alexander-Arnold does get recognized as like a top-three youth player from last season, as obviously he should be, but it's nice that the model learned enough to be able to recognize that a fullback's contribution can be as big as a striker's contribution. And then also Dwight McNeil is in the top 10 from last season's youth players, which I think is cool because his team didn't have a ton of the ball, and his value still came out. The model was able to basically overcome possible biases against defenders and against teams that aren't your classic top tier, to be able to recognize those diamonds in the rough.
SG: Also, the thing I think is worth noting is that these models are skewed toward attackers for good reason. It's important that they're skewed toward attackers, in the same way that we spoke about earlier with the skill positions getting paid more. You have to be more precise to score a goal than you do to clear a ball. And the fact of the matter is there are a lot of things that go into being a good defender (positioning, etc.) that are very hard to measure without tracking data. I think it is impossible to measure without tracking data. Looking at minutes played for defenders is a strong predictor of how good they actually are. But these models skew toward attackers because that's what wins games: scoring goals. And preventing goals is not as important as actually scoring them. It's the same in the NFL, the same in Major League Baseball, the same in the NBA. Those are the rules of the game and that's how it should be measured.
Is there a hierarchy among play styles? Playmaker, dribbler, and finisher are the three main attacking groups. Is there a hierarchy in terms of the raw value added among those positions?
SG: Yeah, I think the order went finisher, playmaker, dribbler. To me, intuitively, that also made sense. Finishers are going to get rewarded for scoring the most goals. Playmakers are going to get rewarded through xA, mostly, and if you think about it intuitively—it sounds so basic, but most people just don't grasp that the ball moves faster than the player. When you pass the ball, more things can happen than if you dribble it. And so even if that pass is not successful, necessarily, it still creates chances, whereas a dribble has an outcome of one or zero, and then another action has to happen afterward: a pass or a shot. That's why I think playmakers get rewarded more highly than dribblers. But that's not to say that dribblers don't serve an important purpose on a team.
Is there anything new about the sport of soccer that you learned based on this process? Any kind of new knowledge you now have about how the game works?
MI: I guess the stuff that I would say I learned most here is that it is possible to use these more "basic" stats to estimate value in a way that makes sense for all kinds of players. I think going into this I would've expected a much harder time with players who don't score a lot or aren't close to the goal very much. Learning that you can do this for defenders and midfielders was new to me.
SG: I think that's spot on. We would be nowhere without Goals Added. What they did was incredible. In my eyes, that's the first Wins Above Replacement for soccer that exists. For me, it was less about learning about the game and more about creating a tangible tool that teams can use to identify and understand players. I've worked with front offices before that have straight up looked at me and said data has absolutely no place in measuring players. And so I wanted to create something that was basically a little bit of a screw-you to that, because it's very hard to argue against Messi being the no. 1 player in the world. It's an indisputable fact, and this shows it. And so by including European leagues, like the Big 5 leagues, how can you argue against this?
Sam, how different in terms of data fluency and receptiveness toward numbers is the soccer world compared to the baseball world?
SG: “It's completely dependent” is the real answer. We'll put it this way. In the baseball world, baseball analytics is a billions and billions of dollars industry. Most teams use more analytics than most soccer people could even imagine. Whereas in soccer analytics, there are some teams that are getting to the point of baseball analytics, and there are some teams who operate with no data whatsoever. And to me with the data that's available in the world at the price it's available, it's kind of unacceptable to be operating without data. But then again, that's a choice that's made and we kind of have to deal with it. But baseball analytics is years ahead of soccer analytics and the fluency in baseball analytics is a lot better than it is in soccer analytics. But I don't want to take away from the soccer analytics world at all because baseball analytics is way easier to understand because it's an individual game. Baseball is not a team sport. It's an individual sport masquerading as a team sport. I think Mike said that in one of our conversations and it's true. Nothing really changes. Whereas in soccer, everything is dependent on the team. And so it's very hard to measure.
What kinds of insights from other sports have helped you as you've applied an analytical framework to soccer? Is there anything you can pull from other sports that informs the way you look at the game?
MI: There's a lot. I'm a big basketball fan, and basketball analytics have obviously blown up over the past 5-10 years or so. A lot of the stuff that basketball analytics does can be applied to soccer. It's just much harder, for a lot of the reasons that Sam brought up. Basketball has short, defined possessions and a much higher scoring rate, so it's much easier to tell what's good and what's bad, and so it's easier to do in basketball—although still probably extremely difficult. So, for example, Sam and I were talking about defining play styles, and people do this in basketball, and I think you can kind of follow suit in the way we do that. You can define a player as a 3-point shooter, or a big, or a rim defender in basketball, and being able to do something like that in soccer would open a lot of doors and is something I personally have a lot of interest in.
SG: Sports analytics is a lot like measuring the universe. You know what exists and you can devise a framework from what you know exists, but there's also stuff you don't yet know exists. In all sports analytics, there are three stages in terms of the approach a sport can take. The first is always player evaluation because it's the easiest to do, and that's what Moneyball did. They looked at walks. The next is opposition scouting and analysis through data, which then turned into the shift. And then the third and final stage of what we currently know is player development, which we're in the midst of right now in baseball. Like I said, player development is a billion-dollar-plus industry with the amount of tech that's included. I can't speak a lot to what specific teams do on that, but my college team alone has over $10,000 in tech, and we were a small Division III program.
I played Division I soccer and I didn't look at a stat once in four years.
SG: Right, exactly. Right now, soccer is pretty good in the player evaluation phase, pretty damn good in the opposition analysis stage, and almost nonexistent in the player development stage. What's weird about soccer, in my eyes, is that opposition scouting, step 2, is ahead of step 1. Teams can accurately scout other teams more efficiently than they can find players. And so that's something just worth noting. But those are the three stages, I would say. Stages 1 and 2 are definitely underway with a lot more improvement to come, because it's a never-ending cycle, but step 3 has a long way to go. There are no soccer players looking at slo-mo of how they kick the ball and adjusting body orientation based on various metrics. It just doesn't happen.
The idea of shot selection in soccer just feels like an obvious thing for strikers and attacking players to work on. Midfielders shooting from 25 yards even if you have space is probably not the right decision, so there do seem to be some kind of pretty actionable principles that analytics can provide.
SG: A video came out that kind of made analytics Twitter go wild (I think it was the Dallas Mavericks shooting from spots on the floor marked with expected-points values), and I floated it around with an MLS team. You can actually draw arcs on the field with expected-goal values, which is a better illustration than saying, “Hey, if you shoot from here, you only have a 10 percent chance of scoring. Maybe let's not hit that all the time.” Granted, that then becomes a man-management tool where you can say to your really good players, “Hey, if you feel comfortable, we think your value is actually higher than that,” and now that player feels really happy about the fact that you put more value in him than you do in the numbers. Whereas with other players, you can say, “Hey look, it's not a really good chance,” and they'll probably understand why.
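Drawing those arcs just means finding the distance at which some xG-by-distance curve crosses a threshold like 10 percent. A toy sketch with an invented decay curve; real xG models also use shot angle, body part, defensive pressure, and more:

```python
import math

def xg_from_distance(distance_m: float) -> float:
    """Very rough xG-by-distance curve (exponential decay), invented purely
    for illustration. Returns the modeled chance of scoring from this range."""
    return min(1.0, 1.2 * math.exp(-0.11 * distance_m))

# Find the arc radius (in meters, stepping by half-meters) where the chance
# of scoring drops below 10 percent -- the line you'd paint on the training
# pitch to say "shots from beyond here are usually bad ones."
radius = 0.0
while xg_from_distance(radius) >= 0.10:
    radius += 0.5
```

With these made-up parameters the 10 percent arc lands a bit past 22 meters; the useful part is the recipe, not the number.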
I'd like to see whether there is an analysis that could predict, on a team level, whether adding an "X"-type player to a forward line with a "Y"-type player is a good combination... like adding Griezmann to Barcelona or Luis Suarez to Atleti or Thiago to Liverpool... a predictor based on the right combination of player types.