Win Probability Analysis part 1: Mean Win Probability

Win Probability Analysis part 1: Mean Win Probability
Win Probability Analysis part 2: Win Probability Added
Win Probability Analysis part 3: MWP Modified
Check out the posts of R code after each release!
Link at bottom, along with more nflscrapR and RStudio guides.


Context:
This project is about the Win Probability (WP) metric from the nflscrapR package in R. It is a regression model (a generalized additive model, a flexible relative of logistic regression, per the nflscrapR paper linked below) that predicts win probability given some in-game statistics. The data, collected by nflscrapR, is from the NFL. All I did was average the WP calculation for every point in every game and throw it on some plots. Did I invent Mean Win Probability? Not even close. Am I the first to consider it? Very unlikely. Despite that, I have not seen anything about it except here (somewhat):
https://operations.nfl.com/stats-central/stats-articles/win-probability-models-for-every-nfl-team-in-2019/

Here is some literature about the WP calculations:
https://arxiv.org/pdf/1802.00998.pdf
https://statsbylopez.com/2017/03/08/all-win-probability-models-are-wrong-some-are-useful/amp/?__twitter_impression=true

I am somewhat surprised not to see Mean Win Probability (MWP) discussed anywhere online despite my research. I hope it can become part of the analytics toolbox, and that more in-depth analysis of WP comes about.


My Work:
I simply averaged the WP estimate for each team in each game. This gives a "win probability" for that team in that game. Summing this over a season gives an expected win total (WP-Expected). This isn't exactly their "win probability" for the game, since that is highly dependent on what they and their opponent have done throughout the game, on top of the random events that determine the outcome. Instead, I interpret it as how much a team "dominated" the game, or how likely they were to win at an average moment of the game.
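As a rough illustration of the averaging and summing described above, here is a minimal Python sketch. It is not the R code behind this post: the (team, game_id, wp) tuples and the function names are made up for the example, and the WP values are toy numbers rather than nflscrapR output.

```python
from collections import defaultdict

def mean_win_probability(plays):
    """Average the WP of every play for each (team, game) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (team, game) -> [sum of wp, play count]
    for team, game_id, wp in plays:
        entry = totals[(team, game_id)]
        entry[0] += wp
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def wp_expected_wins(plays):
    """Sum each team's per-game MWP over all of its games."""
    season = defaultdict(float)
    for (team, _game_id), mwp in mean_win_probability(plays).items():
        season[team] += mwp
    return dict(season)

# Toy data (hypothetical): one row per play, from that team's point of view.
plays = [
    ("GB", "g1", 0.50), ("GB", "g1", 0.70), ("GB", "g1", 0.90),
    ("MIN", "g1", 0.50), ("MIN", "g1", 0.30), ("MIN", "g1", 0.10),
    ("GB", "g2", 0.40), ("GB", "g2", 0.20),
]

print(mean_win_probability(plays))
print(wp_expected_wins(plays))
```

In the toy data, GB averages 0.7 in the first game and 0.3 in the second, so its WP-Expected over those two games is 1.0 wins.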


Motivation:
Defense-adjusted Value Over Average (DVOA) and Point Differential do not consider win probability, and thus weight every score and drive equally. These are common metrics for win prediction though: DVOA is used for season win predictions on FootballOutsiders.com; Point Differential is used in the formula for a season win expectation metric called Pythagorean Wins. Neither considers when clock management has heightened importance for coaching decisions, or when scoring events and explosive plays are less important for winning. On defense, teams with a large lead are more worried about preventing quick scores and defending the red zone. They play softer coverage, allowing the opponent easy but clock-consuming drives.
We do not have clear theoretical ideas of how often, or in which games, this has an impact. Teams do not prioritize maximizing DVOA or Point Differential, but rather win probability. If you follow @LeeSharpeNFL and his WP charts, you can see how many field goals throughout a game are worth approximately +0% in WP. A touchdown in a blowout also has a smaller effect than one in a one-score game.
An example: https://twitter.com/LeeSharpeNFL/status/1216156038518530053?s=20
Those plays affect DVOA and Point Diff (Pythagorean Wins) the same way regardless of game situation. It always bothered me that a season win expectation can change because a team gave up a basically meaningless FG. There is, though, a way to determine the importance of game situations to winning and to estimate that importance, which this project explores.
Another motive related to Point Diff has to do with the interpretation of one-score games. Many analysts make a fair case that one-score games can be mostly determined by random events like fumbles, missed FGs, blown coverages, bobbled picks, etc. What is rarely discussed is that a one-score game can really be the result of the same events happening in a two-score game. A team getting a lucky event that makes a game closer than it deserves to be will be found by this metric. Furthermore, two lucky events can decide the outcome of a two-score game the same way. Basic binomial probability makes it clear that a large number of random events can sway in favor of one team over another in a majority of games. Looking at the number of one-score games to see which teams have been "lucky" or "unlucky" is an important step up from looking at raw wins/losses, yet not every one-score game, or even every game of equal Point Diff, is created equal.
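To put a number on the binomial point, here is a small Python sketch. The setup is my own simplifying assumption rather than anything measured: each random event is treated as an independent 50/50 coin flip between the two teams, and the function name is made up for the example.

```python
from math import comb

def lopsided_split_probability(n_events, k_for_one_team):
    """Chance that either team gets at least k of n fair, independent 50/50 events.

    Requires k > n / 2 so that both teams cannot reach k at the same time.
    """
    if 2 * k_for_one_team <= n_events:
        raise ValueError("k_for_one_team must be more than half of n_events")
    # Number of outcomes where one given team gets k or more events.
    favorable = sum(comb(n_events, j) for j in range(k_for_one_team, n_events + 1))
    # Double it because either team can be the lucky one.
    return 2 * favorable / 2 ** n_events

print(lopsided_split_probability(10, 7))  # 0.34375
```

Under that assumption, with 10 such events in a game, one team ends up with at least 7 of them about 34% of the time, even though neither side is any "luckier" by design.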


Results:


Teams ranked by WP-Expected.
Scatterplot of WP-Expected versus Actual Wins.
Teams ranked by residual.


Relevance:
Do Win Probability Expected wins stack up with the competition?
WP-Expected has a higher R^2 with Actual Wins than either Pyth Wins or DVOA Wins does.


It does not beat Pyth Wins for the 2018 season, but it is close! (Technically for only 31 teams, as a merge error kept LA/LAR off. Oops.)


This is only the case for the least-squares-derived linear relationship though, not for the R^2 of the neutral relationship:
[wins = a + b(WP-Expected)] versus [wins = WP-Expected]
Under the neutral relationship, WP-Expected does not beat either DVOA Wins or Pyth Wins.
How I interpret this is that the actual expected win count from WP-Expected is not as important as the rankings WP-Expected produces.

One reason the neutral R^2 is much smaller is that the standard deviation of WP-Expected is much smaller than that of Pyth, DVOA, or Actual wins.
WP-Expected clusters much more tightly around its center, so either WP could be used in a different function for predictive power, or games are more volatile to random events than appreciated (or both).
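The gap between the two R^2 values can be seen in a short Python sketch. The numbers are invented for illustration (not the season data), and the function names are my own; the point is only that an expected-wins stat can rank teams perfectly and still score poorly against the neutral line when its spread is compressed.

```python
def r2_least_squares(x, y):
    """R^2 of the fitted line y = a + b*x (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def r2_neutral(x, y):
    """R^2 when x itself is taken as the prediction, i.e. y = x with no fitting."""
    my = sum(y) / len(y)
    ss_res = sum((b - a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up example: expected wins bunch between 6 and 10,
# while actual wins range from 2 to 14.
expected = [6, 7, 8, 9, 10]
actual = [2, 5, 8, 11, 14]

print(r2_least_squares(expected, actual))  # perfect linear fit
print(r2_neutral(expected, actual))        # much lower against wins = expected
```

Here the fitted R^2 is 1 because the ordering and spacing line up perfectly, while the neutral R^2 drops to about 0.56 purely because the expected values are too close to the middle.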

Another way to visualize this is a density plot of the residuals. The residuals are the differences between expected wins and actual wins. The spreads are reversed here: the WP-Expected residuals are much more spread out than those of DVOA or Pyth wins.


Other Predictability Analysis:
The litmus test for predictability is comparing midseason splits. Is WP performance in early games reliable at predicting performance in later games?
The method I used was to split the season into the first 8 weeks and the following 9, and find the correlation between them.
Not encouraging, frankly.
One issue with this, though, is the uneven distribution of bye weeks: for some teams the early expected wins cover only 7 games and are being used to predict the remaining 9. The raw expected win totals won't match up, leading to more variability and lower correlation.
To normalize for a fairer comparison, I calculated MWP over every play in each half of the season regardless of game, which is roughly equivalent to WP-Expected divided by games played, in the hope of a better correlation.
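A minimal Python sketch of that normalized split-half comparison follows. It is an illustration under assumptions, not the R code from this project: the (team, week, wp) tuples, the function names, and the toy values are all hypothetical, and every team is assumed to have at least one play in each half.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_mwp(plays, last_early_week=8):
    """Per-team MWP over all plays in weeks 1..last_early_week and in later weeks."""
    sums = {}  # team -> {"early": [wp sum, play count], "late": [wp sum, play count]}
    for team, week, wp in plays:
        half = "early" if week <= last_early_week else "late"
        team_sums = sums.setdefault(team, {"early": [0.0, 0], "late": [0.0, 0]})
        team_sums[half][0] += wp
        team_sums[half][1] += 1
    return {
        team: (s["early"][0] / s["early"][1], s["late"][0] / s["late"][1])
        for team, s in sums.items()
    }

# Toy data (hypothetical): (team, week, wp) for a handful of plays.
plays = [
    ("A", 1, 0.8), ("A", 2, 0.6), ("A", 10, 0.7),
    ("B", 1, 0.5), ("B", 12, 0.5), ("B", 13, 0.3),
    ("C", 3, 0.2), ("C", 9, 0.4),
]

halves = split_half_mwp(plays)
early = [halves[team][0] for team in sorted(halves)]
late = [halves[team][1] for team in sorted(halves)]
print(halves)
print(pearson(early, late))
```

Because each half is an average over plays rather than a sum over games, a team with a bye in the first 8 weeks is not penalized for having played one fewer game.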


This correlation did improve significantly, but it is still not good enough for reliable prediction. There is some optimism though: this could still be incorporated into the analysis toolbox.
It could be one element of prediction methods along with other metrics like point differential, DVOA, EPA, injuries, strength of schedule, etc. A good metric may come from tweaking it or combining it with other formulations to improve predictability. Currently it seems to be more of an explanatory stat that could better differentiate between "lucky" and "unlucky" game outcomes, particularly in the analysis of one-score games.



Interesting Results:
Dallas Cowboys:

WP-Expected was much lower on the Cowboys than DVOA or Pyth Wins, but closer in line with Actual Wins. This is possible evidence for the perception that they perform best in garbage time. If WP is relevant for game events, and DVOA and Point Diff do not consider it, there are bound to be teams whose performance DVOA and Pyth Wins over/underestimate (not that WP-Expected doesn't over/underestimate performance too).

Packers vs Vikings:

There was a lot of controversy about how good the "winning ugly" Packers were, despite being 13-3 and the number 2 seed. One particular argument is about the outcomes of their one-score games. The controversy extended to claims that the 10-6 Vikings were likely as good a team or better. Pyth Wins and DVOA did in fact support this claim, both metrics being higher on the Vikings than the Packers.
That said, WP-Expected was higher on the Packers, closer to where they ended up in Actual Wins, suggesting maybe they were susceptible to conservative playcalling and to allowing teams to keep games closer than people would expect given their MWP (thus winning ugly).

Detroit Lions:

Had to include the team that had the highest residual between expected and actual, even though that residual was negative. It seems every game was a roller coaster that ended with being vomited on by the kid behind you.
Positive regression is likely next year, assuming they stay relatively the same team, but we don't know if late-game coaching decisions are sustainable in the vein of Andy Reid.

Seahawks:

One of the wildest teams of the season

Conclusion:
It might seem obvious to some, but very different stats are not necessarily redundant just because of differences in predictive or explanatory power.
These stats highlight different phenomena and suggest analysis should incorporate any relevant stat, especially when they make different claims.
I hope win probability analysis becomes a common tool in the analytics toolbox alongside, rather than in place of, the good stats we already know.



I want to thank @benbbaldwin, @friscojosh, @Stat_Ron, and especially @LeeSharpeNFL for their wonderful work with nflscrapR and awesome guides!
DVOA and Pythagorean Wins data from footballoutsiders.com, game data from nflscrapR package in R
https://gist.github.com/guga31bb/5634562c5a2a7b1e9961ac9b6c568701
https://github.com/leesharpe/nfldata/blob/master/RSTUDIO-INTRO.md
https://github.com/leesharpe/nfldata/blob/master/WPCHARTS.md
https://github.com/leesharpe/nfldata/blob/master/COLORS_IN_R.md
https://github.com/maksimhorowitz/nflscrapR/blob/master/R/ep_wp_calculator.R
https://github.com/maksimhorowitz/nflscrapR
https://www.dropbox.com/s/5k2wbnroyn8i1ux/Getting%20Started%20With%20R%20for%20NFL.pdf?dl=0

Analysis and Report by Kevin Kraege found at @kevgk2 on Twitter
results:
https://drive.google.com/file/d/1JKWKxCW6MqaPozZNnQFh1lm7XAYGOypE/view?usp=sharing
code here:
https://comfortablynumb-ers.blogspot.com/2020/02/rstudio-code-for-mwp.html
