- Wednesday, October 11, 2006
Calculating Baseball Statistics With LINQ
I'm a big fan of the new features forthcoming in C# 3.0. The most significant addition is LINQ -- or language integrated query. Microsoft shipped an early CTP preview of the new features, along with an impressive number of samples demonstrating the new functionality.
Unfortunately, like most Microsoft-authored samples, they are focused around boring business use cases. Maybe it's just me, but I like seeing innovative language features demonstrating something other than calculating the average price for in-stock products (yawn).
Meanwhile, I had recently started working with Retrosheet.org's event files, which contain detailed play-by-play data for most Major League Baseball seasons. After I got a basic representation of the data into memory, I decided to come up with some more interesting LINQ queries to calculate popular baseball stats. I've written many more queries than you see here, but many of them are too similar to be of interest as you might imagine. I chose the examples below based more on their utility as LINQ examples than their utility as useful indicators of baseball talent (that's another post).
Let's play LINQ-ball:
Win Leaders
This query determines the pitchers with the most wins in the majors.
WriteRankedList(from game in games group game by game.WinningPitcher into g orderby g.Count() descending select g);
The
WriteRankedList
method shows the top 5 entries in the returned list, which is of typeIGrouping<Player, int>
. The example above uses the new query syntax -- here's an alternative way of writing the query which uses the extension methods directly:WriteRankedList(games.GroupBy(g => g.WinningPitcher). OrderByDescending(g => g.Count()));
In both cases, the following output is produced:
2005 Leaders: Wins 1. Dontrelle Willis (22) 2. Chris Carpenter (21) 2. Bartolo Colon (21) 4. Roy Oswalt (20) 5. Jon Garland (18) 5. Chris Capuano (18) 5. Cliff Lee (18)
The player name information is stored in the
WinningPitcher
property, which is an instance of aPlayer
object.Save Leaders
Calculating saves is very similar to calculating wins, but involves a check for whether or not there actually was a save.
WriteRankedList(from game in games where game.Save != null group game by game.Save into g orderby g.Count() descending select g);
Batting Average Leaders
Batting average is more complicated, and introduces a new LINQ keyword --
let
. This allows us to store the result of the calculation in the first portion of the query so we can sort based on the pre-computed data:var battingAverageLeaders = from game in games from play in game.GetPlays() where play.Result != null group play by play.Batter into g where g.Count(p => p.Result.IsAtBat) > 400 let pd = new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(p => p.Result.IsHit) / (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d). ToString(".000") } orderby pd.Data descending select pd;
For the sake of comparison, here's how to perform the calculation using the LINQ extension methods directly:
var battingAverageLeaders = games.SelectMany(g => g.GetPlays()).Where(p => p.Result != null). GroupBy(p => p.Batter).Where(g => g.Count(p => p.Result.IsAtBat) > 400). Select(g => new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(p => p.Result.IsHit) / (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d). ToString(".000") }).OrderByDescending(pd => pd.Data);
This is a good example of a query where the new query syntax improves readability considerably.
On Base Average
Calculating on base average requires some more complicated checks -- to enhance readability, I partitioned the pieces of the calculation into separate lambdas.
Func<Play, bool> obaPredicateTop = (p => p.Result.IsHit || p.Result == PlayResult.Walk || p.Result == PlayResult.IntentionalWalk || p.Result == PlayResult.HitByPitch); Func<Play, bool> obaPredicateBottom = (p => p.Result.IsAtBat || p.Result == PlayResult.Walk || p.Result == PlayResult.IntentionalWalk || p.Result == PlayResult.HitByPitch || p.Result == PlayResult.SacrificeFly); var onBaseAverage = from game in games from play in game.GetPlays() where play.Result != null group play by play.Batter into g where g.Count() > 400 let pd = new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(obaPredicateTop) / (double)g.Count(obaPredicateBottom))) / 1000.0d). ToString(".000") } orderby pd.Data descending select pd;
Since I haven't shown the actual program output since the "wins" sample, here's the output for on-base average:
2005 Leaders: On Base Average (Minimum 400 PA) 1. Todd Helton (.432) 2. Jason Giambi (.422) 3. Albert Pujols (.418) 4. Alex Rodriguez (.410) 5. Brian Giles (.409)
Other statistics can be calculated using similar mechanisms. I haven't shown any examples involving runner movement (available per-play), since that would require a bit more explanation about how I'm parsing the log files, and that code isn't quite pretty enough to share just yet. Once I get further along, I'll release the game parsing code so you can write your own queries!
Comments
- Monday, October 16, 2006 7:53:12 PM by Matt Freiburg
- Wednesday, October 18, 2006 5:30:23 PM by Dan MiserThis is phenomenal. Nicely done.
I, too, would love to see more articles in this series on how you structured, parsed, and imported the game data, along with adding more info to an application that could be used to answer tons of "what ifs". - Thursday, January 04, 2007 1:58:21 PM by Chua Wen ChingWow. Thanks for sharing. It englightens me. Cool stuff.
- Monday, January 15, 2007 12:42:26 AM by xxThis is great? but Visual FoxPro 9.0 has this arrond 5 years old.
- Monday, June 02, 2008 4:12:27 PM by Stephen F. PimientaI've done a few statistics with LINQPAD I bought the O'Reilly ‘s LINQ pocket reference and the samples were ridiculous. Tom ,Dick ,and Harry examples to think that people get paid for providing such lame examples. As far as Foxpro goes yes I am sure it can do this… so can Excel. The issue is what can be done in a program outside of a database and the ease in which can be done. It can be done in any programming language C, C++, VB, perl, C# and if you want to explore functional programming (Read Excel under the covers) further Ocaml , Haskel and Microsoft’s F# . With that said thanks for providing a useful example.
I always wanted to be a baseball statistician when I grew up, and now that I'm in IT, I've always pondered how the real players in the field store game data. This is going to answer a LOT of questions I've had stirring in the back of my mind for some time now.