- Wednesday, October 11, 2006
## Calculating Baseball Statistics With LINQ

I'm a big fan of the new features forthcoming in C# 3.0. The most significant addition is LINQ -- or

**l**anguage**in**tegrated**q**uery. Microsoft shipped an early CTP preview of the new features, along with an impressive number of samples demonstrating the new functionality.Unfortunately, like most Microsoft-authored samples, they are focused around boring business use cases. Maybe it's just me, but I like seeing innovative language features demonstrating something other than calculating the average price for in-stock products (yawn).

Meanwhile, I had recently started working with Retrosheet.org's event files, which contain detailed play-by-play data for most Major League Baseball seasons. After I got a basic representation of the data into memory, I decided to come up with some more interesting LINQ queries to calculate popular baseball stats. I've written many more queries than you see here, but many of them are too similar to be of interest as you might imagine. I chose the examples below based more on their utility as LINQ examples than their utility as useful indicators of baseball talent (that's another post).

Let's play LINQ-ball:

#### Win Leaders

This query determines the pitchers with the most wins in the majors.

WriteRankedList(from game in games group game by game.WinningPitcher into g orderby g.Count() descending select g);

The

`WriteRankedList`

method shows the top 5 entries in the returned list, which is of type`IGrouping<Player, int>`

. The example above uses the new query syntax -- here's an alternative way of writing the query which uses the extension methods directly:WriteRankedList(games.GroupBy(g => g.WinningPitcher). OrderByDescending(g => g.Count()));

In both cases, the following output is produced:

2005 Leaders: Wins 1. Dontrelle Willis (22) 2. Chris Carpenter (21) 2. Bartolo Colon (21) 4. Roy Oswalt (20) 5. Jon Garland (18) 5. Chris Capuano (18) 5. Cliff Lee (18)

The player name information is stored in the

`WinningPitcher`

property, which is an instance of a`Player`

object.#### Save Leaders

Calculating saves is very similar to calculating wins, but involves a check for whether or not there actually was a save.

WriteRankedList(from game in games where game.Save != null group game by game.Save into g orderby g.Count() descending select g);

#### Batting Average Leaders

Batting average is more complicated, and introduces a new LINQ keyword --

`let`

. This allows us to store the result of the calculation in the first portion of the query so we can sort based on the pre-computed data:var battingAverageLeaders = from game in games from play in game.GetPlays() where play.Result != null group play by play.Batter into g where g.Count(p => p.Result.IsAtBat) > 400 let pd = new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(p => p.Result.IsHit) / (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d). ToString(".000") } orderby pd.Data descending select pd;

For the sake of comparison, here's how to perform the calculation using the LINQ extension methods directly:

var battingAverageLeaders = games.SelectMany(g => g.GetPlays()).Where(p => p.Result != null). GroupBy(p => p.Batter).Where(g => g.Count(p => p.Result.IsAtBat) > 400). Select(g => new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(p => p.Result.IsHit) / (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d). ToString(".000") }).OrderByDescending(pd => pd.Data);

This is a good example of a query where the new query syntax improves readability considerably.

#### On Base Average

Calculating on base average requires some more complicated checks -- to enhance readability, I partitioned the pieces of the calculation into separate lambdas.

Func<Play, bool> obaPredicateTop = (p => p.Result.IsHit || p.Result == PlayResult.Walk || p.Result == PlayResult.IntentionalWalk || p.Result == PlayResult.HitByPitch); Func<Play, bool> obaPredicateBottom = (p => p.Result.IsAtBat || p.Result == PlayResult.Walk || p.Result == PlayResult.IntentionalWalk || p.Result == PlayResult.HitByPitch || p.Result == PlayResult.SacrificeFly); var onBaseAverage = from game in games from play in game.GetPlays() where play.Result != null group play by play.Batter into g where g.Count() > 400 let pd = new PlayerData { Player = g.Key, Data = (Math.Round(1000.0d * ((double)g.Count(obaPredicateTop) / (double)g.Count(obaPredicateBottom))) / 1000.0d). ToString(".000") } orderby pd.Data descending select pd;

Since I haven't shown the actual program output since the "wins" sample, here's the output for on-base average:

2005 Leaders: On Base Average (Minimum 400 PA) 1. Todd Helton (.432) 2. Jason Giambi (.422) 3. Albert Pujols (.418) 4. Alex Rodriguez (.410) 5. Brian Giles (.409)

Other statistics can be calculated using similar mechanisms. I haven't shown any examples involving runner movement (available per-play), since that would require a bit more explanation about how I'm parsing the log files, and that code isn't quite pretty enough to share just yet. Once I get further along, I'll release the game parsing code so you can write your own queries!