Derek Slager

  • Wednesday, October 11, 2006

    Calculating Baseball Statistics With LINQ

    I'm a big fan of the new features forthcoming in C# 3.0. The most significant addition is LINQ -- or language integrated query. Microsoft shipped an early CTP preview of the new features, along with an impressive number of samples demonstrating the new functionality.

    Unfortunately, like most Microsoft-authored samples, they are focused around boring business use cases. Maybe it's just me, but I like seeing innovative language features demonstrating something other than calculating the average price for in-stock products (yawn).

    Meanwhile, I had recently started working with Retrosheet.org's event files, which contain detailed play-by-play data for most Major League Baseball seasons. After I got a basic representation of the data into memory, I decided to come up with some more interesting LINQ queries to calculate popular baseball stats. I've written many more queries than you see here, but many of them are too similar to be of interest as you might imagine. I chose the examples below based more on their utility as LINQ examples than their utility as useful indicators of baseball talent (that's another post).

    Let's play LINQ-ball:

    Win Leaders

    This query determines the pitchers with the most wins in the majors.

    WriteRankedList(from game in games
                    group game by game.WinningPitcher into g
                    orderby g.Count() descending
                    select g);

    The WriteRankedList method shows the top 5 entries in the returned list, which is of type IGrouping<Player, int>. The example above uses the new query syntax -- here's an alternative way of writing the query which uses the extension methods directly:

    WriteRankedList(games.GroupBy(g => g.WinningPitcher).
        OrderByDescending(g => g.Count()));

    In both cases, the following output is produced:

    2005 Leaders: Wins
    1. Dontrelle Willis (22)
    2. Chris Carpenter (21)
    2. Bartolo Colon (21)
    4. Roy Oswalt (20)
    5. Jon Garland (18)
    5. Chris Capuano (18)
    5. Cliff Lee (18)

    The player name information is stored in the WinningPitcher property, which is an instance of a Player object.

    Save Leaders

    Calculating saves is very similar to calculating wins, but involves a check for whether or not there actually was a save.

    WriteRankedList(from game in games
                    where game.Save != null
                    group game by game.Save into g
                    orderby g.Count() descending
                    select g);

    Batting Average Leaders

    Batting average is more complicated, and introduces a new LINQ keyword -- let. This allows us to store the result of the calculation in the first portion of the query so we can sort based on the pre-computed data:

    var battingAverageLeaders =
        from game in games
        from play in game.GetPlays()
        where play.Result != null
        group play by play.Batter into g
        where g.Count(p => p.Result.IsAtBat) > 400
        let pd = new PlayerData {
            Player = g.Key,
            Data = (Math.Round(1000.0d *
                               ((double)g.Count(p => p.Result.IsHit) /
                                (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d).
                                    ToString(".000") }
        orderby pd.Data descending
        select pd;

    For the sake of comparison, here's how to perform the calculation using the LINQ extension methods directly:

    var battingAverageLeaders =
        games.SelectMany(g => g.GetPlays()).Where(p => p.Result != null).
        GroupBy(p => p.Batter).Where(g => g.Count(p => p.Result.IsAtBat) > 400).
        Select(g => new PlayerData {
                   Player = g.Key,
                   Data = (Math.Round(1000.0d *
                                      ((double)g.Count(p => p.Result.IsHit) /
                                       (double)g.Count(p => p.Result.IsAtBat))) / 1000.0d).
                                           ToString(".000")
               }).OrderByDescending(pd => pd.Data);

    This is a good example of a query where the new query syntax improves readability considerably.

    On Base Average

    Calculating on base average requires some more complicated checks -- to enhance readability, I partitioned the pieces of the calculation into separate lambdas.

    Func<Play, bool> obaPredicateTop =
        (p => p.Result.IsHit ||
              p.Result == PlayResult.Walk ||
              p.Result == PlayResult.IntentionalWalk ||
              p.Result == PlayResult.HitByPitch);
    
    Func<Play, bool> obaPredicateBottom =
        (p => p.Result.IsAtBat ||
              p.Result == PlayResult.Walk ||
              p.Result == PlayResult.IntentionalWalk ||
              p.Result == PlayResult.HitByPitch ||
              p.Result == PlayResult.SacrificeFly);
    
    var onBaseAverage =
        from game in games
        from play in game.GetPlays()
        where play.Result != null
        group play by play.Batter into g
        where g.Count() > 400
        let pd = new PlayerData {
            Player = g.Key,
            Data = (Math.Round(1000.0d *
                    ((double)g.Count(obaPredicateTop) /
                     (double)g.Count(obaPredicateBottom))) / 1000.0d).
                         ToString(".000") }
        orderby pd.Data descending
        select pd;

    Since I haven't shown the actual program output since the "wins" sample, here's the output for on-base average:

    2005 Leaders: On Base Average (Minimum 400 PA)
    1. Todd Helton (.432)
    2. Jason Giambi (.422)
    3. Albert Pujols (.418)
    4. Alex Rodriguez (.410)
    5. Brian Giles (.409)

    Other statistics can be calculated using similar mechanisms. I haven't shown any examples involving runner movement (available per-play), since that would require a bit more explanation about how I'm parsing the log files, and that code isn't quite pretty enough to share just yet. Once I get further along, I'll release the game parsing code so you can write your own queries!

    5 CommentsPosted at 4:25 AM to Categories: LINQ Baseball

About

VP of Development at Appature, a Seattle-area company building a healthcare-focused relationship marketing platform.

Email Me

@derekslager

I'm hiring!

Recent

  • Introducing Closure Snippets
  • An Introduction to Google Closure
  • DVCS Myths
  • A DVCS Story
  • BCrypt.net - Strong Password Hashing for .NET and Mono

Categories

  • .NET
  • Baseball
  • C#
  • DVCS
  • Emacs
  • Google Closure
  • JavaScript
  • LINQ
  • Security
  • Windows
© Copyright 2013 Derek Slager