The R package pitchRx provides tools for collecting Major League Baseball (MLB) Gameday data and visualizing PITCHf/x. This page provides a rough overview of its scope; the R Journal article is more comprehensive. The source file used to generate this page shows how to embed pitchRx animations into documents using knitr. If coding isn’t your thing, you might want to just play with my PITCHf/x visualization app!

Data Collection

Collecting ‘smallish’ data

pitchRx makes it simple to acquire PITCHf/x directly from its source. Here, pitchRx’s scrape() function is used to collect all PITCHf/x data recorded on June 1st, 2013.

library(pitchRx)
dat <- scrape(start = "2013-06-01", end = "2013-06-01")
names(dat)
## [1] "atbat"  "action" "pitch"  "po"     "runner"
dim(dat[["pitch"]])
## [1] 4682   49

By default, scrape() returns a list of 5 data frames. The 'pitch' data frame contains the actual PITCHf/x data, which is recorded on a pitch-by-pitch basis. The dimensions of this data frame indicate that 4682 pitches were thrown on June 1st, 2013. If your analysis requires PITCHf/x data over many months, you surely don’t want to pull all that data into a single R session! For this and other reasons, scrape() can write directly to a database (see the “Managing PITCHf/x data in bulk” section).

Collecting data by Gameday IDs

In the previous example, scrape() determines the relevant game IDs from the start and end dates. If you want a more specific query based on particular games, the relevant game IDs can be passed to the game.ids argument. The built-in gids data object lists the available IDs.

data(gids, package = "pitchRx")
head(gids)
## [1] "gid_2008_02_26_fanbbc_phimlb_1" "gid_2008_02_26_flsbbc_detmlb_1"
## [3] "gid_2008_02_26_umibbc_flomlb_1" "gid_2008_02_26_umwbbc_nynmlb_1"
## [5] "gid_2008_02_27_cinmlb_phimlb_1" "gid_2008_02_27_colmlb_chamlb_1"

As you can see, the gids object contains game IDs, and each ID encodes the game date as well as abbreviations for the away and home team names. Since the away team is always listed first, we could do the following to collect PITCHf/x data from every away game played by the Minnesota Twins in June of 2013.

# Match the date, then "minmlb" in the first (away) team slot
MNaway13 <- gids[grep("2013_06_[0-9]{2}_minmlb_", gids)]
dat2 <- scrape(game.ids = MNaway13)
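To see how this kind of pattern separates away games from home games, here is a minimal sketch that runs without downloading anything. The three game IDs are made up for illustration (they follow the format shown above but are not taken from gids), and the date-parsing step assumes the date always occupies characters 5 to 14 of an ID.

```r
# Hypothetical game IDs in the gid_YYYY_MM_DD_away_home_N format (illustration only)
example_ids <- c(
  "gid_2013_06_01_minmlb_detmlb_1",  # Twins listed first: away game in June
  "gid_2013_06_02_clemlb_minmlb_1",  # Twins listed second: home game in June
  "gid_2013_07_04_minmlb_nyamlb_1"   # Twins listed first, but in July
)

# The team code directly after the date is the away team
pattern <- "2013_06_[0-9]{2}_minmlb_"
away_june <- example_ids[grepl(pattern, example_ids)]
print(away_june)

# The game date can be recovered from the ID itself
game_dates <- as.Date(substr(example_ids, 5, 14), format = "%Y_%m_%d")
print(game_dates)
```

Only the first ID is kept, because it is the only one with a June date followed immediately by the Twins' abbreviation.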

Managing PITCHf/x data in bulk

Creating and maintaining a PITCHf/x database is a breeze with pitchRx and dplyr. With a few lines of code (and some patience), all available PITCHf/x data can be obtained directly from its source and stored in a local SQLite database:

library(dplyr)
# Create (or connect to) a local SQLite database file
db <- src_sqlite("pitchfx.sqlite3", create = TRUE)
scrape(start = "2008-01-01", end = Sys.Date(), connect = db$con)

The website that hosts PITCHf/x data also offers a wealth of other data that might come in handy for PITCHf/x analysis. The files that contain PITCHf/x always end with inning/inning_all.xml. scrape() can also collect data from three other types of files: miniscoreboard.xml, players.xml, and inning/inning_hit.xml. Data from these files can easily be added to our existing PITCHf/x database:

files <- c("miniscoreboard.xml", "players.xml", "inning/inning_hit.xml")
scrape(start = "2008-01-01", end = Sys.Date(), suffix = files, connect = db$con)
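Once the tables live in a database, you can filter them there and pull only the rows you need into R. The sketch below illustrates the idea on a toy in-memory SQLite table rather than on real scraped data. It assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed. The column names pitch_type and start_speed, the codes "FF", "FC", and "CU", and all values are assumptions made for illustration.

```r
library(DBI)
library(dplyr)

# Toy stand-in for a scraped 'pitch' table; every value here is made up
toy_pitch <- data.frame(
  pitch_type  = c("FF", "FC", "FF", "CU"),
  start_speed = c(93.1, 90.4, 94.0, 78.2)
)

# An in-memory database keeps the example self-contained
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "pitch", toy_pitch)

# The filter is translated to SQL and runs inside the database;
# collect() then brings only the matching rows into R
fastballs <- tbl(con, "pitch") %>%
  filter(pitch_type == "FF") %>%
  collect()

dbDisconnect(con)
print(fastballs)
```

With a real database, the same query pattern would be pointed at the connection to pitchfx.sqlite3 instead of the in-memory one.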

Building your own custom scraper

pitchRx is built on top of the R package XML2R. In this post, I demonstrate how to use XML2R and pitchRx to collect attendance data from the Gameday site (similar methods can be used to collect other Gameday data). For a more detailed look at XML2R, see the introductory webpage and/or the R Journal paper.

PITCHf/x Visualization

2D animation

pitchRx comes pre-packaged with a pitches data frame containing four-seam and cut fastballs thrown by Mariano Rivera and Phil Hughes during the 2011 season. These pitches are used to demonstrate PITCHf/x animations using animateFX(). As the animation progresses, the pitches come closer to the viewer (that is, imagine you are the umpire or catcher, watching the pitcher throw directly at you). In the animation below, the horizontal and vertical location of each pitch is plotted every tenth of a second of real time until it reaches home plate. Since watching animations in real time can be painful, this animation slows playback by holding each frame for half a second.

library(ggplot2)  # provides facet_grid() and label_both
# Adding ggplot2 functions to animateFX() output with `+` won't work, but
# you can pass a list of layers to the layer argument like this:
x <- list(
  facet_grid(pitcher_name ~ stand, labeller = label_both)
)
animateFX(pitches, layer = x)
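To get a feel for how few frames a single pitch produces at one frame per tenth of a second, here is a rough sketch. It assumes a constant-acceleration model of ball flight. The parameter names follow PITCHf/x-style fields, but every value is made up, and the 1.417 ft position of the front of home plate is also an assumption. This illustrates the idea only and is not a description of how animateFX() is implemented.

```r
# Hypothetical release parameters for one pitch
# (positions in feet, velocities in ft/s, accelerations in ft/s^2)
pitch <- list(x0 = -1.5, y0 = 50, z0 = 6,
              vx0 = 5, vy0 = -130, vz0 = -5,
              ax = -10, ay = 25, az = -20)
plate_y <- 1.417  # assumed y-coordinate of the front of home plate

# Time of flight: smaller root of y0 + vy0 * t + 0.5 * ay * t^2 = plate_y
a <- 0.5 * pitch$ay
b <- pitch$vy0
cc <- pitch$y0 - plate_y
flight_time <- (-b - sqrt(b^2 - 4 * a * cc)) / (2 * a)

# One frame every tenth of a second until the ball reaches the plate
times <- seq(0, flight_time, by = 0.1)
frames <- data.frame(
  t = times,
  x = pitch$x0 + pitch$vx0 * times + 0.5 * pitch$ax * times^2,  # horizontal location
  z = pitch$z0 + pitch$vz0 * times + 0.5 * pitch$az * times^2   # vertical location
)
print(frames)
```

With these made-up values the ball is in the air for well under half a second, so the pitch yields only a handful of frames. That is why stretching the delay between frames makes the animation easier to follow.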