The XML2R package is meant to simplify scraping XML data from XML files hosted on the web. This example page demonstrates the ideas behind XML2R by walking through pseudo-source code of the scrape
function in the pitchRx package. Similar ideas have been implemented in the bbscrapeR package which provides a high-level API to statistics from nba.com and wnba.com.
The main function in XML2R, XML2Obs
, coerces XML content into a list of observations. An observation is technically a matrix with one row and possibly many columns. One observation includes all XML attributes and the XML value for a particular XML lineage. The name of each list element (or each observation) tracks the XML hierarchy so observations can be grouped together in a sensible fashion at a later point. In the code below, we obtain 946 observations from two different XML files that contain data on Major League Baseball games played on June 14th, 2013.
library(XML2R) pre <- "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/" post <- c("gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml", "gid_2013_06_14_seamlb_oakmlb_1/inning/inning_all.xml") urls <- paste0(pre, post) obs <- XML2Obs(urls, as.equiv=TRUE, quiet=TRUE) table(names(obs))
## ## game game//inning game//inning//bottom//action ## 2 18 18 ## game//inning//bottom//atbat game//inning//bottom//atbat//pitch game//inning//bottom//atbat//po ## 76 255 6 ## game//inning//bottom//atbat//runner game//inning//top//action game//inning//top//atbat ## 66 15 82 ## game//inning//top//atbat//pitch game//inning//top//atbat//po game//inning//top//atbat//runner ## 321 10 77
This output tells us that 255 pitches were thrown in the bottom inning and 321 were thrown in the top inning during these two games. Eventually, we will want to combine observations named 'game//inning//bottom//atbat//pitch'
and 'game//inning//top//atbat//pitch'
into the same table since they share XML attributes (in other words, the observations share variables). XML2R has another function collapse_obs
that can be used to aggregate observations into the same table based on their names.
Before aggregating observations into a collection of tables, we sometimes need to re_name
observations. As mentioned previously, we have some observations with different names, but they should be put into the same table. By passing these names (that should be considered to be on the same level) to the equiv
argument, re_name
will automatically determine the difference in the naming scheme and suppress that difference.
tmp <- re_name(obs, equiv=c("game//inning//top//atbat//pitch", "game//inning//bottom//atbat//pitch"), diff.name="side") tmp <- re_name(tmp, equiv=c("game//inning//top//atbat//runner", "game//inning//bottom//atbat//runner"), diff.name="side") tmp <- re_name(tmp, equiv=c("game//inning//top//atbat//po", "game//inning//bottom//atbat//po"), diff.name="side") tmp <- re_name(tmp, equiv=c("game//inning//top//atbat", "game//inning//bottom//atbat"), diff.name="side") obs2 <- re_name(tmp, equiv=c("game//inning//top//action", "game//inning//bottom//action"), diff.name="side") table(names(obs2))
## ## game game//inning game//inning//action game//inning//atbat ## 2 18 33 158 ## game//inning//atbat//pitch game//inning//atbat//po game//inning//atbat//runner ## 576 16 143
Now we see that 576 pitches were thrown in total. The information we removed is not lost; however, as a new column is appended to the end of each relevant observation. For example, notice how the side column contains the part of the name we just removed:
obs2[grep("game//inning//atbat//po", names(obs2))][1:2]
## $`game//inning//atbat//po` ## des ## [1,] "Pickoff Attempt 1B" ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## side ## [1,] "bottom" ## ## $`game//inning//atbat//po` ## des ## [1,] "Pickoff Attempt 1B" ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## side ## [1,] "top"
After all that renaming, we now have 7 different levels of observations. Let's examine observations on the game\\inning
level:
obs2[grep("^game//inning$", names(obs2))][1:3]
## $`game//inning` ## num away_team home_team next ## [1,] "1" "phi" "col" "Y" ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## ## $`game//inning` ## num away_team home_team next ## [1,] "2" "phi" "col" "Y" ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## ## $`game//inning` ## num away_team home_team next ## [1,] "3" "phi" "col" "Y" ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
Before we aggregate observations into tables, we should preserve the parent-to-child relationships in the XML lineage. For example, we want to be able to identify in what inning a particular pitch was thrown.
obswkey <- add_key(obs2, parent = "game//inning", recycle = "num", key.name = "inning")
As it turns out, the away_team
and home_team
columns are redundant as this information is embeded in the url
column. Thus, there is only one other informative attribute on this level which is next
. By recycling this value to all of its descendants as well, we remove any need to retain a game//inning
table.
obswkey <- add_key(obswkey, parent = "game//inning", recycle = "next")
It is also imperative that we can identify which atbat
a particular pitch
, runner
, and po
belongs to. This can be done as follows:
obswkey <- add_key(obswkey, parent = "game//inning//atbat", recycle = "num")
Finally, we are in a position to pool together observations that have a common name. The collapse_obs
function achieves this by row binding observations together and returning a list of matrices. Note that collapse_obs
does not require that observations from the same level to have the same set of variables in order to be binded into a common table. In the case where variables are missing, NA
s will be used as the value.
tables <- collapse_obs(obswkey) # As mentioned before, we don't need the 'inning' table tables <- tables[-grep("^game//inning$", names(tables))] table.names <- c("game", "action", "atbat", "pitch", "po", "runner") tables <- setNames(tables, table.names) head(tables[["runner"]])
## id start end event score rbi earned ## [1,] "471865" "" "" "Home Run" "T" "T" "T" ## [2,] "429667" "" "1B" "Walk" NA NA NA ## [3,] "429667" "1B" "2B" "Hit By Pitch" NA NA NA ## [4,] "435623" "" "1B" "Hit By Pitch" NA NA NA ## [5,] "435623" "1B" "" "Flyout" NA NA NA ## [6,] "429667" "2B" "" "Flyout" NA NA NA ## url ## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## [2,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## [3,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## [4,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## [5,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## [6,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml" ## side inning next num ## [1,] "bottom" "1" "Y" "6" ## [2,] "top" "2" "Y" "8" ## [3,] "top" "2" "Y" "9" ## [4,] "top" "2" "Y" "9" ## [5,] "top" "2" "Y" "12" ## [6,] "top" "2" "Y" "12"
At this point, you probably want to save these tables. I strongly recommend writing to a database if you plan on repeatedly querying your data. There are multiple packages that allow one to write tables to a database in R including: RSQLite, RMySQL, rmongodb, and DBI. Here is a simple way to create a SQLite database using the dplyr package and copy our tables to that database:
library(dplyr) db <- src_sqlite("my_db.sqlite3") for (i in names(tables)) copy_to(db, df = tables[[i]], name = i)