3. Data Download

This vignette shows how to download data from Dataverse using the dataverse package. We’ll focus on a Dataverse repository that contains supplemental files for the book Political Analysis Using R (2015), which is stored at Harvard’s Dataverse Server (https://dataverse.harvard.edu).

The Dataverse entry for this study is persistently retrievable by a “Digital Object Identifier (DOI)”: https://doi.org/10.7910/DVN/ARKOTI and the citation on the Dataverse Page includes a “Universal Numeric Fingerprint (UNF)”: UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==, which provides a versioned, multi-file hash for the entire study, which contains 32 files.

The following examples will draw from the Harvard Dataverse, so it is convenient to set this as a default environment variable.

Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

This is equivalent to setting server = "dataverse.harvard.edu" in every dataverse function each time. Note that if you set an environment variable like the above, that operation is necessary to make your code reproducible on a different machine.

For downloading a public dataset, no API Key is needed.

Retrieving Plain-Text Data

We will download public data files and examine them directly in R using the dataverse package.

library("dataverse")
library("tibble") # to see dataframes in tidyverse-form

First, we retrieve a plain-text file like this dataset on electricity consumption by Wakiyama et al. (2014). Taking the file name and dataset DOI from this entry,

energy <- get_dataframe_by_name(
  filename = "comprehensiveJapanEnergy.tab",
  dataset = "10.7910/DVN/ARKOTI", 
  server = "dataverse.harvard.edu")
head(energy)
## # A tibble: 6 × 10
##    time date  dummy  temp temp2      all    large    house    kepco    tepco
##   <dbl> <chr> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     1 8-Jan     0   5.9  34.8 95792389 35194957 26190714 13357735 26960899
## 2     2 8-Feb     0   5.5  30.3 95156901 35322031 24224097 13315027 27189705
## 3     3 8-Mar     0  10.7 114.  91034047 36474192 21391965 12805831 24495519
## 4     4 8-Apr     0  14.7 216.  84087552 34949622 18494473 11494328 23540356
## 5     5 8-May     0  18.5 342.  82742929 35417089 17923760 11589061 22848737
## 6     6 8-Jun     0  21.3 454.  82180013 36692291 15205229 11360771 22487441

These get_dataframe_* functions, introduced in v0.3.0, directly read in the data into a R environment through whatever R function supplied by .f. The default of the get_dataframe_* functions is to read in such data by readr::read_tsv(). The .f function can be modified to modify the read-in settings. For example, the following modification is a base-R equivalent to read in the ingested data.

library(readr)
energy <- get_dataframe_by_name(
  filename = "comprehensiveJapanEnergy.tab",
  dataset = "10.7910/DVN/ARKOTI", 
  server = "dataverse.harvard.edu",
  .f = function(x) read.delim(x, sep = "\t"))

head(energy)
##   time  date dummy temp temp2      all    large    house    kepco    tepco
## 1    1 8-Jan     0  5.9  34.8 95792389 35194957 26190714 13357735 26960899
## 2    2 8-Feb     0  5.5  30.3 95156901 35322031 24224097 13315027 27189705
## 3    3 8-Mar     0 10.7 114.5 91034047 36474192 21391965 12805831 24495519
## 4    4 8-Apr     0 14.7 216.1 84087552 34949622 18494473 11494328 23540356
## 5    5 8-May     0 18.5 342.3 82742929 35417089 17923760 11589061 22848737
## 6    6 8-Jun     0 21.3 453.7 82180013 36692291 15205229 11360771 22487441

The dataverse package can also download datasets that are drafts (i.e. versions not released publicly), as long as the user of the dataset provides their appropriate DATAVERSE_KEY. Users may need to modify the metadata of a datafile, such as adding a descriptive label, for the data downloading to work properly in this case. This is because the the file identifier UNF, which the read function relies on, may only appear after metadata has been added.

Caching large datasets

As of v0.3.15, datasets are cached on your computer if the user specifies a version of the dataset. The next time the code is run, the function will read from the cache rather than re-downloading from the Dataverse. Version specification can be done, e.g., by setting version = "3" for V3, for instance. This is useful to avoid re-downloading the identical dataset every time, especially if they take some time to download. To turn off or view the settings of caching, see cache_dataset().

Retrieving Custom Data Formats (RDS, Stata, SPSS)

If a file is displayed on dataverse as a .tab file like the survey data by Alvarez et al. (2013), it is likely that Dataverse ingested the file to a plain-text, tab-delimited format.

argentina_tab <- get_dataframe_by_name(
  filename = "alpl2013.tab",
  dataset = "10.7910/DVN/ARKOTI",
  server = "dataverse.harvard.edu")

However, ingested files may not retain important dataset attributes. For example, Stata and SPSS datasets encode value labels on to numeric values. Factor variables in R dataframes encode levels, not only labels. A plain-text ingested file will discard such information. For example, the polling_place variable in this data is only given by numbers, although the original data labelled these numbers with informative values.

str(argentina_tab$polling_place)
## num [1:1475] 31 31 31 31 31 31 31 31 31 31 ...

When ingesting, Dataverse retains a original version that retains these attributes but may not be readable in some platforms. The get_dataframe_* functions have an argument that can be set to original = TRUE. In this case we know that alpl2013.tab was originally a Stata dta file, so we can run:

argentina_dta <- get_dataframe_by_name(
  filename = "alpl2013.tab",
  dataset = "10.7910/DVN/ARKOTI",
  server = "dataverse.harvard.edu",
  original = TRUE,
  .f = haven::read_dta)

Now we see that labels are read in through haven’s labelled variables class:

str(argentina_dta$polling_place)
##  dbl+lbl [1:1475] 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3...
##  @ label       : chr "polling_place"
##  @ format.stata: chr "%9.0g"
##  @ labels      : Named num [1:37] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:37] "E.E.T." "Escuela Juan Bautista Alberdi" "Escuela Juan Carlos Dávalos" "Escuela Bernardino de Rivadavia" ...

Users should pick .f and original based on their existing knowledge of the file. If the original file is a .sav SPSS file, .f can be haven::read_sav. If it is a .Rds file, use readRDS or readr::read_rds. In fact, because the raw data is read in as a binary, there is no limitation to the file types get_dataframe_* can read in, as far as the dataverse package is concerned.

There are two more ways to read in a dataframe other than get_dataframe_by_name().

  • get_dataframe_by_doi() takes in a file-specific DOI if Dataverse contains one such as https://doi.org/10.7910/DVN/ARKOTI/IJPVOI. This removes the necessity for users to set the dataset argument.
  • get_dataframe_by_id() takes a numeric Dataverse identification number. This identifier is an internal number and is not prominently featured in the interface.

In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files.

Retrieving Metadata

We may also want to retrieve some basic metadata about the dataset. The get_dataset() function lists all of the files in the dataset along with a considerable amount of metadata about each. (Recall that in Dataverse, dataset is a collection of files, not a single file.) We can see a quick glance at these files using:

dataset <- get_dataset("doi:10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu")
dataset$files[c("filename", "contentType")]

This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files.

You can also retrieve more extensive metadata using dataset_metadata():

str(dataset_metadata("10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu"), 
    max.level = 2)
## List of 3
##  $ displayName: chr "Citation Metadata"
##  $ name       : chr "citation"
##  $ fields     :'data.frame': 7 obs. of  4 variables:
##   ..$ typeName : chr [1:7] "title" "author" "datasetContact" "dsDescription" ...
##   ..$ multiple : logi [1:7] FALSE TRUE TRUE TRUE TRUE FALSE ...
##   ..$ typeClass: chr [1:7] "primitive" "compound" "compound" "compound" ...
##   ..$ value    :List of 7

Retrieving Scripts and Other Files

If the file you want to retrieve is not data, you may want to use the more primitive function, get_file, which gets the file data as a raw binary file. See the help page examples of get_file() that use the base::writeBin() function for details on how to write and read these binary files instead.

code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu")
writeBin(code3, "chapter03.R")