This vignette shows how to download data from Dataverse using the dataverse package. We’ll focus on a Dataverse repository that contains supplemental files for the book Political Analysis Using R (2015), which is stored at Harvard’s Dataverse Server (https://dataverse.harvard.edu).
The Dataverse entry for this study is persistently retrievable by a “Digital Object Identifier (DOI)”: https://doi.org/10.7910/DVN/ARKOTI, and the citation on the Dataverse page includes a “Universal Numeric Fingerprint (UNF)”: UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==, which provides a versioned, multi-file hash for the entire study. The study contains 32 files.
The following examples will draw from the Harvard Dataverse, so it is convenient to set this as a default environment variable.
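A minimal sketch of setting that default for the current R session. It assumes the package reads its default server from an environment variable named DATAVERSE_SERVER; that name comes from the dataverse package documentation rather than from the text above.

```r
# Assumption: the dataverse package looks up its default server in the
# DATAVERSE_SERVER environment variable.
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")

# Confirm the value that later calls in this session would pick up.
Sys.getenv("DATAVERSE_SERVER")
```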
This is equivalent to setting server = "dataverse.harvard.edu" in every dataverse function call. Note that if you rely on an environment variable in this way, the line that sets it must be part of your script for the code to be reproducible on a different machine.
For downloading a public dataset, no API Key is needed.
We will download public data files and examine them directly in R using the dataverse package.
First, we retrieve a plain-text file, such as this dataset on electricity consumption by Wakiyama et al. (2014). Taking the file name and dataset DOI from this entry, we run:
library(dataverse)
energy <- get_dataframe_by_name(
  filename = "comprehensiveJapanEnergy.tab",
  dataset  = "10.7910/DVN/ARKOTI",
  server   = "dataverse.harvard.edu")
head(energy)
## # A tibble: 6 × 10
## time date dummy temp temp2 all large house kepco tepco
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 8-Jan 0 5.9 34.8 95792389 35194957 26190714 13357735 26960899
## 2 2 8-Feb 0 5.5 30.3 95156901 35322031 24224097 13315027 27189705
## 3 3 8-Mar 0 10.7 114. 91034047 36474192 21391965 12805831 24495519
## 4 4 8-Apr 0 14.7 216. 84087552 34949622 18494473 11494328 23540356
## 5 5 8-May 0 18.5 342. 82742929 35417089 17923760 11589061 22848737
## 6 6 8-Jun 0 21.3 454. 82180013 36692291 15205229 11360771 22487441
These get_dataframe_* functions, introduced in v0.3.0, read the data directly into the R environment using whatever R function is supplied to .f. By default, the get_dataframe_* functions read such data with readr::read_tsv(). The .f argument can be changed to adjust how the file is read. For example, the following modification is a base-R equivalent for reading in the ingested data.
library(dataverse)
energy <- get_dataframe_by_name(
  filename = "comprehensiveJapanEnergy.tab",
  dataset  = "10.7910/DVN/ARKOTI",
  server   = "dataverse.harvard.edu",
  # read.delim() is base R, so readr does not need to be attached here
  .f       = function(x) read.delim(x, sep = "\t"))
head(energy)
## time date dummy temp temp2 all large house kepco tepco
## 1 1 8-Jan 0 5.9 34.8 95792389 35194957 26190714 13357735 26960899
## 2 2 8-Feb 0 5.5 30.3 95156901 35322031 24224097 13315027 27189705
## 3 3 8-Mar 0 10.7 114.5 91034047 36474192 21391965 12805831 24495519
## 4 4 8-Apr 0 14.7 216.1 84087552 34949622 18494473 11494328 23540356
## 5 5 8-May 0 18.5 342.3 82742929 35417089 17923760 11589061 22848737
## 6 6 8-Jun 0 21.3 453.7 82180013 36692291 15205229 11360771 22487441
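As a rough illustration of customizing .f, the sketch below defines a hypothetical reader, read_energy(), and tries it on a small local tab-delimited file instead of a download. It assumes, as the example above suggests, that .f is called with a single argument that read.delim() can open as a file. The toy rows reuse values from the output shown above.

```r
# Hypothetical reader: keep `date` as text and never create factors.
read_energy <- function(path) {
  read.delim(path, sep = "\t", stringsAsFactors = FALSE,
             colClasses = c(date = "character"))
}

# Stand-in for a downloaded file: a tiny tab-delimited file on disk.
toy_path <- tempfile(fileext = ".tab")
writeLines(c("time\tdate\ttemp",
             "1\t8-Jan\t5.9",
             "2\t8-Feb\t5.5"), toy_path)

toy <- read_energy(toy_path)
str(toy)

# With network access and the dataverse package installed, the same
# function could be passed as `.f`:
# energy <- dataverse::get_dataframe_by_name(
#   filename = "comprehensiveJapanEnergy.tab",
#   dataset  = "10.7910/DVN/ARKOTI",
#   server   = "dataverse.harvard.edu",
#   .f       = read_energy)
```

Writing the reader as a named function makes it easy to check against a local file before using it in a download call.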
The dataverse package can also download datasets that are drafts (i.e., versions not released publicly), as long as the user provides the appropriate DATAVERSE_KEY. In this case, users may need to modify the metadata of a datafile, such as adding a descriptive label, for the download to work properly. This is because the file identifier UNF, which the read function relies on, may only appear after metadata has been added.
As of v0.3.15, datasets are cached on your computer if the user specifies a version of the dataset. The next time the code is run, the function will read from the cache rather than re-downloading from the Dataverse. A version can be specified by setting, for instance, version = "3" for V3. This avoids re-downloading an identical dataset every time, which is especially useful if the download takes some time. To turn off caching or view its settings, see cache_dataset().
If a file is displayed on Dataverse as a .tab file, like the survey data by Alvarez et al. (2013), it is likely that Dataverse ingested the file into a plain-text, tab-delimited format.
argentina_tab <- get_dataframe_by_name(
  filename = "alpl2013.tab",
  dataset  = "10.7910/DVN/ARKOTI",
  server   = "dataverse.harvard.edu")
However, ingested files may not retain important dataset attributes. For example, Stata and SPSS datasets encode value labels onto numeric values. Factor variables in R dataframes encode levels, not only labels. A plain-text ingested file will discard such information. For example, the polling_place variable in this data is only given by numbers, although the original data labelled these numbers with informative values.
When ingesting, Dataverse keeps an original version that retains these attributes but may not be readable on some platforms. The get_dataframe_* functions have an argument that can be set to original = TRUE. In this case we know that alpl2013.tab was originally a Stata dta file, so we can run:
argentina_dta <- get_dataframe_by_name(
  filename = "alpl2013.tab",
  dataset  = "10.7910/DVN/ARKOTI",
  server   = "dataverse.harvard.edu",
  original = TRUE,
  .f       = haven::read_dta)
Inspecting the polling_place variable (for example with str()) now shows that the labels are read in through haven’s labelled variable class:
## dbl+lbl [1:1475] 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3...
## @ label : chr "polling_place"
## @ format.stata: chr "%9.0g"
## @ labels : Named num [1:37] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:37] "E.E.T." "Escuela Juan Bautista Alberdi" "Escuela Juan Carlos Dávalos" "Escuela Bernardino de Rivadavia" ...
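To make concrete what value labels add, here is a small base-R sketch. The codes are toy values, not the real data, and only the first two label names from the output above are reused. It shows how a named vector of value labels maps numeric codes to readable names, which is the information the plain-text ingested file loses.

```r
# Toy codes standing in for a labelled column (not the real data).
codes <- c(1, 2, 2, 1)

# Value labels in the same shape as the `labels` attribute shown above:
# a named numeric vector, with names taken from that output.
value_labels <- c("E.E.T." = 1, "Escuela Juan Bautista Alberdi" = 2)

# Map each code to its label by building a factor.
polling_place <- factor(codes,
                        levels = value_labels,
                        labels = names(value_labels))
as.character(polling_place)
```

With the haven package installed, haven::as_factor() is one way to do a similar conversion directly on a labelled column.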
Users should pick .f and original based on their existing knowledge of the file. If the original file is a .sav SPSS file, .f can be haven::read_sav. If it is a .Rds file, use readRDS or readr::read_rds. In fact, because the raw data is read in as a binary, there is no limitation on the file types get_dataframe_* can read in, as far as the dataverse package is concerned.
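One possible way to organize this choice is a small lookup from file extension to reader function. The helper below, pick_reader(), is hypothetical and not part of the dataverse package. The haven readers are only looked up when a .dta or .sav extension is requested, so the local check with an .rds file runs with base R alone.

```r
# Hypothetical helper: choose a reader function from a file extension.
pick_reader <- function(original_name) {
  ext <- tolower(tools::file_ext(original_name))
  switch(ext,
    dta = haven::read_dta,   # Stata (needs the haven package)
    sav = haven::read_sav,   # SPSS (needs the haven package)
    rds = readRDS,           # serialized R object (base R)
    stop("No reader chosen for extension: ", ext)
  )
}

# Local check with an .rds file, which needs no extra packages.
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:3), rds_path)

reader <- pick_reader(rds_path)
roundtrip <- reader(rds_path)
```

The returned function could then be supplied as .f together with original = TRUE, provided the original file's extension is known.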
There are two more ways to read in a dataframe other than get_dataframe_by_name().
get_dataframe_by_doi() takes a file-specific DOI, if Dataverse provides one, such as https://doi.org/10.7910/DVN/ARKOTI/IJPVOI. This removes the need to set the dataset argument.
get_dataframe_by_id() takes a numeric Dataverse identification number. This identifier is an internal number and is not prominently featured in the interface.
In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files.
We may also want to retrieve some basic metadata about the dataset. The get_dataset() function lists all of the files in the dataset along with a considerable amount of metadata about each. (Recall that in Dataverse, a dataset is a collection of files, not a single file.) We can take a quick look at these files using:
dataset <- get_dataset("doi:10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu")
dataset$files[c("filename", "contentType")]
This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files.
You can also retrieve more extensive metadata using dataset_metadata():
## List of 3
## $ displayName: chr "Citation Metadata"
## $ name : chr "citation"
## $ fields :'data.frame': 7 obs. of 4 variables:
## ..$ typeName : chr [1:7] "title" "author" "datasetContact" "dsDescription" ...
## ..$ multiple : logi [1:7] FALSE TRUE TRUE TRUE TRUE FALSE ...
## ..$ typeClass: chr [1:7] "primitive" "compound" "compound" "compound" ...
## ..$ value :List of 7
If the file you want to retrieve is not data, you may want to use the more primitive function get_file(), which returns the file contents as raw binary data. See the examples on the get_file() help page, which use the base::writeBin() function, for details on how to write and read these binary files.
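The write step can be sketched without any download. Below, a raw vector built with charToRaw() stands in for the bytes a download would return; that stand-in is an assumption for illustration, and the rest uses only base R.

```r
# Stand-in for downloaded raw bytes (here: the text of a tiny R script).
raw_bytes <- charToRaw("x <- 1 + 1\n")

# Write the bytes to disk unchanged.
out_path <- tempfile(fileext = ".R")
writeBin(raw_bytes, out_path)

# Read them back as raw bytes and confirm nothing changed.
read_back <- readBin(out_path, what = "raw", n = file.size(out_path))
identical(raw_bytes, read_back)
```

Once the file is on disk, it can be opened with whatever tool suits its type, for example readLines() for a script.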