--- title: "3. Data Download" date: "`r Sys.Date()`" output: html_vignette: fig_caption: false vignette: > %\VignetteIndexEntry{3. Data Download} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r knitr_options, echo=FALSE, results="hide"} options(width = 120) knitr::opts_chunk$set(results = "hold") ``` This vignette shows how to download data from Dataverse using the dataverse package. We'll focus on a Dataverse repository that contains supplemental files for the book _Political Analysis Using R_ (2015), which is stored at Harvard's Dataverse Server (). The Dataverse entry for this study is persistently retrievable by a "Digital Object Identifier (DOI)": https://doi.org/10.7910/DVN/ARKOTI and the citation on the Dataverse Page includes a "[Universal Numeric Fingerprint (UNF)](https://guides.dataverse.org/en/latest/developers/unf/index.html)": `UNF:6:+itU9hcUJ8I9E0Kqv8HWHg==`, which provides a versioned, multi-file hash for the entire study, which contains 32 files. The following examples will draw from the Harvard Dataverse, so it is convenient to set this as a default environment variable. ```{r} Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu") ``` This is equivalent to setting `server = "dataverse.harvard.edu"` in every `dataverse` function each time. Note that if you set an environment variable like the above, that operation is necessary to make your code reproducible on a different machine. For downloading a public dataset, _no API Key is needed_. ## Retrieving Plain-Text Data We will download public data files and examine them directly in R using the **dataverse** package. ```{r} library("dataverse") library("tibble") # to see dataframes in tidyverse-form ``` First, we retrieve a plain-text file like this dataset on electricity consumption by [Wakiyama et al. (2014)](https://doi.org/10.7910/DVN/ARKOTI/GN1MRT). Taking the file name and dataset DOI from this entry, ```{r, eval=FALSE} energy <- get_dataframe_by_name( filename = "comprehensiveJapanEnergy.tab", dataset = "10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu") ``` ```{r, eval=FALSE} head(energy) ``` ```{r} ## # A tibble: 6 × 10 ## time date dummy temp temp2 all large house kepco tepco ## ## 1 1 8-Jan 0 5.9 34.8 95792389 35194957 26190714 13357735 26960899 ## 2 2 8-Feb 0 5.5 30.3 95156901 35322031 24224097 13315027 27189705 ## 3 3 8-Mar 0 10.7 114. 91034047 36474192 21391965 12805831 24495519 ## 4 4 8-Apr 0 14.7 216. 84087552 34949622 18494473 11494328 23540356 ## 5 5 8-May 0 18.5 342. 82742929 35417089 17923760 11589061 22848737 ## 6 6 8-Jun 0 21.3 454. 82180013 36692291 15205229 11360771 22487441 ``` These `get_dataframe_*` functions, introduced in v0.3.0, directly read in the data into a R environment through whatever R function supplied by `.f`. The default of the `get_dataframe_*` functions is to read in such data by `readr::read_tsv()`. The `.f` function can be modified to modify the read-in settings. For example, the following modification is a base-R equivalent to read in the ingested data. ```{r, eval=FALSE} library(readr) energy <- get_dataframe_by_name( filename = "comprehensiveJapanEnergy.tab", dataset = "10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu", .f = function(x) read.delim(x, sep = "\t")) head(energy) ``` ```{r} ## time date dummy temp temp2 all large house kepco tepco ## 1 1 8-Jan 0 5.9 34.8 95792389 35194957 26190714 13357735 26960899 ## 2 2 8-Feb 0 5.5 30.3 95156901 35322031 24224097 13315027 27189705 ## 3 3 8-Mar 0 10.7 114.5 91034047 36474192 21391965 12805831 24495519 ## 4 4 8-Apr 0 14.7 216.1 84087552 34949622 18494473 11494328 23540356 ## 5 5 8-May 0 18.5 342.3 82742929 35417089 17923760 11589061 22848737 ## 6 6 8-Jun 0 21.3 453.7 82180013 36692291 15205229 11360771 22487441 ``` The dataverse package can also download datasets that are _drafts_ (i.e. versions not released publicly), as long as the user of the dataset provides their appropriate DATAVERSE_KEY. Users may need to modify the metadata of a datafile, such as adding a descriptive label, for the data downloading to work properly in this case. This is because the the file identifier UNF, which the read function relies on, may only appear after metadata has been added. ## Caching large datasets As of v0.3.15, datasets are _cached_ on your computer if the user specifies a version of the dataset. The next time the code is run, the function will read from the cache rather than re-downloading from the Dataverse. Version specification can be done, e.g., by setting `version = "3"` for V3, for instance. This is useful to avoid re-downloading the identical dataset every time, especially if they take some time to download. To turn off or view the settings of caching, see `cache_dataset()`. ## Retrieving Custom Data Formats (RDS, Stata, SPSS) If a file is displayed on dataverse as a `.tab` file like the survey data by [Alvarez et al. (2013)](https://doi.org/10.7910/DVN/ARKOTI/A8YRMP), it is likely that Dataverse [ingested](https://guides.dataverse.org/en/latest/user/tabulardataingest/index.html) the file to a plain-text, tab-delimited format. ```{r, message=FALSE,eval=FALSE} argentina_tab <- get_dataframe_by_name( filename = "alpl2013.tab", dataset = "10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu") ``` However, ingested files may not retain important dataset attributes. For example, Stata and SPSS datasets encode value labels on to numeric values. Factor variables in R dataframes encode levels, not only labels. A plain-text ingested file will discard such information. For example, the `polling_place` variable in this data is only given by numbers, although the original data labelled these numbers with informative values. ```{r,eval=FALSE} str(argentina_tab$polling_place) ``` ```{r} ## num [1:1475] 31 31 31 31 31 31 31 31 31 31 ... ``` When ingesting, Dataverse retains a `original` version that retains these attributes but may not be readable in some platforms. The `get_dataframe_*` functions have an argument that can be set to `original = TRUE`. In this case we know that `alpl2013.tab` was originally a Stata dta file, so we can run: ```{r, eval=FALSE} argentina_dta <- get_dataframe_by_name( filename = "alpl2013.tab", dataset = "10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu", original = TRUE, .f = haven::read_dta) ``` Now we see that labels are read in through `haven`'s labelled variables class: ```{r, eval=FALSE} str(argentina_dta$polling_place) ``` ```{r} ## dbl+lbl [1:1475] 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3... ## @ label : chr "polling_place" ## @ format.stata: chr "%9.0g" ## @ labels : Named num [1:37] 1 2 3 4 5 6 7 8 9 10 ... ## ..- attr(*, "names")= chr [1:37] "E.E.T." "Escuela Juan Bautista Alberdi" "Escuela Juan Carlos Dávalos" "Escuela Bernardino de Rivadavia" ... ``` Users should pick `.f` and `original` based on their existing knowledge of the file. If the original file is a `.sav` SPSS file, `.f` can be `haven::read_sav`. If it is a `.Rds` file, use `readRDS` or `readr::read_rds`. In fact, because the raw data is read in as a binary, there is no limitation to the file types `get_dataframe_*` can read in, as far as the dataverse package is concerned. There are two more ways to read in a dataframe other than `get_dataframe_by_name()`. * `get_dataframe_by_doi()` takes in a file-specific DOI if Dataverse contains one such as . This removes the necessity for users to set the `dataset` argument. * `get_dataframe_by_id()` takes a numeric Dataverse identification number. This identifier is an internal number and is not prominently featured in the interface. In addition to visual inspection, we can compare the UNF signatures for each dataset against what is reported by Dataverse to confirm that we received the correct files. ## Retrieving Metadata We may also want to retrieve some basic metadata about the dataset. The `get_dataset()` function lists all of the files in the dataset along with a considerable amount of metadata about each. (Recall that in Dataverse, `dataset` is a collection of files, not a single file.) We can see a quick glance at these files using: ```r dataset <- get_dataset("doi:10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu") dataset$files[c("filename", "contentType")] ``` This shows that there are indeed 32 files, a mix of .R code files and tab- and comma-separated data files. You can also retrieve more extensive metadata using `dataset_metadata()`: ```{r, eval=FALSE} str(dataset_metadata("10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu"), max.level = 2) ``` ```{r} ## List of 3 ## $ displayName: chr "Citation Metadata" ## $ name : chr "citation" ## $ fields :'data.frame': 7 obs. of 4 variables: ## ..$ typeName : chr [1:7] "title" "author" "datasetContact" "dsDescription" ... ## ..$ multiple : logi [1:7] FALSE TRUE TRUE TRUE TRUE FALSE ... ## ..$ typeClass: chr [1:7] "primitive" "compound" "compound" "compound" ... ## ..$ value :List of 7 ``` ## Retrieving Scripts and Other Files If the file you want to retrieve is not data, you may want to use the more primitive function, `get_file`, which gets the file data as a raw binary file. See the help page examples of `get_file()` that use the `base::writeBin()` function for details on how to write and read these binary files instead. ```{r, eval = FALSE} code3 <- get_file("chapter03.R", "doi:10.7910/DVN/ARKOTI", server = "dataverse.harvard.edu") writeBin(code3, "chapter03.R") ```