---
title: "`HDF5DataFrame`"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
package: "`r BiocStyle::pkg_ver('HDF5DataFrame')`"
author:
- name: Artur Manukyan
output:
  BiocStyle::html_document
vignette: |
  %\VignetteIndexEntry{HDF5DataFrame}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r load-libs}
library(rhdf5)
library(HDF5Array)
library(HDF5DataFrame)
```

# Introduction

`r Biocpkg("HDF5DataFrame")` is an R/Bioconductor package for HDF5-backed
DataFrame objects and methods. Each column of a data frame is stored as a
separate one-dimensional array in an HDF5 file. `r Biocpkg("HDF5DataFrame")`
organizes these arrays and serves them as a `DataFrame`-like object.

Common data frame operations such as subsetting and column binding are
performed lazily, which keeps them memory-efficient and makes `HDF5DataFrame`
suitable for large datasets.

# Installation

You can install `r Biocpkg("HDF5DataFrame")` from Bioconductor using
`r CRANpkg("BiocManager")`:

```{r bioc, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("HDF5DataFrame")
```

# HDF5DataFrame

We use `writeHDF5DataFrame` to write a `data.frame` to an HDF5 file. You can
also specify the HDF5 group (`name`) where the columns of the data frame are
stored.

```{r read}
# data
data("chickwts")
df <- chickwts

# create HDF5 and write an HDF5DataFrame
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df, filepath = hdf5_file)
df_hdf5
```

The `path` method returns the path of the HDF5 file in which the data frame is
stored.

```{r path}
path(df_hdf5)
```

Each column is stored as a separate one-dimensional dataset in the HDF5 file.

```{r h5ls}
h5ls(hdf5_file)
```

If the HDF5 file already contains a set of one-dimensional arrays of the same
length, one can construct an `HDF5DataFrame` object directly.

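
As a rough illustration of that precondition, the sketch below uses `rhdf5`
directly to write two equal-length one-dimensional datasets into a fresh file.
The file `arrays_file` and the dataset names `height` and `label` are
hypothetical, chosen only for this example. Whether `HDF5DataFrame()` accepts
every such hand-written file (for example, one with mixed column types) is an
assumption that is not checked here.

```{r manual-arrays}
library(rhdf5)

# hypothetical file, kept separate from `hdf5_file` used in the vignette
arrays_file <- tempfile(fileext = ".h5")
h5createFile(arrays_file)

# two one-dimensional arrays of the same length, one per would-be column
h5write(as.numeric(1:5), file = arrays_file, name = "height")
h5write(letters[1:5], file = arrays_file, name = "label")

# list the datasets to check that both share the same `dim`
h5ls(arrays_file)
```

The file written by `writeHDF5DataFrame` earlier already has this layout, so it
is the one reused next.
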
```{r HDF5DataFrame}
df_hdf5 <- HDF5DataFrame(filepath = hdf5_file)
df_hdf5
```

An `HDF5DataFrame` can also be written to a group within an HDF5 file. We use
the `name` argument to specify the group, so that the data frame can coexist
with other groups and arrays in the same HDF5 file.

```{r group}
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df,
                              filepath = hdf5_file,
                              name = "df",
                              replace = TRUE)
df_hdf5
```

If we now list the datasets in the HDF5 file, the DataFrame columns appear
under the `df` group.

```{r h5lsgroup}
h5ls(hdf5_file)
```

# HDF5DataFrame Methods

You can use various `DataFrame` methods on `HDF5DataFrame` objects. For
example, `cbind` combines an in-memory `data.frame` with an `HDF5DataFrame`
object.

```{r methods}
# cbind with in memory data
df_hdf5_bind <- cbind(df_hdf5, df)
df_hdf5_bind
```

Indexing in `HDF5DataFrame` is performed lazily, without loading the data into
memory.

```{r lazy}
df_hdf5[1:10, ]
```

Other forms of indexing, by row and column or with `drop = FALSE`, work on
`HDF5DataFrame` objects as well.

```{r lazy_other}
df_hdf5[1:10, 1:2]
df_hdf5[1:10, 1, drop = FALSE]
```

Selecting a single column returns an `HDF5ColumnVector` object, which
represents a single array of the `HDF5DataFrame` stored in the HDF5 file. It
can be loaded into memory at any time.

```{r lazy_other2}
df_hdf5[, 1]
df_hdf5[["weight"]]
as.vector(df_hdf5[["weight"]])
```

# Coercion

Coercion methods between `HDF5DataFrame` and other data structures are defined.

* from vector

```{r coercion}
as(df[["weight"]], "HDF5DataFrame")
```

* from list

```{r coercion2}
as(list(a=1:5, b=6:10), "HDF5DataFrame")
```

You can also realize the `HDF5DataFrame` object as an in-memory `data.frame`
at any time.

```{r methods2}
# either call realizes the data in memory
a <- as.data.frame(df_hdf5)
a <- as(df_hdf5, "data.frame")
head(a)
```

# Session info

```{r sessionInfo, echo=FALSE}
sessionInfo()
```