HDF5DataFrame

library(rhdf5)
library(HDF5Array)
library(HDF5DataFrame)

Introduction

HDF5DataFrame is an R/Bioconductor package for HDF5-backed DataFrame objects and methods. Each column of a data frame is stored as a separate one-dimensional array in an HDF5 file. HDF5DataFrame organizes these arrays and serves them as a DataFrame-like object. Common data frame operations such as subsetting and column binding are performed lazily, which keeps memory use low and makes the package suitable for large datasets.

Installation

You can install HDF5DataFrame from Bioconductor using BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("HDF5DataFrame")

HDF5DataFrame

We use writeHDF5DataFrame to write a data.frame to an HDF5 file. In the call below no group is given and the columns end up at the root of the file; the name argument, covered later in this section, selects an HDF5 group instead.

# data
data("chickwts")
df <- chickwts

# create HDF5 and write an HDF5DataFrame
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df, filepath = hdf5_file)
df_hdf5
## HDF5DataFrame with 71 rows and 2 columns
##                 weight               feed
##     <HDF5ColumnVector> <HDF5ColumnVector>
## 1                  179          horsebean
## 2                  160          horsebean
## 3                  136          horsebean
## 4                  227          horsebean
## 5                  217          horsebean
## ...                ...                ...
## 67                 359             casein
## 68                 216             casein
## 69                 222             casein
## 70                 283             casein
## 71                 332             casein

The path method returns the path of the HDF5 file where the data frame is stored.

path(df_hdf5)
## [1] "/tmp/RtmpV4LdVq/filec4377ae0f14.h5"

Each column is stored as a separate one-dimensional dataset in the HDF5 file.

h5ls(hdf5_file)
##   group   name       otype dclass dim
## 0     /   feed H5I_DATASET STRING  71
## 1     / weight H5I_DATASET  FLOAT  71
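Because every column is an ordinary HDF5 dataset, it can also be read without HDF5DataFrame. The sketch below is a minimal illustration using only rhdf5; the file is a fresh temporary file and the three-row columns are made-up stand-ins for the chickwts data, not the file written above.

```r
library(rhdf5)

# Hypothetical file with two one-dimensional datasets of equal length
f <- tempfile(fileext = ".h5")
h5createFile(f)
h5write(c(179, 160, 136), f, "weight")
h5write(c("horsebean", "horsebean", "casein"), f, "feed")

# List the datasets; rhdf5:: is spelled out because h5mread also exports h5ls
listing <- rhdf5::h5ls(f)
listing$name

# Read one column back into memory as a plain vector
weight <- as.vector(h5read(f, "weight"))
weight
```

A file laid out like this is the kind of input the HDF5DataFrame constructor expects: one dataset per column, all of the same length.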

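If the HDF5 file already contains a set of one-dimensional arrays of the same length, an HDF5DataFrame object can be constructed from it directly. Note that the columns below appear in the order in which the datasets are listed in the file (feed, then weight).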

df_hdf5 <- HDF5DataFrame(filepath = hdf5_file)
df_hdf5
## HDF5DataFrame with 71 rows and 2 columns
##                   feed             weight
##     <HDF5ColumnVector> <HDF5ColumnVector>
## 1            horsebean                179
## 2            horsebean                160
## 3            horsebean                136
## 4            horsebean                227
## 5            horsebean                217
## ...                ...                ...
## 67              casein                359
## 68              casein                216
## 69              casein                222
## 70              casein                283
## 71              casein                332

An HDF5DataFrame can also be written to a group within an HDF5 file. The name argument specifies the group, so an HDF5DataFrame can coexist with other groups and arrays in the same HDF5 file.

hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df, 
                              filepath = hdf5_file, 
                              name = "df",
                              replace = TRUE)
df_hdf5
## HDF5DataFrame with 71 rows and 2 columns
##                 weight               feed
##     <HDF5ColumnVector> <HDF5ColumnVector>
## 1                  179          horsebean
## 2                  160          horsebean
## 3                  136          horsebean
## 4                  227          horsebean
## 5                  217          horsebean
## ...                ...                ...
## 67                 359             casein
## 68                 216             casein
## 69                 222             casein
## 70                 283             casein
## 71                 332             casein

Listing the contents of the HDF5 file now shows the DataFrame columns stored under the df group.

h5ls(hdf5_file)
##   group   name       otype dclass dim
## 0     /     df   H5I_GROUP           
## 1   /df   feed H5I_DATASET STRING  71
## 2   /df weight H5I_DATASET  FLOAT  71
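To illustrate how a group keeps the data frame columns apart from other content, here is a small sketch that builds a comparable layout by hand with rhdf5. The counts dataset and the three-row columns are hypothetical and only serve to show the resulting listing.

```r
library(rhdf5)

f <- tempfile(fileext = ".h5")
h5createFile(f)

# Group holding the data frame columns (names mirror the example above)
h5createGroup(f, "df")
h5write(c(179, 160, 136), f, "df/weight")
h5write(c("horsebean", "horsebean", "casein"), f, "df/feed")

# Hypothetical unrelated dataset stored next to the group
h5write(matrix(1:6, nrow = 2), f, "counts")

# The columns are listed under /df, the extra dataset under /
listing <- rhdf5::h5ls(f)
listing[, c("group", "name", "otype")]
```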

HDF5DataFrame Methods

Standard DataFrame methods work on HDF5DataFrame objects. For example, cbind combines an HDF5DataFrame with an in-memory data.frame; the result is a regular DataFrame that mixes HDF5-backed and in-memory columns.

# cbind with in memory data
df_hdf5_bind <- cbind(df_hdf5, df)
df_hdf5_bind
## DataFrame with 71 rows and 4 columns
##                 weight               feed    weight      feed
##     <HDF5ColumnVector> <HDF5ColumnVector> <numeric>  <factor>
## 1                  179          horsebean       179 horsebean
## 2                  160          horsebean       160 horsebean
## 3                  136          horsebean       136 horsebean
## 4                  227          horsebean       227 horsebean
## 5                  217          horsebean       217 horsebean
## ...                ...                ...       ...       ...
## 67                 359             casein       359    casein
## 68                 216             casein       216    casein
## 69                 222             casein       222    casein
## 70                 283             casein       283    casein
## 71                 332             casein       332    casein

Indexing an HDF5DataFrame is performed lazily, without loading the data into memory. The columns of the result are DelayedArray objects.

df_hdf5[1:10, ]
## DataFrame with 10 rows and 2 columns
##            weight           feed
##    <DelayedArray> <DelayedArray>
## 1             179      horsebean
## 2             160      horsebean
## 3             136      horsebean
## 4             227      horsebean
## 5             217      horsebean
## 6             168      horsebean
## 7             108      horsebean
## 8             124      horsebean
## 9             143      horsebean
## 10            140      horsebean

Row and column indices can be combined as with any DataFrame, and drop = FALSE keeps a single selected column as a DataFrame.

df_hdf5[1:10, 1:2]
## DataFrame with 10 rows and 2 columns
##            weight           feed
##    <DelayedArray> <DelayedArray>
## 1             179      horsebean
## 2             160      horsebean
## 3             136      horsebean
## 4             227      horsebean
## 5             217      horsebean
## 6             168      horsebean
## 7             108      horsebean
## 8             124      horsebean
## 9             143      horsebean
## 10            140      horsebean
df_hdf5[1:10, 1, drop = FALSE]
## DataFrame with 10 rows and 1 column
##            weight
##    <DelayedArray>
## 1             179
## 2             160
## 3             136
## 4             227
## 5             217
## 6             168
## 7             108
## 8             124
## 9             143
## 10            140
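The subsetted columns above are reported as DelayedArray, which is what makes the indexing lazy. As a general illustration of that mechanism (not of HDF5DataFrame internals), the sketch below wraps a small in-memory matrix in a DelayedArray; the matrix m is made up for the example.

```r
library(DelayedArray)

# Hypothetical in-memory data wrapped as a DelayedArray
m <- matrix(1:20, nrow = 5)
x <- DelayedArray(m)

# Subsetting records the operation and returns another DelayedArray
sub <- x[1:3, ]
is(sub, "DelayedArray")

# The values are only materialized when explicitly requested
realized <- as.matrix(sub)
dim(realized)
```

With an HDF5-backed seed the same pattern applies, except that realization reads the requested elements from the file.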

Selecting a single column returns an HDF5ColumnVector, the class that represents one array of an HDF5DataFrame stored in the HDF5 file. It can be loaded into memory at any time with as.vector.

df_hdf5[, 1]
## <71> HDF5ColumnVector object of type "double":
##  [1]  [2]  [3]    . [70] [71] 
##  179  160  136    .  283  332
df_hdf5[["weight"]]
## <71> HDF5ColumnVector object of type "double":
##  [1]  [2]  [3]    . [70] [71] 
##  179  160  136    .  283  332
as.vector(df_hdf5[["weight"]])
##  [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
## [20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340
## [39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
## [58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332

Coercion

Coercion methods between HDF5DataFrame and other data structures are defined.

  • from vector
as(df[["weight"]], "HDF5DataFrame")
## HDF5DataFrame with 71 rows and 1 column
##                      X
##     <HDF5ColumnVector>
## 1                  179
## 2                  160
## 3                  136
## 4                  227
## 5                  217
## ...                ...
## 67                 359
## 68                 216
## 69                 222
## 70                 283
## 71                 332
  • from list
as(list(a=1:5, b=6:10), "HDF5DataFrame")
## HDF5DataFrame with 5 rows and 2 columns
##                    a                  b
##   <HDF5ColumnVector> <HDF5ColumnVector>
## 1                  1                  6
## 2                  2                  7
## 3                  3                  8
## 4                  4                  9
## 5                  5                 10

You can also realize the HDF5DataFrame object as an in-memory data.frame at any time.

# two equivalent ways to load the data into memory
a <- as.data.frame(df_hdf5)
a <- as(df_hdf5, "data.frame")
head(a)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
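The h5ls listings earlier show feed stored with dclass STRING, so it is worth checking how a factor column comes back after a round trip. The sketch below uses only calls that appear in this vignette; the explicit conversion back to a factor is a suggestion for when factor levels matter, not documented package behavior.

```r
library(HDF5DataFrame)

# Write chickwts to a fresh temporary file and read it back
df <- chickwts
f <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df, filepath = f)
a <- as.data.frame(df_hdf5)

# Numeric values should match the original column
all(as.vector(a$weight) == df$weight)

# Inspect how the factor column came back
class(a$feed)

# Restore the original levels explicitly if a factor is needed
a$feed <- factor(as.vector(a$feed), levels = levels(df$feed))
```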

Session info

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] HDF5DataFrame_0.99.3  HDF5Array_1.41.0      h5mread_1.5.0        
##  [4] DelayedArray_0.39.3   SparseArray_1.13.2    S4Arrays_1.13.0      
##  [7] IRanges_2.47.2        abind_1.4-8           S4Vectors_0.51.5     
## [10] MatrixGenerics_1.25.0 matrixStats_1.5.0     BiocGenerics_0.59.8  
## [13] generics_0.1.4        Matrix_1.7-5          rhdf5_2.57.1         
## [16] BiocStyle_2.41.0     
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_2.0.0      compiler_4.6.1      BiocManager_1.30.27
##  [4] rhdf5filters_1.25.0 jquerylib_0.1.4     yaml_2.3.12        
##  [7] fastmap_1.2.0       lattice_0.22-9      R6_2.6.1           
## [10] XVector_0.53.0      knitr_1.51          maketools_1.3.2    
## [13] bslib_0.11.0        rlang_1.2.0         cachem_1.1.0       
## [16] xfun_0.59           sass_0.4.10         sys_3.4.3          
## [19] otel_0.2.0          cli_3.6.6           Rhdf5lib_2.1.0     
## [22] digest_0.6.39       grid_4.6.1          lifecycle_1.0.5    
## [25] evaluate_1.0.5      buildtools_1.0.0    rmarkdown_2.31     
## [28] tools_4.6.1         htmltools_0.5.9