HDF5DataFrame
HDF5DataFrame is an R/Bioconductor package that provides HDF5-backed
DataFrame objects and methods. Each column of a data frame is stored as
a separate one-dimensional array in an HDF5 file. HDF5DataFrame
organizes these arrays and serves them as a DataFrame-like
object. Common data frame operations such as subsetting and column
binding are performed lazily, which keeps them memory-efficient and
makes the package suitable for large datasets.
You can install HDF5DataFrame from Bioconductor using BiocManager:
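A typical BiocManager installation looks like the sketch below. The package name comes from the text above; the guard that installs BiocManager from CRAN first is the usual Bioconductor convention and is an assumption here, not something the original shows.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install HDF5DataFrame from Bioconductor
BiocManager::install("HDF5DataFrame")
```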
We use writeHDF5DataFrame to write a
data.frame to an HDF5 file. You can also specify the HDF5
group (name) where columns of the data frame are
stored.
# data
data("chickwts")
df <- chickwts
# create HDF5 and write an HDF5DataFrame
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df, filepath = hdf5_file)
df_hdf5
## HDF5DataFrame with 71 rows and 2 columns
## weight feed
## <HDF5ColumnVector> <HDF5ColumnVector>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## ... ... ...
## 67 359 casein
## 68 216 casein
## 69 222 casein
## 70 283 casein
## 71 332 casein
The path method returns the location of the HDF5 file in which the
data frame is stored.
## [1] "/tmp/RtmpV4LdVq/filec4377ae0f14.h5"
Each column is stored as a separate one-dimensional dataset in the HDF5 file.
## group name otype dclass dim
## 0 / feed H5I_DATASET STRING 71
## 1 / weight H5I_DATASET FLOAT 71
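The path and the listing above can probably be reproduced along these lines. This is a sketch rather than the original chunk: `path()` is the method named in the text, and `rhdf5::h5ls()` is assumed to be the listing function because its output columns match the ones shown. The variable names `stored_at` and `listing` are illustrative.

```r
library(HDF5DataFrame)

# Write the chickwts data frame to a temporary HDF5 file
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(chickwts, filepath = hdf5_file)

# Ask the object where its backing file lives
stored_at <- path(df_hdf5)
stored_at

# List the contents of the file; the rhdf5:: prefix is spelled out
# because h5mread masks h5ls when both packages are attached
listing <- rhdf5::h5ls(hdf5_file)
listing
```

Each row of `listing` describes one dataset, so a data frame with two columns yields two rows of equal `dim`.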
If the HDF5 file already contains a set of one-dimensional arrays
of the same length, you can construct an HDF5DataFrame object
directly from them.
## HDF5DataFrame with 71 rows and 2 columns
## feed weight
## <HDF5ColumnVector> <HDF5ColumnVector>
## 1 horsebean 179
## 2 horsebean 160
## 3 horsebean 136
## 4 horsebean 227
## 5 horsebean 217
## ... ... ...
## 67 casein 359
## 68 casein 216
## 69 casein 222
## 70 casein 283
## 71 casein 332
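The precondition stated above, a set of one-dimensional arrays of equal length, can be checked with rhdf5 alone. The sketch below builds such a file by hand and verifies it; the dataset names and values are made up for illustration. The constructor call itself is left out because its arguments are not shown in the text.

```r
library(rhdf5)

# Create an HDF5 file holding two one-dimensional datasets of equal length
h5_path <- tempfile(fileext = ".h5")
h5createFile(h5_path)
h5write(c(1.5, 2.5, 3.5), h5_path, "weight")
h5write(c("a", "b", "c"), h5_path, "feed")

# Every dataset must report the same length before the file
# can be treated as a set of data frame columns
contents <- h5ls(h5_path)
same_length <- length(unique(contents$dim)) == 1
same_length
```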
An HDF5DataFrame
can also be written to a group within an HDF5 file. The
name argument specifies the group, so an HDF5DataFrame
can coexist with other groups and arrays in the same HDF5
file.
hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(df,
                              filepath = hdf5_file,
                              name = "df",
                              replace = TRUE)
df_hdf5
## HDF5DataFrame with 71 rows and 2 columns
## weight feed
## <HDF5ColumnVector> <HDF5ColumnVector>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## ... ... ...
## 67 359 casein
## 68 216 casein
## 69 222 casein
## 70 283 casein
## 71 332 casein
Listing the datasets in the HDF5 file now shows that the DataFrame
columns are stored under the df group.
## group name otype dclass dim
## 0 / df H5I_GROUP
## 1 /df feed H5I_DATASET STRING 71
## 2 /df weight H5I_DATASET FLOAT 71
Various DataFrame methods work on HDF5DataFrame objects. For example,
cbind combines an in-memory data.frame with an
HDF5DataFrame object; the result is a DataFrame that mixes
HDF5-backed and in-memory columns.
## DataFrame with 71 rows and 4 columns
## weight feed weight feed
## <HDF5ColumnVector> <HDF5ColumnVector> <numeric> <factor>
## 1 179 horsebean 179 horsebean
## 2 160 horsebean 160 horsebean
## 3 136 horsebean 136 horsebean
## 4 227 horsebean 227 horsebean
## 5 217 horsebean 217 horsebean
## ... ... ... ... ...
## 67 359 casein 359 casein
## 68 216 casein 216 casein
## 69 222 casein 222 casein
## 70 283 casein 283 casein
## 71 332 casein 332 casein
Indexing an HDF5DataFrame is performed lazily, without
loading the data into memory.
## DataFrame with 10 rows and 2 columns
## weight feed
## <DelayedArray> <DelayedArray>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
The usual forms of row and column indexing can also be used on
HDF5DataFrame objects.
## DataFrame with 10 rows and 2 columns
## weight feed
## <DelayedArray> <DelayedArray>
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## DataFrame with 10 rows and 1 column
## weight
## <DelayedArray>
## 1 179
## 2 160
## 3 136
## 4 227
## 5 217
## 6 168
## 7 108
## 8 124
## 9 143
## 10 140
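The three results above match standard DataFrame indexing. The calls below are a sketch inferred from the printed dimensions and column names, not the original chunks; in particular `drop = FALSE` is an assumption made so that a single column is kept as a one-column DataFrame.

```r
library(HDF5DataFrame)

hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(chickwts, filepath = hdf5_file)

# First ten rows, all columns
first_rows <- df_hdf5[1:10, ]

# First ten rows, columns selected by name
both_cols <- df_hdf5[1:10, c("weight", "feed")]

# First ten rows of a single column, kept as a one-column DataFrame
one_col <- df_hdf5[1:10, "weight", drop = FALSE]

dim(first_rows)
dim(both_cols)
dim(one_col)
```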
Selecting a single column returns an
HDF5ColumnVector object, the class that represents a single
array of an HDF5DataFrame stored in the HDF5 file. It can
be loaded into memory at any time.
## <71> HDF5ColumnVector object of type "double":
## [1] [2] [3] . [70] [71]
## 179 160 136 . 283 332
## <71> HDF5ColumnVector object of type "double":
## [1] [2] [3] . [70] [71]
## 179 160 136 . 283 332
## [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
## [20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340
## [39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
## [58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332
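A sketch of column extraction follows. It assumes the two identical printouts above come from `$` and `[[`, and that `as.vector()` is what loads the column into memory; the original chunks are not shown.

```r
library(HDF5DataFrame)

hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(chickwts, filepath = hdf5_file)

# Both forms select the same HDF5-backed column
weight_col <- df_hdf5$weight
weight_col2 <- df_hdf5[["weight"]]

# Realize the column as an ordinary numeric vector
weight_mem <- as.vector(weight_col)
length(weight_mem)
```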
Coercion methods between HDF5DataFrame and other data
structures are defined.
## HDF5DataFrame with 71 rows and 1 column
## X
## <HDF5ColumnVector>
## 1 179
## 2 160
## 3 136
## 4 227
## 5 217
## ... ...
## 67 359
## 68 216
## 69 222
## 70 283
## 71 332
## HDF5DataFrame with 5 rows and 2 columns
## a b
## <HDF5ColumnVector> <HDF5ColumnVector>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
You can also realize an HDF5DataFrame object as an
in-memory data.frame at any time.
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
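The six rows above are consistent with `head()` applied to the realized data frame. A minimal sketch, assuming `as.data.frame()` is the realization step:

```r
library(HDF5DataFrame)

hdf5_file <- tempfile(fileext = ".h5")
df_hdf5 <- writeHDF5DataFrame(chickwts, filepath = hdf5_file)

# Pull every column into memory as a base R data.frame
df_mem <- as.data.frame(df_hdf5)
head(df_mem)
```

Realization reads all columns from disk, so for large files it is usually better to subset the HDF5DataFrame first and realize only the part that is needed.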
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] HDF5DataFrame_0.99.3 HDF5Array_1.41.0 h5mread_1.5.0
## [4] DelayedArray_0.39.3 SparseArray_1.13.2 S4Arrays_1.13.0
## [7] IRanges_2.47.2 abind_1.4-8 S4Vectors_0.51.5
## [10] MatrixGenerics_1.25.0 matrixStats_1.5.0 BiocGenerics_0.59.8
## [13] generics_0.1.4 Matrix_1.7-5 rhdf5_2.57.1
## [16] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 compiler_4.6.1 BiocManager_1.30.27
## [4] rhdf5filters_1.25.0 jquerylib_0.1.4 yaml_2.3.12
## [7] fastmap_1.2.0 lattice_0.22-9 R6_2.6.1
## [10] XVector_0.53.0 knitr_1.51 maketools_1.3.2
## [13] bslib_0.11.0 rlang_1.2.0 cachem_1.1.0
## [16] xfun_0.59 sass_0.4.10 sys_3.4.3
## [19] otel_0.2.0 cli_3.6.6 Rhdf5lib_2.1.0
## [22] digest_0.6.39 grid_4.6.1 lifecycle_1.0.5
## [25] evaluate_1.0.5 buildtools_1.0.0 rmarkdown_2.31
## [28] tools_4.6.1 htmltools_0.5.9