cluster_graph_leiden(), cluster_graph_louvain() and cluster_graph_seurat() from snn to mat to more accurately reflect the input type. (pull request #292)
cluster_cells_graph() that wraps the steps of knn object creation, graph adjacency creation, and clustering all within a single function (pull request #292)
tile_width and normalization arguments to write_insertion_bedgraph() to allow for more flexible bedgraph creation (pull request #299)
write_insertion_bed(), which originally was only a helper for peak calling (pull request #302).
plot_embedding(), resulting from the way documentation examples use nested function calls (pull request #316).
qc_scATAC() when fragments are near the start of a chromosome (pull request #320).
pseudobulk_matrix(). Currently in progress in #268.

The BPCells 0.3.1 release covers 7 months of changes and 40 commits from 5 contributors. Notable changes include writing matrices in AnnData's dense format, and methods for retrieving demo data for testing and examples. Full details of changes below.
Thanks to @ycli1995 and @mfansler for pull requests that contributed to this release, as well as to users who submitted GitHub issues to help identify and fix bugs.
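The dense AnnData writing highlighted in the release summary above can be illustrated with a round trip through a temporary file. This is a minimal sketch rather than an excerpt from the package documentation: the random input matrix, its dimension names, and the file path are made up for illustration, and it assumes that both `write_matrix_anndata_hdf5_dense()` and `open_matrix_anndata_hdf5()` default to the `X` entry of the file.

```r
library(BPCells)

# Hypothetical input: a small random sparse matrix standing in for real data
set.seed(1)
m <- Matrix::rsparsematrix(20, 5, density = 0.3)
dimnames(m) <- list(paste0("gene", 1:20), paste0("cell", 1:5))
mat <- as(m, "IterableMatrix")

# Write the matrix in AnnData's dense layout (assumed default location: "X")
path <- tempfile(fileext = ".h5ad")
write_matrix_anndata_hdf5_dense(mat, path)

# Re-open the file; dense AnnData matrices can be read back as BPCells matrices
dense <- open_matrix_anndata_hdf5(path)
dim(dense)
```

Dense storage is mainly useful for matrices with few zeros, such as embeddings; for count matrices the sparse writers will usually produce smaller files.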
write_matrix_anndata_hdf5_dense() which allows writing matrices in AnnData's dense format, most commonly used for obsm or varm matrices. (Thanks to @ycli1995 for pull request #166)
get_demo_mat(), get_demo_frags() and remove_demo_data() to retrieve a small test matrix/fragments object from the PBMC 3k dataset from 10X Genomics. (pull request #193)
matrix_stats() now also works with types matrix and dgCMatrix. (pull request #190)
writeInsertionBed() and writeInsertionBedGraph() (pull request #{118, 134})
merge_peaks_iterative(), which helps create non-overlapping peak sets. (pull request #216)
uint16_t when reading in anndata matrices using open_matrix_anndata_hdf5(). (pull request #248)
write_matrix_10x_hdf5() to use signed rather than unsigned integers for indices, indptr, and shape to improve compatibility with 10x-produced files. (Thanks to @ycli1995 for pull request #256)
cbind() and rbind() when matrices are of different types, to upcast instead of erroring out. (pull request #265)
call_peaks_macs() (pull request #175)
gene_score_archr() and gene_score_weights_archr() malfunctioning for non-default tile_width settings. (Thanks to @Baboon61 for reporting issue #185)
gene_score_archr() when chromosome_sizes argument is not sorted. (Thanks to @Baboon61 for reporting issue #188)
devtools::load_all() and BiocGenerics has been imported previously. (pull request #191)
write_insertion_bedgraph() (pull request #214)
write_matrix_hdf5() when overwriting to a .h5 file that does not exist. (pull request #234)
configure script to use a pre-installed libhwy if available during installation time. (Thanks to @mfansler for submitting PR #228)
libhwy that is too old. (pull request #288, thanks to @GerardoZA for reporting issue #285)

The BPCells 0.3.0 release covers 6 months of changes and 45 commits from 5 contributors. Notable improvements this release include support for peak calling with MACS and the addition of pseudobulk matrix and stats calculations. We also released an initial prototype of a BPCells Python library (more details here). Full details of changes below.
Thanks to @ycli1995, @Yunuuuu, and @douglasgscofield for pull requests that contributed to this release, as well as to users who submitted GitHub issues to help identify and fix bugs. We also added @immanuelazn to the team as a new hire! He is responsible for many of the new features in this release and will continue to help with maintenance and new development going forward.
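The pseudobulk calculations mentioned in the release summary above can be sketched as follows. This is an illustrative example under stated assumptions: the random count matrix, the gene and cell names, and the group labels are all hypothetical, and the `method` values used here (`"sum"`, `"mean"`, `"variance"`) are assumed to match the aggregation options of `pseudobulk_matrix()`.

```r
library(BPCells)

# Hypothetical data: 50 genes x 12 cells of random positive counts
set.seed(1)
counts <- Matrix::rsparsematrix(
  50, 12, density = 0.2,
  rand.x = function(n) rpois(n, 3) + 1
)
dimnames(counts) <- list(paste0("gene", 1:50), paste0("cell", 1:12))
mat <- as(counts, "IterableMatrix")

# One group label per cell (i.e. per column of the matrix)
cell_groups <- rep(c("A", "B", "C"), each = 4)

# A single method returns one genes x groups matrix
pb_sum <- pseudobulk_matrix(mat, cell_groups, method = "sum")

# Several methods at once return a list with one matrix per method
pb_stats <- pseudobulk_matrix(mat, cell_groups, method = c("mean", "variance"))
```

Requesting several statistics in one call lets them be computed together instead of re-reading the matrix once per statistic.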
apply_by_col() and apply_by_row() allow providing custom R functions to compute per row/col summaries. In initial tests calculating row/col means using R functions is ~2x slower than the C++-based implementation but memory usage remains low.
rowMaxs() and colMaxs() functions, which return the maximum value in each row or column of a matrix. If matrixStats or MatrixGenerics packages are installed, BPCells::rowMaxs() will fall back to their implementations for non-BPCells objects. Thanks to @immanuelazn for their first contribution as a new lab hire!
regress_out() to allow removing unwanted sources of variation via least squares linear regression models. Thanks to @ycli1995 for pull request #110
trackplot_genome_annotation() for plotting peaks, with options for directional arrows, colors, labels, and peak widths. (pull request #113)
call_peaks_macs() (pull request #118). Note, renamed from call_macs_peaks() in pull request #143
rowQuantiles() and colQuantiles() functions, which return the quantiles of each row/column of a matrix. Currently rowQuantiles() only works on row-major matrices and colQuantiles() only works on col-major matrices. If matrixStats or MatrixGenerics packages are installed, BPCells::colQuantiles() will fall back to their implementations for non-BPCells objects. (pull request #128)
pseudobulk_matrix() which allows pseudobulk aggregation by sum or mean and calculation of per-pseudobulk variance and nonzero statistics for each gene (pull request #128)
trackplot_loop() now accepts discrete color scales
trackplot_combine() now has smarter layout logic for margins, as well as detecting when plots are being combined that cover different genomic regions. (pull request #116)
select_cells() and select_chromosomes() now also allow using a logical mask for selection. (pull request #117)
LDFLAGS or CFLAGS as environment variables in addition to setting them in ~/.R/Makevars (pull request #124)
open_matrix_anndata_hdf5() now supports reading AnnData matrices in the dense format. (pull request #146)
cluster_graph_leiden() now has better defaults that produce reasonable cluster counts regardless of dataset size. (pull request #147)
trackplot_coverage() with fragments from a single cluster. (Thanks to @sjessa for directly reporting this bug and coming up with a fix)
trackplot_coverage() when called with ranges less than 500 bp in length (Thanks to @bettybliu for directly reporting this bug.)
tile_matrix() with fragment mode (pull request #141)
sctransform_pearson() on ARM architecture (pull request #141)
pseudobulk_matrix() gets an integer matrix (pull request #174)
trackplot_coverage() legend_label argument is now ignored, as the color legend is no longer shown by default for coverage plots.

We are finally declaring a new release version, covering a large amount of changes and improvements
over the past year. Among the major features here are parallelization options for svds() and
matrix_stats(), improved genomic track plots, and runtime CPU feature detection for SIMD code (enables
higher performance, more portable builds). Full details of changes below.
This version also comes with a new installation path, which is done in preparation for a future Python package release (so that we can have one folder for R and one for Python, rather than having all the R files sit in the root folder). This is a breaking change and requires a slightly modified installation command.
Thanks to @brgew, @ycli1995, and @Yunuuuu for pull requests that contributed to this release, as well as all users who submitted GitHub issues to help identify and fix bugs.
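The parallelization options for `matrix_stats()` and `svds()` named in the release summary above can be sketched as below. This is a minimal sketch on made-up data: the random matrix and the choice of two threads and five singular values are assumptions for illustration only.

```r
library(BPCells)

# Hypothetical input: a random sparse matrix wrapped as a BPCells matrix
set.seed(1)
m <- Matrix::rsparsematrix(200, 100, density = 0.1)
mat <- as(m, "IterableMatrix")

# Row variances and column means in one pass, using 2 threads
stats <- matrix_stats(
  mat,
  row_stats = "variance",
  col_stats = "mean",
  threads = 2L
)
col_means <- stats$col_stats["mean", ]

# Truncated SVD with 5 singular values, also using 2 threads
res <- svds(mat, k = 5, threads = 2L)
res$d
```

The statistics come back as matrices with one row per requested statistic, so a single statistic is extracted by its row name.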
remotes::install_github("bnprks/BPCells/r") (note the additional /r)
"subdir": "r" to their packages.json config.
all_matrix_inputs(). Outside of loading old RDS files no changes should be needed.
trackplot_gene() now returns a plot with a facet label to match the new trackplot system. This label can be removed by calling trackplot_gene(...) + ggplot2::facet_null() to be equivalent to the old function's output.
draw_trackplot_grid() deprecated, replaced by trackplot_combine() with simplified arguments
trackplot_bulk() has been deprecated, replaced by trackplot_coverage() with equivalent functionality
svds() function, based on the excellent Spectra C++ library (used in RSpectra) by Yixuan Qiu. This should ensure lower memory usage compared to irlba, while achieving similar speed + accuracy.
threads argument to matrix_stats() and svds().
marker_features() and writing a matrix to disk remain single-threaded.
svds() with many threads on gene-major matrices can result in high memory usage for now. This problem is not present for cell-major matrices.
import_matrix_market() and the convenience function import_matrix_market_10x(). Our implementation uses disk-backed sorting to allow importing large files with low memory usage.
binarize() function and associated generics <, <=, >, and >=.
This only supports comparison with non-negative numbers currently. (Thanks to a contribution from @brgew)
round() matrix transformation (Thanks to contributions from @brgew)
all_matrix_inputs() to help enable relocating the underlying storage for BPCells matrix transform objects.
gzip_level parameter, which will enable a shuffle + gzip filter for compression. This is generally much slower than bitpacking compression, but it adds improved storage options for files that must be read by outside programs. Thanks to @ycli1995 for submitting this improvement in pull request #42.
write_matrix_anndata_hdf5() (issue #49)
m1[i,j] <- m2). Note that this does not modify data on disk. Instead, it uses a series of subsetting and concatenation operations to provide the appearance of overwriting the appropriate entries.
knn_to_geodesic_graph(), which matches the Scanpy default construction for graph-based clustering
checksum(), which allows for calculating an MD5 checksum of a matrix's contents. Thanks to @brgew for submitting this improvement in pull request #83
write_insertion_bedgraph() allows exporting pseudobulk insertion data to bedgraph format
c() now handles inputs with mismatched chromosome names.
knn_to_snn_graph() should work more smoothly on large datasets due to C++ implementation
marker_features() for samples with millions of cells and a large number of clusters to compare.
[ now propagates through so we always avoid computing parts of the peak/tile matrix that have been discarded by our subset. Subsetting a tile matrix will automatically convert into a peak matrix when possible for improved efficiency.
as.matrix() will produce integer matrices when appropriate (Thanks to @Yunuuuu in pull request #77)
trackplot_combine()
trackplot_gene() now draws arrows for the direction of transcription
trackplot_loop() is a new track type that allows plotting interactions between genomic regions, for instance peak-gene correlations or loop calls from Hi-C
trackplot_scalebar() is added to show genomic scale
set_trackplot_label() and set_trackplot_height()
rowVars() and colVars() functions, as convenience wrappers around matrix_stats().
If matrixStats or MatrixGenerics packages are installed, BPCells::rowVars() will fall back to their implementations for non-BPCells objects. Unfortunately, matrixStats::rowVars() is not generic, so either BPCells::rowVars() or BPCells::colVars()
highway.
Pow SIMD implementation is removed, but Square gets a new SIMD implementation
log1p(), and sctransform_pearson()
chrNames(frags) <- val or cellNames(frags) <- val could cause downstream errors.
transpose_storage_order() for matrices with >4 billion non-zero entries.
transpose_storage_order() for matrices with no non-zero entries.
rownames() or colnames() is now propagated when saving matrices (Issue #29 reported thanks to @realzehuali, with an additional fix after report thanks to @Dario-Rocha)
marker_features() for features with more than 2.6 million zeros.
convert_matrix_type() twice in a row such that it cancels out (e.g. double -> uint32_t -> double). Thanks to @brgew for reporting issue #43
svds() not handling row-major matrices correctly. Thanks to @ycli1995 for reporting this in issue #55
[<-. Thanks to @Yunuuuu for identifying the issue #67
transpose_storage_order() on a densely-transformed matrix. Thanks to @Yunuuuu for reporting this in issue #71
readRDS() can be used from different working directories.
footprints() now respects user interrupts via Ctrl-C
+, -, *, /, and log1p for streaming normalization, along with other less common operations. This allows implementation of ATAC-seq LSI and Seurat default normalization, along with most published log-based normalizations.

Note: All operations interoperate with all storage formats. For example, all matrix operations can be applied directly to an AnnData or 10x matrix file. In many cases the bitpacking-compressed formats will provide performance/space advantages, but are not required to use the computations.
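The streaming normalization described above can be sketched as a Seurat-style log normalization applied to a disk-backed matrix. This is a hedged illustration: the random count matrix and temporary directory are hypothetical, the scale factor of 10000 is the conventional choice rather than something stated in these notes, and `multiply_cols()` is used here as the per-column scaling helper.

```r
library(BPCells)

# Hypothetical data: 100 genes x 20 cells of random positive counts
set.seed(1)
counts <- Matrix::rsparsematrix(
  100, 20, density = 0.3,
  rand.x = function(n) rpois(n, 2) + 1
)

# Store the counts in a BPCells directory; the result is a disk-backed matrix
dir <- tempfile("counts_bp")
mat <- write_matrix_dir(as(counts, "IterableMatrix"), dir)

# Divide each cell (column) by its total count, then scale and log-transform.
# These steps are queued lazily and only computed when the data is read.
norm <- multiply_cols(mat, 1 / Matrix::colSums(mat))
norm <- log1p(norm * 10000)

# Materialize in memory (only sensible for small matrices)
norm_dense <- as.matrix(as(norm, "dgCMatrix"))
```

Because the operations are recorded on top of the storage object instead of being applied eagerly, the same normalization code should carry over to matrices opened from other supported formats.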