Department of Computer Science, National Textile University, Faisalabad 37610, Pakistan.
Department of Medicine, Boston University School of Medicine, Boston, MA 02118, USA.
Bioinformatics. 2021 Sep 29;37(18):3058-3060. doi: 10.1093/bioinformatics/btab179.
R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is subsetting the data matrix before applying downstream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples, or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of the assay from the original object. However, this approach is inefficient because it requires creating an additional object that holds a copy of the original assay data, and it complicates data provenance.
To overcome these challenges, we developed ExperimentSubset, an R package that provides a data container implementing classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes inherently provide data provenance by maintaining the relationship between each subsetted assay and its parent assay. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint than creating a separate object for each subset. Overall, ExperimentSubset is a flexible container for the efficient management of subsets.
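The idea of storing subsets as references to a parent assay, rather than as copies, can be illustrated with a minimal sketch. This is a conceptual toy written in plain Python, not the ExperimentSubset R API: the class `SubsetStore`, its methods `create_subset`, `get` and `lineage`, and the example data are all hypothetical names introduced here for illustration. It assumes that a subset is fully described by its parent's name plus row and column indices, and that the subset matrix is built only when it is requested.

```python
class SubsetStore:
    """Toy container (hypothetical): assays are stored once, subsets as index references."""

    def __init__(self, assays):
        # assays: dict mapping assay name -> matrix given as a list of rows
        self.assays = dict(assays)
        # subset name -> (parent name, row indices, column indices)
        self.subsets = {}

    def create_subset(self, name, parent, rows, cols):
        """Register a subset by indices only; no matrix data is copied here."""
        if name in self.assays or name in self.subsets:
            raise ValueError(f"name already in use: {name}")
        if parent not in self.assays and parent not in self.subsets:
            raise KeyError(f"unknown parent: {parent}")
        self.subsets[name] = (parent, list(rows), list(cols))

    def get(self, name):
        """Return the matrix for an assay, or build a subset's matrix on demand."""
        if name in self.assays:
            return self.assays[name]
        parent, rows, cols = self.subsets[name]
        parent_matrix = self.get(parent)  # indices are relative to the parent
        return [[parent_matrix[r][c] for c in cols] for r in rows]

    def lineage(self, name):
        """Provenance: follow parent links from a subset back to its root assay."""
        chain = [name]
        while name in self.subsets:
            name = self.subsets[name][0]
            chain.append(name)
        return chain


# Rows are features (genes), columns are samples (cells); values are made up.
counts = [
    [5, 0, 3, 1],
    [0, 0, 1, 0],
    [7, 2, 9, 4],
]
store = SubsetStore({"counts": counts})

# Drop the second column (a stand-in for a poor-quality sample).
store.create_subset("good_cells", "counts", rows=[0, 1, 2], cols=[0, 2, 3])
# From that subset, keep the first and third rows (stand-ins for variable features).
store.create_subset("variable_genes", "good_cells", rows=[0, 2], cols=[0, 1, 2])

filtered = store.get("variable_genes")
provenance = store.lineage("variable_genes")
```

In this sketch the original matrix is held once, each subset costs only two index lists, and `lineage` recovers the chain from a subset back to the assay it was derived from. How the package itself stores and resolves subsets may differ.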
The ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.
Supplementary data are available at Bioinformatics online.