Epidemiology and Biostatistics, Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, United States.
Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, United States.
Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad330.
The RaggedExperiment R/Bioconductor package provides lossless representation of disparate genomic ranges across multiple specimens or cells, together with efficient and flexible calculation of rectangular summaries for downstream analysis. Applications include statistical analysis of somatic mutation, copy number, methylation, and open chromatin data. RaggedExperiment is compatible with multimodal data analysis as a component of MultiAssayExperiment data objects, and simplifies data representation and transformation for software developers and analysts.
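To make the idea of a rectangular summary concrete, the following is a minimal conceptual sketch in Python. It does not use or reproduce the RaggedExperiment R API; the segment values, the sample names, and the `sparse_matrix` helper are all hypothetical and chosen only for illustration. Each sample carries its own list of ranges, and the helper lays them out as a matrix with one row per distinct range and one column per sample, with NaN marking ranges a sample does not have.

```python
import math

# Hypothetical segments per sample: (chromosome, start, end, score).
ragged = {
    "sample1": [("chr1", 1, 10, 3.0), ("chr1", 8, 14, 4.0), ("chr2", 15, 18, 5.0)],
    "sample2": [("chr1", 1, 10, 1.0), ("chr2", 11, 18, 2.0)],
}


def sparse_matrix(ragged):
    """Return (rows, samples, matrix): one row per distinct range,
    one column per sample, NaN where a sample lacks that range."""
    samples = list(ragged)
    # Collect every distinct (chromosome, start, end) seen in any sample.
    rows = sorted({seg[:3] for segs in ragged.values() for seg in segs})
    index = {rng: i for i, rng in enumerate(rows)}
    matrix = [[math.nan] * len(samples) for _ in rows]
    for j, name in enumerate(samples):
        for chrom, start, end, score in ragged[name]:
            matrix[index[(chrom, start, end)]][j] = score
    return rows, samples, matrix


rows, samples, matrix = sparse_matrix(ragged)
for rng, values in zip(rows, matrix):
    print(rng, values)
```

The per-sample lists remain the lossless record; the matrix is a derived view, and in this layout most cells are empty whenever samples share few identical ranges.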
Measurements of copy number, mutation, single nucleotide polymorphism, and other genomic attributes that may be stored as VCF files produce "ragged" genomic ranges data, i.e. data defined at different genomic coordinates in each sample. Ragged data are not rectangular or matrix-like, which presents informatics challenges for downstream statistical analyses. We present the RaggedExperiment R/Bioconductor data structure for lossless representation of ragged genomic data, with associated reshaping tools for flexible and efficient calculation of tabular representations to support a wide range of downstream statistical analyses. We demonstrate its applicability to copy number and somatic mutation data across 33 TCGA cancer datasets.
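One way to picture a reshaping step is to summarize each sample's ragged segments over a shared set of query ranges. The sketch below is again conceptual Python, not the package's R implementation: the data, the query ranges, the `overlap` and `summarize` helpers, and the choice of an overlap-weighted mean as the summary function are all assumptions made for illustration.

```python
# Hypothetical segments per sample: (chromosome, start, end, score).
ragged = {
    "sample1": [("chr1", 1, 10, 3.0), ("chr1", 8, 14, 4.0), ("chr2", 15, 18, 5.0)],
    "sample2": [("chr1", 1, 10, 1.0), ("chr2", 11, 18, 2.0)],
}

# Hypothetical query ranges shared by all samples: (chromosome, start, end).
queries = [("chr1", 1, 14), ("chr2", 11, 18), ("chr3", 1, 5)]


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed, 1-based intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def summarize(ragged, queries):
    """Overlap-weighted mean score per (query, sample); None if nothing overlaps."""
    table = []
    for q_chrom, q_start, q_end in queries:
        row = []
        for segments in ragged.values():
            weighted_sum = 0.0
            total_width = 0
            for chrom, start, end, score in segments:
                if chrom != q_chrom:
                    continue
                width = overlap(start, end, q_start, q_end)
                weighted_sum += width * score
                total_width += width
            row.append(weighted_sum / total_width if total_width else None)
        table.append(row)
    return table


table = summarize(ragged, queries)
for query, values in zip(queries, table):
    print(query, values)
```

The result has one row per query range and one column per sample, so it is rectangular regardless of how the original segments were placed. A different summary function (for example, a count of overlapping mutations) could be substituted for the weighted mean depending on the data type.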