Department of Preventive Medicine, CA 90033, USA.
Department of Medicine, Division of Medical Oncology, Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA.
Bioinformatics. 2020 Feb 1;36(3):676-681. doi: 10.1093/bioinformatics/btz661.
Large amounts of information generated by genomic technologies are accompanied by statistical and computational challenges due to redundancy, badly behaved data and noise. Dimensionality reduction (DR) methods have been developed to mitigate these challenges. However, many approaches are not scalable to large dimensions or result in excessive information loss.
The proposed approach partitions the data into subsets of related features and summarizes each subset as exactly one new feature, thus defining a surjective mapping from original features to new features. A constraint on information loss determines the size of the reduced dataset. Simulation studies demonstrate that when multiple related features are associated with a response, this approach can substantially increase the number of true associations detected compared with principal components analysis, non-negative matrix factorization or no DR. This increase in true discoveries is explained both by a reduced multiple-testing burden and by a reduction in extraneous noise. In an application to real data collected from metastatic colorectal cancer tumors, more associations between gene expression features and both progression-free survival and response to treatment were detected in the reduced dataset than in the full, untransformed dataset.
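The partition-and-summarize idea can be illustrated with a minimal Python sketch. This is not the algorithm implemented in the R package. It assumes a greedy agglomerative search, the cluster mean as the summary feature, and an information-loss constraint defined as a minimum squared Pearson correlation between each original feature and its summary. The names `partition_features`, `min_r2`, `summarize`, `demo_data`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def summarize(columns, members):
    """Summarize a subset of features as a single new feature (here: their mean)."""
    n = len(columns[0])
    return [sum(columns[j][i] for j in members) / len(members) for i in range(n)]


def min_r2(columns, members):
    """Assumed information measure: the smallest squared correlation between
    any member feature and the subset's summary feature."""
    summary = summarize(columns, members)
    return min(pearson(columns[j], summary) ** 2 for j in members)


def partition_features(columns, threshold=0.8):
    """Greedily merge feature subsets while every member keeps at least
    `threshold` squared correlation with its summary feature.

    `columns` is a list of features, each a list of observations.
    Returns (clusters, reduced): each original feature index appears in
    exactly one cluster, and each cluster yields exactly one new feature.
    """
    clusters = [[j] for j in range(len(columns))]
    while len(clusters) > 1:
        best = None
        # Find the pair of subsets whose merge retains the most information.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = min_r2(columns, clusters[a] + clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        if score < threshold:
            break  # any further merge would violate the constraint
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    reduced = [summarize(columns, c) for c in clusters]
    return clusters, reduced


def demo_data(n=200, seed=0):
    """Simulated data: features 0-2 share one latent signal, features 3-4
    share another, and feature 5 is independent noise."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    cols = [[z + rng.gauss(0, 0.1) for z in z1] for _ in range(3)]
    cols += [[z + rng.gauss(0, 0.1) for z in z2] for _ in range(2)]
    cols.append([rng.gauss(0, 1) for _ in range(n)])
    return cols


clusters, reduced = partition_features(demo_data(), threshold=0.8)
print(len(reduced), "summary features from", sum(len(c) for c in clusters), "originals")
```

In this sketch a stricter `threshold` permits fewer merges and leaves a larger reduced dataset, which mirrors how the information-loss constraint determines the size of the reduced dataset. The all-pairs search is quadratic in the number of subsets per merge, so it is meant only to convey the idea and would not scale to genomic dimensions.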
The method is freely available as the R package partition from CRAN: https://cran.r-project.org/package=partition.
Supplementary data are available at Bioinformatics online.