Graduate Group in Biostatistics.
Center for Computational Biology.
Bioinformatics. 2020 Jun 1;36(11):3422-3430. doi: 10.1093/bioinformatics/btaa176.
Statistical analyses of high-throughput sequencing data have reshaped the biological sciences. Despite myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others that incorporate subject-matter knowledge, have provided effective advances. However, no procedure currently recovers features that are both stable and relevant.
Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis (PCA), sparse contrastive PCA (scPCA), that extracts sparse, stable, interpretable and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study and via analyses of several publicly available protein expression, microarray gene expression and single-cell transcriptome sequencing datasets.
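The idea of combining control data with sparsity can be illustrated with a minimal sketch. It is not the scPCA package's implementation, and it rests on assumptions the abstract does not state: that the contrastive step is an eigendecomposition of the target covariance minus a scaled background (control) covariance, and that sparsity can be approximated by soft-thresholding the loadings, a simple stand-in for a proper sparse PCA penalty. The function name and the parameters `alpha`, `n_components` and `threshold` are hypothetical.

```python
import numpy as np


def sparse_contrastive_pca(target, background, alpha=1.0,
                           n_components=2, threshold=0.1):
    """Illustrative sketch only; not the scPCA package's algorithm.

    target, background: (samples x features) arrays sharing the same features.
    alpha: assumed contrast parameter weighting the background covariance.
    threshold: assumed soft-thresholding level used to induce sparsity.
    """
    # Column-center each dataset separately.
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)

    # Sample covariance matrices of the target and background data.
    cov_t = t.T @ t / (t.shape[0] - 1)
    cov_b = b.T @ b / (b.shape[0] - 1)

    # Contrastive covariance: variation enriched in the target relative
    # to the background. It is symmetric, so eigh applies.
    contrast = cov_t - alpha * cov_b
    eigvals, eigvecs = np.linalg.eigh(contrast)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]

    # Soft-threshold the loadings (stand-in for a sparse PCA penalty),
    # then rescale each non-zero loading vector to unit length.
    sparse = np.sign(loadings) * np.maximum(np.abs(loadings) - threshold, 0.0)
    norms = np.linalg.norm(sparse, axis=0)
    norms[norms == 0] = 1.0
    sparse = sparse / norms

    # Project the centered target data onto the sparse loadings.
    return sparse, t @ sparse


if __name__ == "__main__":
    # Synthetic data: features 0-2 carry nuisance variation present in both
    # datasets; features 5-6 carry signal present only in the target.
    rng = np.random.default_rng(0)
    n, p = 500, 10
    background = rng.normal(size=(n, p))
    background[:, :3] *= 3.0
    target = rng.normal(size=(n, p))
    target[:, :3] *= 3.0
    target[:, 5:7] += 3.0 * rng.normal(size=(n, 1))

    loadings, scores = sparse_contrastive_pca(target, background)
    print(np.round(loadings[:, 0], 2))
```

In this sketch, `alpha` controls how strongly variation shared with the control data is penalized, and `threshold` controls how many loadings are set to zero; both would need tuning in practice.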
A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in this article is also available via GitHub.
philippe_boileau@berkeley.edu.
Supplementary data are available at Bioinformatics online.