Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands.
Division of Molecular Carcinogenesis, The Netherlands Cancer Institute, Amsterdam, The Netherlands.
Brief Bioinform. 2019 Jan 18;20(1):317-329. doi: 10.1093/bib/bbx119.
Genome-wide measurements of genetic and epigenetic alterations are generating increasing amounts of high-dimensional binary data. Because of the special mathematical characteristics of binary data, it is not obvious how to apply the classical principal component analysis (PCA) model directly to explore low-dimensional structure. Several PCA alternatives for binary data exist in the psychometric, data analysis and machine learning literature, but they are not well known in the bioinformatics community. Results: In this article, we introduce the motivation and rationale of several parametric and nonparametric versions of PCA designed specifically for binary data. Using realistic simulations of binary data as well as the mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we evaluated the methods on finding the correct number of components, overfitting, recovering the correct low-dimensional structure, variable importance, and related criteria. The results show that if a low-dimensional structure exists in the data, most of the methods can find it. When a probabilistic generating process is assumed to underlie the data, we recommend the parametric logistic PCA model; when such an assumption is not valid and the data are considered as given, we recommend the nonparametric Gifi model.
The code to reproduce the results in this article is available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).
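To make the idea of a parametric logistic PCA model concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common formulation in which the logit of each entry's success probability is a column offset plus a low-rank product, and it fits that model by plain gradient ascent on the Bernoulli log-likelihood, a technique chosen here for brevity. All names (`logistic_pca`, `neg_log_lik`, `rank`, `lr`, `n_iter`) and the default settings are hypothetical and are not taken from the article or its published code.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function, written via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def neg_log_lik(X, mu, A, B):
    # Bernoulli negative log-likelihood with logits theta = mu + A B^T.
    theta = mu + A @ B.T
    return float(np.sum(np.logaddexp(0.0, theta) - X * theta))


def logistic_pca(X, rank=2, n_iter=400, lr=0.005, seed=0):
    """Illustrative logistic PCA fit (assumed formulation, see lead-in).

    X is an (n, p) array of 0/1 values. Returns column offsets mu (p,),
    sample scores A (n, rank) and variable loadings B (p, rank).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = np.zeros(p)
    # Small random start so the factors can move away from zero.
    A = 0.01 * rng.standard_normal((n, rank))
    B = 0.01 * rng.standard_normal((p, rank))
    for _ in range(n_iter):
        theta = mu + A @ B.T
        # Residual X - P is the log-likelihood gradient w.r.t. the logits.
        R = X - sigmoid(theta)
        # Offsets use the column-averaged residual (a rescaled gradient step).
        mu = mu + lr * R.mean(axis=0)
        # Simultaneous gradient steps for scores and loadings.
        A, B = A + lr * (R @ B), B + lr * (R.T @ A)
    return mu, A, B
```

Under these assumptions, a call such as `mu, A, B = logistic_pca(X, rank=2)` on a 0/1 matrix `X` returns sample scores in `A` and variable loadings in `B`. The fixed step size is tuned for small matrices only, and the sketch includes no regularization or convergence check, both of which a real analysis would likely need.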