Principal component analysis of binary genomics data.

Affiliations

Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands.

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute, Amsterdam, The Netherlands.

Publication information

Brief Bioinform. 2019 Jan 18;20(1):317-329. doi: 10.1093/bib/bbx119.

DOI:10.1093/bib/bbx119

PMID:30657888

Abstract

MOTIVATION

Genome-wide measurements of genetic and epigenetic alterations are generating more and more high-dimensional binary data. The special mathematical characteristics of binary data make the direct use of the classical principal component analysis (PCA) model to explore low-dimensional structures less obvious. Although there are several PCA alternatives for binary data in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community.

RESULTS

In this article, we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared for binary data. Using both realistic simulations of binary data and the mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we evaluated the methods on finding the correct number of components, overfitting, recovering the correct low-dimensional structure, variable importance, etc. The results show that if a low-dimensional structure exists in the data, most of the methods can find it. If a probabilistic generating process is assumed to underlie the data, we recommend the parametric logistic PCA model; if that assumption is not valid and the data are treated as given, we recommend the nonparametric Gifi model.

AVAILABILITY

The code to reproduce the results in this article is available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).
