Principal component analysis of binary genomics data.

Affiliations

Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands.

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute, Amsterdam, The Netherlands.

Publication information

Brief Bioinform. 2019 Jan 18;20(1):317-329. doi: 10.1093/bib/bbx119.

DOI: 10.1093/bib/bbx119
PMID: 30657888

Abstract

MOTIVATION

Genome-wide measurements of genetic and epigenetic alterations are generating more and more high-dimensional binary data. The special mathematical characteristics of binary data make the direct use of the classical principal component analysis (PCA) model to explore low-dimensional structures less obvious. Although there are several PCA alternatives for binary data in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community.

RESULTS

In this article, we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared for binary data. Using both realistic simulations of binary data as well as mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), the methods were explored for their performance with respect to finding the correct number of components, overfit, finding back the correct low-dimensional structure, variable importance, etc. The results show that if a low-dimensional structure exists in the data, that most of the methods can find it. When assuming a probabilistic generating process is underlying the data, we recommend to use the parametric logistic PCA model, while when such an assumption is not valid and the data are considered as given, the nonparametric Gifi model is recommended.

AVAILABILITY

The codes to reproduce the results in this article are available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).
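
As a rough illustration of the parametric logistic PCA model recommended in the abstract, the sketch below fits a low-rank Bernoulli model to a binary matrix by plain gradient descent. It is not the authors' code (which is hosted at the address above). The function name `logistic_pca`, the parameterization logit P(x_ij = 1) = mu_j + (A Bᵀ)_ij, the step size, and the iteration count are all assumptions made for this example.

```python
import numpy as np

def logistic_pca(X, k=2, n_iter=300, lr=0.2, seed=0):
    """Hypothetical minimal logistic PCA fitted by gradient descent.

    Assumed model: logit P(X[i, j] = 1) = mu[j] + (A @ B.T)[i, j],
    with A (n x k) the sample scores and B (p x k) the variable loadings.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)                         # per-variable offset
    A = 0.1 * rng.standard_normal((n, k))    # small random start for scores
    B = 0.1 * rng.standard_normal((p, k))    # small random start for loadings
    losses = []
    for _ in range(n_iter):
        theta = mu + A @ B.T                 # matrix of logits
        # Mean Bernoulli negative log-likelihood: log(1 + e^theta) - x * theta
        losses.append(np.mean(np.logaddexp(0.0, theta) - X * theta))
        prob = 1.0 / (1.0 + np.exp(-theta))  # fitted probabilities
        G = prob - X                         # d(NLL)/d(theta), entrywise
        # Simultaneous update: both steps use the A and B from this iteration
        A_next = A - lr * (G @ B) / p
        B_next = B - lr * (G.T @ A) / n
        mu = mu - lr * G.mean(axis=0)
        A, B = A_next, B_next
    return mu, A, B, losses
```

Calling `mu, A, B, losses = logistic_pca(X, k=2)` on a samples-by-variables 0/1 matrix returns the scores in `A` and the loadings in `B`. Because the scores and loadings are only identified up to an invertible linear transformation, a practical analysis would typically add an orthogonality constraint or post-hoc rotation, which this sketch omits.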

Similar articles

1
Principal component analysis of binary genomics data.
Brief Bioinform. 2019 Jan 18;20(1):317-329. doi: 10.1093/bib/bbx119.
2
Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration.
Stat Appl Genet Mol Biol. 2017 Jul 26;16(3):199-216. doi: 10.1515/sagmb-2016-0066.
3
Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA.
Brief Bioinform. 2015 Mar;16(2):291-303. doi: 10.1093/bib/bbu003. Epub 2014 Mar 13.
4
Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data.
Bioinformatics. 2015 Aug 15;31(16):2683-90. doi: 10.1093/bioinformatics/btv197. Epub 2015 Apr 10.
5
BEclear: Batch Effect Detection and Adjustment in DNA Methylation Data.
PLoS One. 2016 Aug 25;11(8):e0159921. doi: 10.1371/journal.pone.0159921. eCollection 2016.
6
Predicting censored survival data based on the interactions between meta-dimensional omics data in breast cancer.
J Biomed Inform. 2015 Aug;56:220-8. doi: 10.1016/j.jbi.2015.05.019. Epub 2015 Jun 3.
7
MethCNA: a database for integrating genomic and epigenomic data in human cancer.
BMC Genomics. 2018 Feb 13;19(1):138. doi: 10.1186/s12864-018-4525-0.
8
Outlier reset CUSUM for the exploration of copy number alteration data.
Stat Appl Genet Mol Biol. 2015 Aug;14(4):333-45. doi: 10.1515/sagmb-2014-0027.
9
Principal component analysis based methods in bioinformatics studies.
Brief Bioinform. 2011 Nov;12(6):714-22. doi: 10.1093/bib/bbq090. Epub 2011 Jan 17.
10
Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples.
Comput Biol Med. 2010 Aug;40(8):723-32. doi: 10.1016/j.compbiomed.2010.06.007. Epub 2010 Jul 16.

Cited by

1
Logistic PCA explains differences between genome-scale metabolic models in terms of metabolic pathways.
PLoS Comput Biol. 2024 Jun 24;20(6):e1012236. doi: 10.1371/journal.pcbi.1012236. eCollection 2024 Jun.
2
Identification and Preliminary Clinical Validation of Key Extracellular Proteins as the Potential Biomarkers in Hashimoto's Thyroiditis by Comprehensive Analysis.
Biomedicines. 2023 Nov 24;11(12):3127. doi: 10.3390/biomedicines11123127.
3
The Role of NCS1 in Immunotherapy and Prognosis of Human Cancer.
Biomedicines. 2023 Oct 12;11(10):2765. doi: 10.3390/biomedicines11102765.
4
Roxadustat alleviates the inflammatory status in patients receiving maintenance hemodialysis with erythropoiesis-stimulating agent resistance by increasing the short-chain fatty acids producing gut bacteria.
Eur J Med Res. 2023 Jul 10;28(1):230. doi: 10.1186/s40001-023-01179-3.
5
MMP1 acts as a potential regulator of tumor progression and dedifferentiation in papillary thyroid cancer.
Front Oncol. 2022 Nov 21;12:1030590. doi: 10.3389/fonc.2022.1030590. eCollection 2022.
6
Predictive Biomarkers for Postmyocardial Infarction Heart Failure Using Machine Learning: A Secondary Analysis of a Cohort Study.
Evid Based Complement Alternat Med. 2021 Dec 13;2021:2903543. doi: 10.1155/2021/2903543. eCollection 2021.
7
Epigenetic Biomarkers of Transition from Metabolically Healthy Obesity to Metabolically Unhealthy Obesity Phenotype: A Prospective Study.
Int J Mol Sci. 2021 Sep 27;22(19):10417. doi: 10.3390/ijms221910417.
8
Clinical pharmacogenomics in action: design, assessment and implementation of a novel pharmacogenetic panel supporting drug selection for diseases of the central nervous system (CNS).
J Transl Med. 2021 Apr 15;19(1):151. doi: 10.1186/s12967-021-02816-3.
9
EZH2-TROAP Pathway Promotes Prostate Cancer Progression Via TWIST Signals.
Front Oncol. 2021 Feb 22;10:592239. doi: 10.3389/fonc.2020.592239. eCollection 2020.