Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19.

Authors

König Inke R, Auerbach Jonathan, Gola Damian, Held Elizabeth, Holzinger Emily R, Legault Marc-André, Sun Rui, Tintle Nathan, Yang Hsin-Chou

Affiliations

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.

Department of Statistics, Columbia University, New York, NY, 10027, USA.

Publication information

BMC Genet. 2016 Feb 3;17 Suppl 2(Suppl 2):1. doi: 10.1186/s12863-015-0315-8.

DOI:10.1186/s12863-015-0315-8

PMID:26866367

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4895282/

Abstract

In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
