Navigating the pitfalls of applying machine learning in genomics.

Affiliations

Gladstone Institutes, San Francisco, CA, USA.

Department of Genetics, Stanford University, Stanford, CA, USA.

Publication information

Nat Rev Genet. 2022 Mar;23(3):169-181. doi: 10.1038/s41576-021-00434-9. Epub 2021 Nov 26.

DOI: 10.1038/s41576-021-00434-9
PMID: 34837041

Abstract

The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.

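As a rough illustration of how the structure of genomics data can bias a performance evaluation, the sketch below contrasts a per-example random split with a split that holds out a whole group. The abstract does not name a specific remedy, so the grouping by chromosome, the toy `(chromosome, position)` examples, and the helper names `random_split`, `group_split`, and `shared_groups` are all assumptions made for illustration, not the authors' method.

```python
import random

def random_split(examples, test_fraction, seed=0):
    """Shuffle examples individually, ignoring any grouping among them."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def group_split(examples, group_of, test_groups):
    """Hold out whole groups so no group contributes to both train and test."""
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

def shared_groups(train, test, group_of):
    """Groups that appear on both sides of a split (a possible leakage signal)."""
    return {group_of(e) for e in train} & {group_of(e) for e in test}

# Hypothetical toy data: 5 chromosomes with 20 positions each.
examples = [(f"chr{c}", pos) for c in range(1, 6) for pos in range(20)]

def chrom(example):
    return example[0]

rand_train, rand_test = random_split(examples, test_fraction=0.2)
grp_train, grp_test = group_split(examples, chrom, test_groups={"chr5"})

print("random split, shared chromosomes:",
      sorted(shared_groups(rand_train, rand_test, chrom)))
print("group split, shared chromosomes:",
      sorted(shared_groups(grp_train, grp_test, chrom)))
```

If examples from the same chromosome are statistically dependent, a random split lets related examples sit on both sides, which may inflate the measured test performance. Holding out whole groups is one common way to reduce that risk, but whether chromosome is the right grouping depends on the data set.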

