
An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat.

Authors

Grinberg Nastasiya F, Orhobor Oghenejokpeme I, King Ross D

Affiliations

1School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK.

2Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK.

Publication information

Mach Learn. 2020;109(2):251-277. doi: 10.1007/s10994-019-05848-5. Epub 2019 Oct 23.

DOI:10.1007/s10994-019-05848-5
PMID:32174648
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7048706/
Abstract

In phenotype prediction, the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forest. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when it is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
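To make two of the compared methods concrete, the following is a minimal NumPy sketch of genomic BLUP written in its kernel form, alongside ridge regression on the markers. It is not the authors' code. The simulated 0/1/2 genotypes, the simplified relationship matrix G = ZZᵀ/m on centered markers, the hand-fixed penalty `lam`, the single train/test split, and the use of R² as the metric are all assumptions made for illustration. All function names are hypothetical.

```python
import numpy as np

def simulate(n=200, m=500, n_causal=10, noise=1.0, seed=0):
    """Toy data (assumption): 0/1/2 marker genotypes, sparse additive phenotype."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 3, size=(n, m)).astype(float)
    beta = np.zeros(m)
    causal = rng.choice(m, size=n_causal, replace=False)
    beta[causal] = rng.normal(0.0, 1.0, size=n_causal)
    y = Z @ beta + rng.normal(0.0, noise, size=n)
    return Z, y

def center(Z_train, Z_test):
    """Center markers using training-set column means only."""
    mu = Z_train.mean(axis=0)
    return Z_train - mu, Z_test - mu

def gblup_predict(Z_train, y_train, Z_test, lam):
    """Kernel-form GBLUP with a simplified relationship matrix G = Z Z^T / m."""
    Zc, Zt = center(Z_train, Z_test)
    m = Zc.shape[1]
    G = Zc @ Zc.T / m            # train x train relationships
    K = Zt @ Zc.T / m            # test x train relationships
    y_mean = y_train.mean()
    # lam plays the role of the residual-to-genetic variance ratio (fixed by hand here)
    a = np.linalg.solve(G + lam * np.eye(len(y_train)), y_train - y_mean)
    return y_mean + K @ a

def ridge_predict(Z_train, y_train, Z_test, alpha):
    """Ridge regression on marker effects (primal form)."""
    Zc, Zt = center(Z_train, Z_test)
    m = Zc.shape[1]
    y_mean = y_train.mean()
    beta = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(m), Zc.T @ (y_train - y_mean))
    return y_mean + Zt @ beta

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

Z, y = simulate()
train, test = np.arange(150), np.arange(150, 200)
lam = 1.0
pred_gblup = gblup_predict(Z[train], y[train], Z[test], lam)
# With this G, GBLUP at penalty lam matches ridge at alpha = lam * m
pred_ridge = ridge_predict(Z[train], y[train], Z[test], alpha=lam * Z.shape[1])
print("GBLUP R^2:", round(r_squared(y[test], pred_gblup), 3))
print("max |GBLUP - ridge|:", np.max(np.abs(pred_gblup - pred_ridge)))
```

With this simplified G and a matched penalty, the two predictors coincide, so any difference between ridge regression and BLUP in practice comes from how the relationship matrix is built and how the penalty is chosen (for example variance-component estimation versus cross-validation). A kinship-based G is also one route by which population structure can enter the model.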


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/bbffdff5acba/10994_2019_5848_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/0751f05eb731/10994_2019_5848_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/edab53112818/10994_2019_5848_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/86b4c2c3b4e0/10994_2019_5848_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/256933bcb6b5/10994_2019_5848_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/68cf227a4c12/10994_2019_5848_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/c6d6c3661f88/10994_2019_5848_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/849be8eb2f78/10994_2019_5848_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ef8/7048706/c5ff60f875cb/10994_2019_5848_Fig8_HTML.jpg
