Evaluating Plant Gene Models Using Machine Learning.

Author information

Upadhyaya Shriprabha R, Bayer Philipp E, Tay Fernandez Cassandria G, Petereit Jakob, Batley Jacqueline, Bennamoun Mohammed, Boussaid Farid, Edwards David

Affiliations

School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia.

Department of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6000, Australia.

Publication information

Plants (Basel). 2022 Jun 20;11(12):1619. doi: 10.3390/plants11121619.

DOI:10.3390/plants11121619

PMID:35736770

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9230120/

Abstract

Gene models are regions of the genome that can be transcribed into RNA and translated into proteins, or that belong to a class of non-coding RNA genes. Predicting gene models is a complex process that can be unreliable, leading to false positive annotations. To support the calling of confident, conserved gene models and to minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach that classifies potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Cameor genome. These features were used to train eXtreme Gradient Boost (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F-1 score of 0.91-0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein-based and gene-based features can be used to build accurate models for gene prediction, with applications in supporting future gene annotation processes.
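As a rough illustration of what "amino acid and nucleotide sequence-based features" might look like in practice, the sketch below computes a few simple per-gene values in plain Python. The abstract does not list the 14 gene and 41 protein characteristics actually used by Truegene, so the specific features here (length, GC content, amino acid fractions) and the function names `gene_features`, `protein_features`, and `feature_row` are assumptions chosen for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gene_features(nucleotides: str) -> dict:
    """Nucleotide-level features for one gene model (illustrative choices)."""
    seq = nucleotides.upper()
    counts = Counter(seq)
    length = len(seq)
    # GC content: fraction of bases that are G or C; 0.0 for an empty sequence.
    gc = (counts["G"] + counts["C"]) / length if length else 0.0
    return {"gene_length": length, "gc_content": gc}


def protein_features(protein: str) -> dict:
    """Protein-level features for one gene model (illustrative choices)."""
    # Drop a trailing stop symbol if the translation includes one.
    seq = protein.upper().rstrip("*")
    counts = Counter(seq)
    length = len(seq)
    features = {"protein_length": length}
    # Fraction of the protein made up of each standard amino acid.
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts[aa] / length if length else 0.0
    return features


def feature_row(nucleotides: str, protein: str) -> dict:
    """Combine gene and protein features into one row for a classifier."""
    return {**gene_features(nucleotides), **protein_features(protein)}


# Toy example: a 12-base coding sequence and its translation.
row = feature_row("ATGGCCGCTTAA", "MAA*")
print(row["gene_length"], row["gc_content"], row["protein_length"])
```

Rows like this, labeled as conserved or non-conserved, are the kind of tabular input a gradient-boosted classifier such as XGBoost would be trained on; the training and SHAP analysis steps are not shown here.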

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63f1/9230120/2d09f90f012e/plants-11-01619-g001.jpg

Cited by

Technological Development and Advances for Constructing and Analyzing Plant Pangenomes.

Genome Biol Evol. 2024 Apr 2;16(4). doi: 10.1093/gbe/evae081.

Unravelling inversions: Technological advances, challenges, and potential impact on crop breeding.

Plant Biotechnol J. 2024 Mar;22(3):544-554. doi: 10.1111/pbi.14224. Epub 2023 Nov 14.
