Evaluating Plant Gene Models Using Machine Learning.

Author information

Upadhyaya Shriprabha R, Bayer Philipp E, Tay Fernandez Cassandria G, Petereit Jakob, Batley Jacqueline, Bennamoun Mohammed, Boussaid Farid, Edwards David

Affiliations

School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia.

Department of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6000, Australia.

Publication information

Plants (Basel). 2022 Jun 20;11(12):1619. doi: 10.3390/plants11121619.

Abstract

Gene models are regions of the genome that can be transcribed into RNA and translated into proteins, or that belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false-positive annotations. To support the calling of confident, conserved gene models and to minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Cameor genome. These features were used to train eXtreme Gradient Boosting (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91-0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein-based and gene-based features can be used to build accurate models for gene prediction, with applications in supporting future gene annotation processes.
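As a rough illustration of the kind of inputs and metrics the abstract refers to, the sketch below computes two simple sequence-based features and derives accuracy and the F1 score from binary predictions. It is not the Truegene implementation: the abstract does not list the 14 gene-based and 41 protein-based features, so GC content and amino acid frequencies are assumed examples, and the function names `gc_content`, `amino_acid_frequencies`, and `accuracy_and_f1` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gc_content(nucleotides):
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty).

    Assumed example of a gene-based feature; not taken from the paper.
    """
    seq = nucleotides.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def amino_acid_frequencies(protein):
    """Relative frequency of each standard amino acid in a protein sequence.

    Assumed example of protein-based features; trailing stop symbols (*)
    are dropped before counting.
    """
    seq = protein.upper().rstrip("*")
    counts = Counter(seq)
    total = len(seq)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = conserved, 0 = not)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1
```

Feature vectors of this kind, one row per annotated gene, could then be passed to a gradient-boosted classifier; the metric function shows how figures such as the reported accuracy and F1 score are derived from a model's predictions on held-out genes.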

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63f1/9230120/2d09f90f012e/plants-11-01619-g001.jpg
