MGC: a metagenomic gene caller.

Author information

Department of Computer Science and Engineering, University of South Carolina, 315 Main Street, Columbia, SC 29208, USA.

Publication information

BMC Bioinformatics. 2013;14 Suppl 9(Suppl 9):S6. doi: 10.1186/1471-2105-14-S9-S6. Epub 2013 Jun 28.

Abstract

BACKGROUND

Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome.

RESULTS

In this paper we introduce a metagenomic gene caller (MGC), an improvement over the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a single model that classifies ORFs extracted from fragmented sequences. We hypothesise, and present evidence, that sequences need separate models based on their local GC-content, in order to avoid the noise introduced into a single model computed from sequences spanning the entire GC spectrum. We have also added two amino-acid features, motivated by the benefit of amino-acid usage shown in our previous research. Our algorithm predicts genes and translation initiation sites (TIS) more accurately than Orphelia, which uses a single model.
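The idea of routing each fragment to a model trained for its GC-content range can be illustrated with a minimal Python sketch. This is not MGC's implementation: the bin edges in `GC_BIN_EDGES`, the function names, and the `models` list are all hypothetical, since the abstract does not state the actual GC ranges or model interface.

```python
from bisect import bisect_right

# Hypothetical GC-fraction bin edges; the abstract does not give MGC's ranges.
GC_BIN_EDGES = [0.0, 0.4, 0.5, 0.6, 1.0]


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / total


def gc_bin(seq: str, edges=GC_BIN_EDGES) -> int:
    """Return the index of the GC range containing the fragment.

    Bins are half-open [edges[i], edges[i + 1]), except the last bin,
    which also includes a GC fraction of exactly 1.0.
    """
    idx = bisect_right(edges, gc_content(seq)) - 1
    return min(idx, len(edges) - 2)


def select_model(seq: str, models: list):
    """Pick the model assigned to the fragment's GC range (one model per bin)."""
    return models[gc_bin(seq)]


# Usage: placeholder strings stand in for trained classifiers.
models = ["low-GC model", "mid-low-GC model", "mid-high-GC model", "high-GC model"]
chosen = select_model("ATATATGC", models)  # GC fraction 0.25 falls in the first bin
```

In this sketch each fragment is scored only by the model for its own GC range, which reflects the abstract's motivation: a model trained on a narrow GC range is not exposed to sequences from the rest of the GC spectrum.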

CONCLUSIONS

Learning separate models for several pre-defined GC-content regions, as opposed to a single-model approach, improves the performance of the neural network, as demonstrated by the experimental results presented in this paper. Including amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement lays the groundwork for further investigation into the use of GC-content to separate training data in machine-learning-based gene finders.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e83/3698006/f3ebcd2a565c/1471-2105-14-S9-S6-1.jpg
