State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, Beijing, China.
BMC Bioinformatics. 2013;14 Suppl 5(Suppl 5):S12. doi: 10.1186/1471-2105-14-S5-S12. Epub 2013 Apr 10.
Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes in metagenomic fragments is one of the most fundamental problems in this field.
In this article, we present MetaGUN, a novel gene prediction method for metagenomic fragments based on support vector machines (SVM). It predicts genes with a three-stage strategy. First, it classifies input fragments into phylogenetic groups using a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that take entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input features. Finally, the TISs are adjusted with a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds two modules, a universal module and a novel module. The former is based on a set of representative species, while the latter is designed to find potentially functional DNA sequences with conserved domains.
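As a rough illustration of the kind of input features described above, the following sketch computes an entropy density profile over the 64 codons of a candidate ORF and appends the ORF length. It assumes the commonly used EDP definition s_i = -p_i log(p_i) / H, where p_i is the frequency of codon i and H is the Shannon entropy of the codon distribution. The function names, the treatment of the stop codon as an ordinary codon, and the omission of the TIS score are simplifications made for this example and are not taken from MetaGUN itself.

```python
import math
from collections import Counter
from itertools import product

# Fixed ordering of the 64 codons so every feature vector is comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(orf):
    """Relative frequency of each codon, reading in frame from position 0.

    Codons containing ambiguous bases (e.g. N) are skipped.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    counts = Counter(
        orf[i:i + 3] for i in range(0, usable, 3) if orf[i:i + 3] in CODON_SET
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("ORF contains no unambiguous codons")
    return [counts[c] / total for c in CODONS]


def entropy_density_profile(freqs):
    """Assumed EDP definition: s_i = -p_i * log(p_i) / H, with H the Shannon entropy."""
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Degenerate case (a single codon type): the profile is undefined,
        # so this sketch returns all zeros by convention.
        return [0.0] * len(freqs)
    return [t / entropy for t in terms]


def orf_features(orf):
    """Hypothetical feature vector: 64 EDP components followed by ORF length in bases."""
    return entropy_density_profile(codon_frequencies(orf)) + [float(len(orf))]


if __name__ == "__main__":
    features = orf_features("ATGAAAATGTAA")
    print(len(features), round(sum(features[:64]), 6))
```

A vector of this form could then be passed to any standard SVM implementation. In a fuller pipeline, the TIS score would be added as a further component and the features would typically be scaled before training.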
Comparisons with several current metagenomic gene finders on artificial shotgun fragments show that MetaGUN achieves better results at both the 3' and 5' ends of genes across fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes in two samples of the human gut microbiome. It identified thousands of additional genes supported by significant evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.