Gene prediction in metagenomic fragments: a large scale machine learning approach.

Authors

Hoff Katharina J, Tech Maike, Lingner Thomas, Daniel Rolf, Morgenstern Burkhard, Meinicke Peter

Affiliation

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.

Publication information

BMC Bioinformatics. 2008 Apr 28;9:217. doi: 10.1186/1471-2105-9-217.

DOI:10.1186/1471-2105-9-217

PMID:18442389

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2409338/

Abstract

BACKGROUND

Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.

RESULTS

We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that a given open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
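The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the single monocodon discriminant, the tanh hidden layer, the scaling of ORF length by 700 bp, and all weights are assumptions chosen only for illustration, since the abstract does not specify the network architecture or any trained parameters.

```python
import math
from itertools import product

# All 64 codons in a fixed order, so feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(orf):
    """Relative frequency of each codon in an in-frame sequence."""
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(orf) - len(orf) % 3, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases
            counts[codon] += 1
            total += 1
    return [counts[c] / total if total else 0.0 for c in CODONS]

def gc_content(fragment):
    """Fraction of G and C bases in the fragment."""
    if not fragment:
        return 0.0
    return sum(fragment.count(b) for b in "GC") / len(fragment)

def linear_discriminant(features, weights, bias=0.0):
    """Stage 1: reduce a feature vector to one score (w . x + b)."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def coding_probability(inputs, hidden_w, hidden_b, out_w, out_b):
    """Stage 2: one-hidden-layer network mapping features to (0, 1)."""
    hidden = [math.tanh(linear_discriminant(inputs, w, b))
              for w, b in zip(hidden_w, hidden_b)]
    z = linear_discriminant(hidden, out_w, out_b)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical ORF candidate and placeholder (untrained) weights.
fragment = "ATGAAACGCGGTTAA"
mono_score = linear_discriminant(codon_frequencies(fragment), [0.1] * 64)
inputs = [mono_score, len(fragment) / 700, gc_content(fragment)]
p = coding_probability(
    inputs,
    hidden_w=[[1.0, 0.5, -0.5], [-1.0, 0.2, 0.3]],
    hidden_b=[0.0, 0.1],
    out_w=[0.8, -0.4],
    out_b=0.0,
)
print(f"coding probability (placeholder weights): {p:.3f}")
```

In a setup closer to the one described, the dicodon and translation initiation site discriminants would contribute further inputs to the network, and all weights would be learned from annotated genomes rather than set by hand.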

CONCLUSION

Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
