

Gene prediction in metagenomic fragments: a large scale machine learning approach.

Author information

Hoff Katharina J, Tech Maike, Lingner Thomas, Daniel Rolf, Morgenstern Burkhard, Meinicke Peter

Affiliation

Abteilung Bioinformatik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.

Publication information

BMC Bioinformatics. 2008 Apr 28;9:217. doi: 10.1186/1471-2105-9-217.

Abstract

BACKGROUND

Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.

RESULTS

We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that a given open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
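The two-stage structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weights are random placeholders rather than trained values, and the window encoding for translation initiation sites, the hidden layer size, the ORF length scaling and all function names are assumptions made only for illustration.

```python
import math
import random
from itertools import product

CODON_INDEX = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}


def monocodon_freqs(orf):
    """Relative frequency of each of the 64 codons in an in-frame ORF."""
    counts = [0.0] * 64
    for i in range(0, len(orf) - 2, 3):
        idx = CODON_INDEX.get(orf[i:i + 3])  # None for ambiguous bases such as N
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def dicodon_freqs(orf):
    """Relative frequency of each of the 4096 adjacent codon pairs."""
    counts = [0.0] * (64 * 64)
    for i in range(0, len(orf) - 5, 3):
        a = CODON_INDEX.get(orf[i:i + 3])
        b = CODON_INDEX.get(orf[i + 3:i + 6])
        if a is not None and b is not None:
            counts[a * 64 + b] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def one_hot(window):
    """Encode a DNA window around a candidate start codon, 4 values per base."""
    return [1.0 if base == letter else 0.0 for base in window for letter in "ACGT"]


def gc_content(fragment):
    """Fraction of G and C bases in the whole fragment."""
    return (fragment.count("G") + fragment.count("C")) / len(fragment) if fragment else 0.0


def linear_score(weights, bias, features):
    """Stage 1: a linear discriminant reduces a feature vector to one score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def network_probability(inputs, hidden_w, hidden_b, out_w, out_b):
    """Stage 2: one tanh hidden layer followed by a sigmoid output unit."""
    hidden = [math.tanh(linear_score(w, b, inputs)) for w, b in zip(hidden_w, hidden_b)]
    return 1.0 / (1.0 + math.exp(-linear_score(out_w, out_b, hidden)))


def random_weights(rng, n):
    """Placeholder weights; a real model would learn these from training data."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def score_orf(orf, tis_window, fragment, rng_seed=0, n_hidden=4):
    """Score one ORF candidate in (0, 1) using untrained placeholder weights."""
    rng = random.Random(rng_seed)
    stage_one = [
        linear_score(random_weights(rng, 64), 0.0, monocodon_freqs(orf)),
        linear_score(random_weights(rng, 4096), 0.0, dicodon_freqs(orf)),
        linear_score(random_weights(rng, 4 * len(tis_window)), 0.0, one_hot(tis_window)),
    ]
    # Assumed scaling: ORF length expressed as a fraction of fragment length.
    inputs = stage_one + [len(orf) / len(fragment), gc_content(fragment)]
    hidden_w = [random_weights(rng, len(inputs)) for _ in range(n_hidden)]
    hidden_b = random_weights(rng, n_hidden)
    return network_probability(inputs, hidden_w, hidden_b, random_weights(rng, n_hidden), 0.0)


# Toy example: a made-up ORF embedded in a short made-up fragment.
orf = "ATGGCTAAAGGTGAACTGCTGCGTAAATAA"
fragment = "CCGGAGGTTAAC" + orf + "GC"
tis_window = fragment[:18]  # 12 upstream bases plus the first 6 bases of the ORF
probability = score_orf(orf, tis_window, fragment)
print(f"score for candidate ORF: {probability:.3f}")
```

Because the weights are random, the printed score carries no biological meaning. The sketch only shows how three high-dimensional feature vectors are each reduced to a single discriminant score before a small network combines them with ORF length and fragment GC-content.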

CONCLUSION

Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).


