Laboratory for MetaSystems Research at RIKEN, Japan.
Brief Bioinform. 2012 Nov;13(6):711-27. doi: 10.1093/bib/bbs033. Epub 2012 Jul 6.
Metagenomic sequencing provides a unique opportunity to explore Earth's limitless environments, which harbor scores of as yet unknown and mostly unculturable microbes and other organisms. Functional analysis of metagenomic data plays a central role in projects aiming to explore the most essential questions in microbiology, namely 'In a given environment, among the microbes present, what are they doing, and how are they doing it?' Toward this goal, several large-scale metagenomic projects have recently been conducted or are currently underway. Functional analysis of metagenomic data suffers mainly from the vast amount of data generated in these projects. The sheer amount of data requires much computational time and storage space. These problems are compounded by other factors potentially affecting the functional analysis, including sample preparation, sequencing method and average genome size of the metagenomic samples. In addition, the read lengths generated during sequencing influence sequence assembly, gene prediction and, subsequently, the functional analysis. The level of confidence in functional predictions increases with increasing read length. Usually, the most reliable functional annotations for metagenomic sequences are achieved using homology-based approaches against publicly available reference sequence databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence data, the bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. Finally, we provide some examples of applications from recent metagenomic studies that have been conducted successfully in spite of the known difficulties.