COGNIZER: A Framework for Functional Annotation of Metagenomic Datasets

Author information

Bose Tungadri, Haque Mohammed Monzoorul, Reddy Cvsk, Mande Sharmila S

Affiliation

Bio-Sciences R&D Division, TCS Innovation Labs, Tata Consultancy Services Limited, 54-B, Hadapsar Industrial Estate, Pune, 411013, Maharashtra, India.

Publication information

PLoS One. 2015 Nov 11;10(11):e0142102. doi: 10.1371/journal.pone.0142102. eCollection 2015.

DOI:10.1371/journal.pone.0142102

PMID:26561344

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4641738/

Abstract

BACKGROUND

Recent advances in sequencing technologies have resulted in an unprecedented increase in the number of metagenomes being sequenced worldwide. Given their volume, functional annotation of metagenomic sequence datasets requires specialized computational tools and techniques. Despite their high accuracy, existing stand-alone functional annotation tools require end-users to perform compute-intensive homology searches of metagenomic datasets against multiple databases prior to functional analysis. Although web-based functional annotation servers address, to some extent, the problem of availability of compute resources, uploading and analyzing huge volumes of sequence data on a shared public web service has its own set of limitations. In this study, we present COGNIZER, a comprehensive stand-alone annotation framework that enables end-users to functionally annotate the sequences constituting metagenomic datasets. The COGNIZER framework provides multiple workflow options. A subset of these options employs a novel directed-search strategy that helps reduce the overall compute requirements for end-users. The COGNIZER framework includes a cross-mapping database that enables end-users to simultaneously derive or infer KEGG, Pfam, GO, and SEED subsystem information from the COG annotations.

RESULTS

Validation experiments performed with real-world metagenomes and metatranscriptomes, generated using diverse sequencing technologies, indicate that the novel directed-search strategy employed in COGNIZER helps reduce compute requirements without significant loss in annotation accuracy. A comparison of COGNIZER's results with pre-computed benchmark values indicates the reliability of the cross-mapping database employed in COGNIZER.

CONCLUSION

The COGNIZER framework is capable of comprehensively annotating, in functional terms, any metagenomic or metatranscriptomic dataset from varied sequencing platforms. Multiple search options in COGNIZER give end-users the flexibility to choose a homology search protocol based on available compute resources. The cross-mapping database in COGNIZER is of high utility since it enables end-users to directly infer or derive KEGG, Pfam, GO, and SEED subsystem annotations from COG categorizations. Furthermore, the availability of COGNIZER as a stand-alone, scalable implementation is expected to make it a valuable annotation tool in the field of metagenomic research.

AVAILABILITY AND IMPLEMENTATION

A Linux implementation of COGNIZER is freely available for download from the following links: http://metagenomics.atc.tcs.com/cognizer, https://metagenomics.atc.tcs.com/function/cognizer.
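
ILLUSTRATIVE SKETCH

The cross-mapping idea described in the abstract (deriving KEGG, Pfam, GO, and SEED subsystem annotations from a COG assignment) can be pictured as a simple table lookup. The Python sketch below is only an illustration under stated assumptions: it is not COGNIZER's code, the function name `cross_annotate` and the table `CROSS_MAP` are hypothetical, and every identifier in the table is a made-up placeholder rather than a real database entry.

```python
from collections import defaultdict

SCHEMES = ("KEGG", "Pfam", "GO", "SEED")

# Hypothetical cross-mapping table: COG ID -> annotations in other schemes.
# All identifiers are placeholders invented for this sketch.
CROSS_MAP = {
    "COG_A": {"KEGG": ["K_A1"], "Pfam": ["PF_A1", "PF_A2"],
              "GO": ["GO_A1"], "SEED": ["Subsystem_A"]},
    "COG_B": {"KEGG": ["K_B1"], "Pfam": ["PF_B1"],
              "GO": [], "SEED": ["Subsystem_B"]},
}


def cross_annotate(read_to_cog, cross_map):
    """Tally, per scheme, how many reads reach each annotation via their COG.

    Returns (counts, unmapped), where counts maps scheme -> {term: n_reads}
    and unmapped is the number of reads whose COG is absent from the table.
    """
    counts = {scheme: defaultdict(int) for scheme in SCHEMES}
    unmapped = 0
    for _read_id, cog in read_to_cog.items():
        entry = cross_map.get(cog)
        if entry is None:
            unmapped += 1
            continue
        for scheme in SCHEMES:
            for term in entry.get(scheme, []):
                counts[scheme][term] += 1
    return {scheme: dict(c) for scheme, c in counts.items()}, unmapped


if __name__ == "__main__":
    # Hypothetical per-read COG assignments from a prior homology search.
    reads = {"read1": "COG_A", "read2": "COG_A", "read3": "COG_B", "read4": "COG_X"}
    profile, unmapped = cross_annotate(reads, CROSS_MAP)
    for scheme in SCHEMES:
        print(scheme, profile[scheme])
    print("unmapped reads:", unmapped)
```

The point of the sketch is that a single homology search yielding COG assignments is enough to populate all four derived profiles, which is the compute saving the abstract attributes to the cross-mapping database. How COGNIZER actually stores and resolves these mappings is not specified here.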
