Shandong Key Laboratory of Energy Genetics, CAS Key Laboratory of Biofuels and BioEnergy Genome Center, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao 266101, Shandong Province, People's Republic of China.
Bioinformatics. 2012 Oct 1;28(19):2493-501. doi: 10.1093/bioinformatics/bts470. Epub 2012 Jul 26.
Scientists have long sought to compare different microbial communities (also referred to here as 'metagenomic samples') effectively at large scale: given a set of unknown samples, find similar metagenomic samples in a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest. Any metagenomic sample could then be searched against this database to find the most similar metagenomic sample(s). However, current databases holding large numbers of metagenomic samples mostly serve as data repositories that offer few analysis functions, while methods for measuring the similarity of metagenomic data work well only for small sets of samples compared pairwise. It is not yet clear how to search efficiently for metagenomic samples against a large metagenomic database.
In this study, we propose a novel method, Meta-Storms, that can systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database with a fast scoring function based on quantitative phylogeny and (iv) managing the database through index export, index import, data insertion, data deletion and database merging. We collected more than 1300 metagenomic datasets from the public domain and in-house facilities, and tested the Meta-Storms method on them. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and that it achieves accuracy similar to that of currently popular significance testing-based methods.
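The interplay of components (ii) and (iii) can be illustrated with a toy sketch. The code below is not the published Meta-Storms index or scoring function, neither of which is specified in this abstract. It assumes a sample is a map from taxonomic lineage to relative abundance, buckets samples by their most abundant taxon at a chosen rank, and scores a query only against samples in the matching bucket. The names `rollup`, `similarity` and `TaxonomyIndex`, and the shared-abundance score itself, are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed representation: lineage (root-to-leaf taxon names) -> relative abundance.
Profile = Dict[Tuple[str, ...], float]


def rollup(profile: Profile, depth: int) -> Dict[Tuple[str, ...], float]:
    """Sum abundances of all taxa that share the same lineage prefix of length `depth`."""
    out: Dict[Tuple[str, ...], float] = defaultdict(float)
    for lineage, abundance in profile.items():
        out[lineage[:depth]] += abundance
    return dict(out)


def similarity(a: Profile, b: Profile, max_depth: int) -> float:
    """Illustrative score: shared abundance (sum of minima) averaged over taxonomy levels."""
    total = 0.0
    for depth in range(1, max_depth + 1):
        ra, rb = rollup(a, depth), rollup(b, depth)
        total += sum(min(v, rb.get(k, 0.0)) for k, v in ra.items())
    return total / max_depth


class TaxonomyIndex:
    """Toy index: bucket each sample by its dominant taxon at rank `key_depth`."""

    def __init__(self, key_depth: int = 2, max_depth: int = 3) -> None:
        self.key_depth = key_depth
        self.max_depth = max_depth
        self.buckets: Dict[Tuple[str, ...], Dict[str, Profile]] = defaultdict(dict)

    def _key(self, profile: Profile) -> Tuple[str, ...]:
        rolled = rollup(profile, self.key_depth)
        return max(rolled, key=rolled.get)

    def insert(self, name: str, profile: Profile) -> None:
        self.buckets[self._key(profile)][name] = profile

    def search(self, query: Profile, top_k: int = 3) -> List[Tuple[str, float]]:
        # Only samples in the query's bucket are scored, which avoids a full scan.
        candidates = self.buckets.get(self._key(query), {})
        scored = sorted(
            ((similarity(query, p, self.max_depth), n) for n, p in candidates.items()),
            reverse=True,
        )
        return [(n, s) for s, n in scored[:top_k]]
```

Calling `insert` for each stored sample and then `search` with a query profile returns the best-scoring samples from the query's bucket only. The trade-off in this sketch is that a similar sample whose dominant taxon differs from the query's is never scored; a practical index would need a way to probe neighbouring buckets.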
The Meta-Storms method can serve as a suitable database management and search system for quickly identifying similar metagenomic samples from a large pool of samples.
Supplementary data are available at Bioinformatics online.