Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1 W8-76 Ookayama, Meguro-ku, Tokyo 152-8550, Japan.
Education Academy of Computational Life Sciences (ACLS), Tokyo Institute of Technology, 4259 J3-141 Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8503, Japan.
Int J Mol Sci. 2017 Oct 11;18(10):2124. doi: 10.3390/ijms18102124.
Sequence similarity searches are widely used in the analysis of metagenomic sequencing data. Finding homologous sequences in a reference database enables the estimation of the taxonomic and functional characteristics of each query sequence. Because current metagenomic sequencing data consist of a large number of nucleotide sequences, the time required for sequence similarity searches accounts for a large proportion of the total analysis time. This time-consuming step makes large-scale analyses difficult to perform. To analyze large-scale metagenomic data, such as those from the human oral microbiome, we developed GHOST-MP (Genome-wide HOmology Search Tool on Massively Parallel system), a parallel sequence similarity search tool for massively parallel computing systems. The tool combines a fast search algorithm based on suffix arrays of the query and database sequences with a hierarchical parallel search to accelerate large-scale sequence similarity searches of metagenomic sequencing data. We evaluated the parallel computing efficiency and the search speed of the tool. GHOST-MP was shown to scale to over 10,000 CPU (central processing unit) cores, and it achieved over 80-fold acceleration compared with mpiBLAST using the same computational resources. We applied the tool to human oral metagenomic data, and the results indicate that the oral cavity, the oral vestibule, and plaque have different characteristics in terms of functional gene categories.
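To make the suffix-array idea mentioned above concrete, the following is a minimal Python sketch of exact seed lookup in a database sequence. It is an illustration under stated assumptions, not the GHOST-MP implementation: the function names (`build_suffix_array`, `find_seed`, `seed_hits`), the naive sorting-based construction, and the fixed seed length `k` are all hypothetical choices made for brevity, and the sketch indexes only the database, whereas the abstract describes suffix arrays of both query and database sequences. The hierarchical parallel search is not covered.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; production tools use
    far more efficient algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed(text, sa, seed):
    """Return the sorted start positions where `seed` occurs exactly in `text`.

    Two binary searches over the suffix array locate the contiguous block
    of suffixes whose first len(seed) characters equal `seed`.
    """
    k = len(seed)

    # First suffix whose k-character prefix is >= seed.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < seed:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose k-character prefix is > seed.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= seed:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def seed_hits(query, database, k):
    """Return (query_pos, database_pos) pairs for every exact k-mer match."""
    sa = build_suffix_array(database)
    hits = []
    for q in range(len(query) - k + 1):
        for d in find_seed(database, sa, query[q:q + k]):
            hits.append((q, d))
    return hits
```

In a full similarity search, seed hits such as these would typically be extended into scored alignments; that step is omitted here.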