Oh Jeongsu, Choi Chi-Hwan, Park Min-Kyu, Kim Byung Kwon, Hwang Kyuin, Lee Sang-Heon, Hong Soon Gyu, Nasir Arshan, Cho Wan-Sup, Kim Kyung Mo
Microbial Resource Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Republic of Korea.
Department of Bio-Information Technology, Chungbuk National University, CheongJu, Republic of Korea.
PLoS One. 2016 Mar 8;11(3):e0151064. doi: 10.1371/journal.pone.0151064. eCollection 2016.
High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Analysis of microbial diversity typically starts with pre-processing, followed by clustering the 16S rRNA reads into a relatively small number of operational taxonomic units (OTUs). OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology allows CLUSTOM-CLOUD to handle larger datasets and to scale computationally better than its ancestor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours using 20, 30, and 40 nodes, respectively. The running-time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm.
CLUSTOM-CLOUD is written in Java and is freely available at http://clustomcloud.kopri.re.kr.
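The scalability claim can be checked with simple arithmetic on the Amazon EC2 timings reported above. The following is a minimal sketch, not part of CLUSTOM-CLOUD: the class and method names (`ScalingEfficiency`, `speedup`, `efficiency`) are hypothetical, the 20-node run is assumed as the baseline, and the inputs are the approximate hour figures from the abstract, so the results are only indicative.

```java
import java.util.Locale;

// Hypothetical helper, not part of CLUSTOM-CLOUD: relative speedup and
// parallel efficiency computed from the approximate timings in the abstract.
final class ScalingEfficiency {

    /** Speedup of a run relative to a baseline run: baselineHours / hours. */
    static double speedup(double baselineHours, double hours) {
        return baselineHours / hours;
    }

    /** Efficiency relative to the baseline: observed speedup divided by the node ratio. */
    static double efficiency(int baselineNodes, double baselineHours, int nodes, double hours) {
        double idealSpeedup = (double) nodes / baselineNodes;
        return speedup(baselineHours, hours) / idealSpeedup;
    }

    public static void main(String[] args) {
        int[] nodes = {20, 30, 40};
        double[] hours = {20.0, 14.0, 11.0}; // approximate values for one million reads
        for (int i = 0; i < nodes.length; i++) {
            System.out.printf(Locale.ROOT, "%d nodes: speedup %.2f, efficiency %.2f%n",
                    nodes[i],
                    speedup(hours[0], hours[i]),
                    efficiency(nodes[0], hours[0], nodes[i], hours[i]));
        }
    }
}
```

With these rounded figures, going from 20 to 30 nodes gives a relative speedup of about 1.43 against an ideal of 1.5, and going to 40 nodes gives about 1.82 against an ideal of 2.0, which corresponds to a relative efficiency of roughly 0.95 and 0.91.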