Wei Ze-Gang, Zhang Shao-Wu
Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China.
Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Science, Baoji, China.
Front Microbiol. 2019 Mar 12;10:428. doi: 10.3389/fmicb.2019.00428. eCollection 2019.
Next-generation sequencing (NGS)-based 16S rRNA sequencing, which combines PCR amplification with NGS technology, is a cost-effective technique that has been successfully used to study the phylogeny and taxonomy of samples from complex microbiomes or environments. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is often the first step in many downstream analyses. Heuristic clustering is one of the most widely employed approaches for generating OTUs. However, most heuristic OTU clustering methods select a single seed sequence to represent each cluster, so their results suffer from either overestimation of the number of OTUs or sensitivity to sequencing errors. In this paper, we present a novel dynamic multi-seeds clustering method (DMSC) to pick OTUs. DMSC first heuristically generates clusters according to the distance threshold. When the size of a cluster reaches the pre-defined minimum size, DMSC selects the multi-core sequences (MCS) as the seeds; the MCS are defined as the n-core sequences (n ≥ 3), in which the distance between any two sequences is less than the distance threshold. A new sequence is assigned to the corresponding cluster depending on its average distance to the MCS and the standard deviation of the distances within the MCS. Each time a new sequence is added to the cluster, the MCS are dynamically updated, until no further sequence is merged into the cluster. DMSC was tested on several simulated and real-life sequence datasets and compared with traditional heuristic methods such as CD-HIT, UCLUST, and DBH. Experimental results in terms of the inferred number of OTUs, normalized mutual information (NMI), and the Matthews correlation coefficient (MCC) demonstrate that DMSC produces higher-quality clusters with low memory usage and reduces OTU overestimation. DMSC is also robust to sequencing errors. The DMSC software can be freely downloaded from https://github.com/NWPU-903PR/DMSC.
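The multi-seed idea described in the abstract can be illustrated with a short Python sketch. This is not the DMSC implementation. The function names, the toy mismatch-fraction distance, the greedy core selection, and in particular the acceptance rule (average distance to the core below the threshold plus the standard deviation of the core's pairwise distances) are assumptions made for illustration, since the abstract does not state the exact rule. The sketch also makes a single pass over the input rather than iterating until no sequence is merged.

```python
from itertools import combinations
from statistics import mean, pstdev


def mismatch_fraction(a, b):
    """Toy distance (assumption): fraction of differing positions, equal-length inputs."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pick_core(members, dist, threshold, n):
    """Greedily pick n members whose pairwise distances are all below the threshold.

    Returns None when no such set of n members is found by the greedy scan.
    """
    core = []
    for seq in members:
        if all(dist(seq, c) < threshold for c in core):
            core.append(seq)
            if len(core) == n:
                return core
    return None


def fits_core(seq, core, dist, threshold):
    """Hypothetical acceptance rule: mean distance to the core, relaxed by the core's spread."""
    avg = mean(dist(seq, c) for c in core)
    spread = pstdev(dist(a, b) for a, b in combinations(core, 2))
    return avg < threshold + spread


def multi_seed_cluster(seqs, dist, threshold, min_size=3, n=3):
    """Single-pass greedy clustering that switches from one seed to a multi-sequence core."""
    clusters = []  # each cluster: {"members": [...], "core": list or None}
    for seq in seqs:
        for cluster in clusters:
            if cluster["core"] is None:
                # Small cluster: compare with the first member, as a single-seed method would.
                accepted = dist(seq, cluster["members"][0]) < threshold
            else:
                accepted = fits_core(seq, cluster["core"], dist, threshold)
            if accepted:
                cluster["members"].append(seq)
                if len(cluster["members"]) >= min_size:
                    # Re-select the core every time the cluster grows.
                    cluster["core"] = pick_core(cluster["members"], dist, threshold, n)
                break
        else:
            # No existing cluster accepted the sequence, so it seeds a new one.
            clusters.append({"members": [seq], "core": None})
    return clusters


reads = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATA", "TTTTTTTTTT", "AAAAAAATAA"]
result = multi_seed_cluster(reads, mismatch_fraction, threshold=0.25)
print([len(c["members"]) for c in result])
```

With these five toy reads, the first three form a cluster and become its core, the all-T read starts a second cluster, and the last read is compared against all three core sequences rather than a single seed.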