School of EECS, Washington State University, 355 NE Spokane St, Pullman, 99164, USA.
Paul G. Allen School for Global Animal Health, Washington State University, Pullman, 99164, USA.
BMC Bioinformatics. 2018 Mar 5;19(1):83. doi: 10.1186/s12859-018-2080-y.
Clustering protein sequences is key to predicting the structure and function of newly sequenced proteins and is also useful for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. This rapid growth has impeded the deployment of existing protein clustering and annotation tools, which depend largely on pairwise sequence alignment.
In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing to identify similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, which permits scalability. We show that coreClust generates results comparable to those of existing methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm achieve a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.
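The sketch-and-compare idea behind Min-Wise Independent Hashing can be illustrated with a minimal Python sketch. It is not the coreClust implementation. The abstract does not define w and c, so the code assumes that w is the shingle (substring) length and c is the number of hash functions. The seeded BLAKE2b hashes only approximate min-wise independent permutations, and all function names and example sequences are hypothetical.

```python
import hashlib


def shingles(seq, w):
    """Return the set of all length-w substrings (w-shingles) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed.

    Assumption: seeded BLAKE2b stands in for a min-wise independent
    permutation; it is an approximation, not an exact family.
    """
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, w, c):
    """Build a sketch: the minimum hash of the w-shingles under each of c seeds."""
    shingle_set = shingles(seq, w)
    if not shingle_set:
        raise ValueError("sequence is shorter than the shingle length w")
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(c)]


def estimated_similarity(sketch_a, sketch_b):
    """Fraction of agreeing sketch positions.

    This estimates the Jaccard similarity of the two underlying shingle sets.
    """
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Hypothetical conserved regions, used for illustration only.
    region_a = "MKVLAAGIVGLLLAQWERTYIPASDFGHKLMNCV"
    region_b = "MKVLAAGIVGLLLAQWERTYIPASDFGHKLMNCA"
    region_c = "PQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFG"

    sk_a, sk_b, sk_c = (sketch(r, w=3, c=64) for r in (region_a, region_b, region_c))
    print("a vs b:", estimated_similarity(sk_a, sk_b))
    print("a vs c:", estimated_similarity(sk_a, sk_c))
```

Because each sketch has a fixed length c regardless of sequence length, two regions are compared in O(c) time without any alignment. Sketches can also be computed independently per sequence, which is one reason this kind of scheme maps naturally onto a framework such as MapReduce.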
The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that, when paired with NADDA, our prior work on detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.