MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores.

Affiliation

Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA.

Publication information

BMC Genomics. 2022 Jun 6;23(1):423. doi: 10.1186/s12864-022-08619-0.

Abstract

BACKGROUND

Tools for accurately clustering biological sequences are among the most important in computational biology. Two pioneering sequence-clustering tools are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improvement in cluster quality. Motivated by this opportunity, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning, and its strong theoretical foundation guarantees convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adopting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0.
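For readers unfamiliar with mean shift, the following is a minimal sketch of the general technique on numeric points, using a flat kernel. It is not the MeShClust implementation, which operates on DNA sequences with alignment-free identity scores and an out-of-core strategy; the function names `mean_shift` and `label_by_mode` and the `bandwidth` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: repeatedly move each estimate to the mean
    of the original points lying within `bandwidth` of it."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    for _ in range(max_iter):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            # Original points inside the window centered on the current estimate
            near = points[np.linalg.norm(points - m, axis=1) <= bandwidth]
            # Keep the estimate unchanged if the window is empty (defensive guard)
            shifted[i] = near.mean(axis=0) if len(near) else m
        moved = np.linalg.norm(shifted - modes, axis=1).max()
        modes = shifted
        if moved < tol:  # stop once no estimate moves noticeably
            break
    return modes


def label_by_mode(modes, bandwidth):
    """Group points whose converged estimates lie within half a bandwidth
    of an already-seen center (an illustrative merging rule)."""
    labels = np.full(len(modes), -1, dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) <= bandwidth / 2:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)


if __name__ == "__main__":
    # Two well-separated toy groups in 2D (made-up data for illustration)
    data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    labels, centers = label_by_mode(mean_shift(data, bandwidth=1.0), 1.0)
    print(labels, centers)
```

The number of clusters is not specified in advance; it follows from the window size, which plays a role loosely analogous to a threshold controlling cluster membership.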

RESULTS

We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0, and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected to test the tools' scalability and their handling of different similarity levels among the sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from the human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%-300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; however, almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate, with high accuracy, an important parameter that controls cluster membership.

CONCLUSIONS

These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help derive accurate biological knowledge.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e653/9171953/ae9505bd5f99/12864_2022_8619_Fig1_HTML.jpg
