CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences.

Authors

Alipour Fatemeh, Hill Kathleen A, Kari Lila

Affiliations

School of Computer Science, University of Waterloo, Waterloo, Canada.

Department of Biology, University of Western Ontario, London, Canada.

Publication information

BMC Genomics. 2024 Dec 18;25(1):1214. doi: 10.1186/s12864-024-11135-y.

Abstract

BACKGROUND

Traditional supervised learning methods for taxonomic classification of DNA sequences rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which limits their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.

RESULTS

This study proposes CGRclust, a novel method that combines unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to apply unsupervised learning for image classification (here, to two-dimensional CGR images) to the clustering of DNA sequence datasets. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in sequence length, number of genomes, number of clusters, and taxonomic level, demonstrates its robustness, scalability, and versatility.
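For readers unfamiliar with Chaos Game Representation, the following is a minimal sketch of how a DNA sequence can be turned into a two-dimensional image of the kind a CNN could consume. It is not the CGRclust implementation: the corner assignment (A, C, G, T), the function names `cgr_points` and `cgr_image`, the resolution parameter `k`, and the choice to skip ambiguous bases are all assumptions made for illustration.

```python
# Illustrative sketch of Chaos Game Representation (CGR); not the authors' code.
# Assumed corner convention: A bottom-left, C top-left, G top-right, T bottom-right.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # assumption: ambiguous bases such as N are skipped
        # Move halfway from the current point toward the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def cgr_image(seq, k=3):
    """Bin the CGR points into a 2^k x 2^k grid of counts (a grayscale image)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_image("ACGTTGCAACGT", k=2):
        print(row)
```

Because each step halves the distance to a corner, the grid cell a point lands in depends on the most recent bases, so the count image reflects the sequence's short-subsequence composition without any alignment.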

CONCLUSION

CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving accuracy and performance superior or comparable to current approaches. CGRclust demonstrated enhanced reliability by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
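The abstract reports clustering accuracy but does not say how it is computed. A common approach for unsupervised methods, assumed here, is to find the one-to-one mapping from cluster IDs to ground-truth labels that maximizes agreement and report the resulting fraction of correctly assigned sequences. The sketch below uses a brute-force search over mappings, which is only practical for a small number of clusters; the function name `clustering_accuracy` and the example labels are hypothetical.

```python
# Illustrative sketch: accuracy of an unsupervised clustering under the best
# one-to-one mapping of cluster IDs to true labels. Assumed metric, not taken
# from the abstract. Brute force, so suitable only for a handful of clusters.
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(true_labels), key=str)
    clusters = sorted(set(cluster_ids), key=str)
    # Pad with placeholder classes so every cluster can receive a distinct target.
    padding = [object() for _ in range(max(0, len(clusters) - len(classes)))]
    targets = classes + padding
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, correct)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = ["fish", "fish", "fungi", "fungi"]
    predicted = [1, 1, 0, 0]  # cluster IDs carry no inherent meaning
    print(clustering_accuracy(truth, predicted))
```

For larger numbers of clusters, the same optimal mapping is usually found with an assignment solver rather than by enumerating permutations.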


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1487/11657719/f5f6ee5f1d93/12864_2024_11135_Fig1_HTML.jpg
