Clustering biological sequences with dynamic sequence similarity threshold.

Affiliation

Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore.

Publication information

BMC Bioinformatics. 2022 Mar 30;23(1):108. doi: 10.1186/s12859-022-04643-9.

DOI: 10.1186/s12859-022-04643-9
PMID: 35354426
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8969259/
Abstract

BACKGROUND

Biological sequence clustering is a complicated data clustering problem owing to the high computational cost of pairwise sequence distance calculations through sequence alignment, as well as the difficulty of determining parameters that yield robust clusters. While current approaches succeed in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. A poor choice of this identity threshold thus leads to low-quality clusters. However, users are given little support in selecting thresholds that are well matched to the input sequences.
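The sensitivity to a single global threshold can be illustrated with a minimal sketch. It assumes a toy greedy scheme in which each sequence joins the first cluster whose representative it matches at or above the threshold. The function names `identity` and `greedy_cluster` are hypothetical, and `difflib.SequenceMatcher.ratio()` is only a stand-in for an alignment-based identity score. It is not the procedure of any specific tool.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity ratio in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Put each sequence into the first cluster whose representative
    (the cluster's first member) it matches at >= threshold."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([s])
    return clusters


# Toy input: two tight pairs with different within-pair similarity.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAAGG"]
for t in (0.85, 0.75):
    print(t, greedy_cluster(seqs, t))
```

On this toy input the stricter threshold splits the second pair into two singletons, while the looser one keeps both pairs intact, so the same global value cannot be judged independently of the data.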

RESULTS

We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and community detection in graphs for cluster generation. Instead of applying a single threshold to every generated cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation on the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained.
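The general idea of combining alignment-free distances with a sequence graph can be sketched as follows. This is not ALFATClust's algorithm: k-mer Jaccard similarity stands in for the alignment-free distance, connected components stand in for community detection, and all function names and the choice `k=3` are assumptions made for illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Alignment-free similarity: shared k-mers over all k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def graph_clusters(seqs, min_similarity, k=3):
    """Connected components of the graph that links every pair of
    sequences whose similarity is >= min_similarity (union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if jaccard_similarity(seqs[i], seqs[j], k) >= min_similarity:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def min_intra_similarity(cluster, seqs, k=3):
    """Lowest pairwise similarity inside a cluster (1.0 for singletons)."""
    sims = [jaccard_similarity(seqs[i], seqs[j], k)
            for i, j in combinations(cluster, 2)]
    return min(sims) if sims else 1.0


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAAGG"]
for cluster in graph_clusters(seqs, min_similarity=0.5):
    print(cluster, min_intra_similarity(cluster, seqs))
```

In this toy input the two clusters are well separated (no shared 3-mers) but have different internal similarity levels. That is the situation in which a per-cluster cut-off is more natural than a single global one.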

CONCLUSIONS

ALFATClust generates sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters, without requiring users to choose precise similarity cut-off thresholds.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/9f89c1e03ac9/12859_2022_4643_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/50fe3ba2add8/12859_2022_4643_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/d2b9355dc3b5/12859_2022_4643_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/b91b2d245d86/12859_2022_4643_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/3e163bcc147d/12859_2022_4643_Fig5_HTML.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c6db/8969259/fb777d17b166/12859_2022_4643_Fig6_HTML.jpg
