Sequence clustering in bioinformatics: an empirical study.

Authors

Zou Quan, Lin Gang, Jiang Xingpeng, Liu Xiangrong, Zeng Xiangxiang

Affiliations

Tianjin University.

University of Electronic Science and Technology of China.

Publication

Brief Bioinform. 2020 Jan 17;21(1):1-10. doi: 10.1093/bib/bby090.

DOI: 10.1093/bib/bby090
PMID: 30239587
Abstract

Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have decreased costs and as a result, massive amounts of DNA/RNA sequences are being produced. The challenge is to cluster the sequence data using stable, quick and accurate methods. For microbiome sequencing data, 16S ribosomal RNA operational taxonomic units are typically used. However, there is often a gap between algorithm developers and bioinformatics users. Different software tools can produce diverse results and users can find them difficult to analyze. Understanding the different clustering mechanisms is crucial to understanding the results that they produce. In this review, we selected several popular clustering tools, briefly explained the key computing principles, analyzed their characters and compared them using two independent benchmark datasets. Our aim is to assist bioinformatics users in employing suitable clustering tools effectively to analyze big sequencing data. Related data, codes and software tools were accessible at the link http://lab.malab.cn/~lg/clustering/.

Similar articles

1
Sequence clustering in bioinformatics: an empirical study.
Brief Bioinform. 2020 Jan 17;21(1):1-10. doi: 10.1093/bib/bby090.
2
Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences.
Front Microbiol. 2021 Mar 24;12:644012. doi: 10.3389/fmicb.2021.644012. eCollection 2021.
3
OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs.
mSphere. 2022 Feb 23;7(1):e0091621. doi: 10.1128/msphere.00916-21. Epub 2022 Feb 2.
4
CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment.
PLoS One. 2016 Mar 8;11(3):e0151064. doi: 10.1371/journal.pone.0151064. eCollection 2016.
5
Open-Source Sequence Clustering Methods Improve the State Of the Art.
mSystems. 2016 Feb 9;1(1). doi: 10.1128/mSystems.00003-15. eCollection 2016 Jan-Feb.
6
bioOTU: An Improved Method for Simultaneous Taxonomic Assignments and Operational Taxonomic Units Clustering of 16s rRNA Gene Sequences.
J Comput Biol. 2016 Apr;23(4):229-38. doi: 10.1089/cmb.2015.0214. Epub 2016 Mar 7.
7
DBH: A de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs.
J Theor Biol. 2017 Jul 21;425:80-87. doi: 10.1016/j.jtbi.2017.04.019. Epub 2017 Apr 26.
8
DACE: a scalable DP-means algorithm for clustering extremely large sequence data.
Bioinformatics. 2017 Mar 15;33(6):834-842. doi: 10.1093/bioinformatics/btw722.
9
MtHc: a motif-based hierarchical method for clustering massive 16S rRNA sequences into OTUs.
Mol Biosyst. 2015 Jul;11(7):1907-13. doi: 10.1039/c5mb00089k.
10
DNACLUST: accurate and efficient clustering of phylogenetic marker genes.
BMC Bioinformatics. 2011 Jun 30;12:271. doi: 10.1186/1471-2105-12-271.

Cited by

1
varVAMP: degenerate primer design for tiled full genome sequencing and qPCR.
Nat Commun. 2025 May 31;16(1):5067. doi: 10.1038/s41467-025-60175-9.
2
TFProtBert: Detection of Transcription Factors Binding to Methylated DNA Using ProtBert Latent Space Representation.
Int J Mol Sci. 2025 Apr 29;26(9):4234. doi: 10.3390/ijms26094234.
3
T4Seeker: a hybrid model for type IV secretion effectors identification.
BMC Biol. 2024 Nov 14;22(1):259. doi: 10.1186/s12915-024-02064-z.
4
isolateR: an R package for generating microbial libraries from Sanger sequencing data.
Bioinformatics. 2024 Jul 1;40(7). doi: 10.1093/bioinformatics/btae448.
5
Analysis of Emerging Variants of Turkey Reovirus using Machine Learning.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae224.
6
Accurately clustering biological sequences in linear time by relatedness sorting.
Nat Commun. 2024 Apr 8;15(1):3047. doi: 10.1038/s41467-024-47371-9.
7
TMSC-m7G: A transformer architecture based on multi-sense-scaled embedding features and convolutional neural network to identify RNA N7-methylguanosine sites.
Comput Struct Biotechnol J. 2023 Dec 1;23:129-139. doi: 10.1016/j.csbj.2023.11.052. eCollection 2024 Dec.
8
WFA-GPU: gap-affine pairwise read-alignment using GPUs.
Bioinformatics. 2023 Dec 1;39(12). doi: 10.1093/bioinformatics/btad701.
9
PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence.
Genes (Basel). 2023 Jul 13;14(7):1441. doi: 10.3390/genes14071441.
10
Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies.
Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac619.