Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing.

Affiliations

School of EECS, Washington State University, 355 NE Spokane St, Pullman, 99164, USA.

Paul G. Allen School for Global Animal Health, Washington State University, Pullman, 99164, USA.

Publication information

BMC Bioinformatics. 2018 Mar 5;19(1):83. doi: 10.1186/s12859-018-2080-y.

DOI:10.1186/s12859-018-2080-y
PMID:29506470
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5838936/
Abstract

BACKGROUND

Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment.

RESULTS

In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.
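
The sketch-and-compare idea behind Min-Wise Independent Hashing can be illustrated with a minimal Python sketch. This is not the coreClust implementation. It assumes that `w` is the shingle (substring) length and `c` is the number of hash functions whose minima form the sketch; the abstract does not define either parameter. The salted BLAKE2b hash and all function names are hypothetical choices made for this illustration.

```python
import hashlib


def shingles(seq, w):
    """Return the set of all overlapping length-w substrings of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def salted_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; seed selects the hash function."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, w, c):
    """Assumed (w, c)-sketch: the minimum hash of the w-shingles under each of c hashes."""
    sh = shingles(seq, w)
    if not sh:
        raise ValueError("sequence is shorter than the shingle length w")
    return [min(salted_hash(s, seed) for s in sh) for seed in range(c)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates the Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, w):
    """Exact Jaccard similarity of the two shingle sets, for comparison."""
    a, b = shingles(seq_a, w), shingles(seq_b, w)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two made-up conserved-region strings that differ in a few residues.
    region_a = "MKVLAAGIVGLLLAQPSWAADLTVK"
    region_b = "MKVLAAGIVGLALAQPSWAGDLTVK"
    sk_a = sketch(region_a, w=3, c=128)
    sk_b = sketch(region_b, w=3, c=128)
    print("estimated:", estimate_jaccard(sk_a, sk_b))
    print("exact:    ", exact_jaccard(region_a, region_b, w=3))
```

Each sketch is computed from a single sequence, independently of all others, and only the fixed-length sketches are compared afterwards. This is why no pairwise alignment is needed, and it is consistent with the abstract's remark that the approach fits the MapReduce framework.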

CONCLUSIONS

The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/103e59b942ba/12859_2018_2080_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/0e2680eda240/12859_2018_2080_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/a5374f3b8be6/12859_2018_2080_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/b8fb4e59e308/12859_2018_2080_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/a91ef0ace211/12859_2018_2080_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/c0aa633a8085/12859_2018_2080_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/c91af8b377b5/12859_2018_2080_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e441/5838936/078a6974544b/12859_2018_2080_Fig8_HTML.jpg
