Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.

Affiliations

SISSA, 34136, Trieste, Italy.

Centre for Evolution and Cancer, The Institute of Cancer Research, London, SM2 5NG, UK.

Publication information

BMC Bioinformatics. 2021 Mar 12;22(1):121. doi: 10.1186/s12859-021-04013-x.

DOI: 10.1186/s12859-021-04013-x
PMID: 33711918
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7955657/
Abstract

BACKGROUND

The identification of protein families is of outstanding practical importance for in silico protein annotation and underlies several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is time-consuming and may suffer from a bias introduced by the hand-curation itself, which is often guided by the available experimental evidence.

RESULTS

We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses only local pairwise alignments between protein sequences as input. In the experiment presented here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single- or multi-domain Pfam family architectures, we also observed some inconsistencies. These were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis, we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
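For readers unfamiliar with Density Peak Clustering, the general idea can be sketched as follows. This is a minimal illustration of the generic algorithm (local density, distance to the nearest denser point, then label propagation), not the authors' pipeline. The function name `density_peak_clustering`, the thresholds `d_c`, `rho_min` and `delta_min`, and the toy one-dimensional data are all assumptions made for the example. The abstract states only that local pairwise alignments are the input and does not say how they are converted into distances, so the sketch simply takes a precomputed distance matrix.

```python
from typing import List, Sequence


def density_peak_clustering(
    dist: Sequence[Sequence[float]],
    d_c: float,
    rho_min: float,
    delta_min: float,
) -> List[int]:
    """Generic Density Peak Clustering on a symmetric distance matrix.

    All parameter names and thresholds are illustrative assumptions.
    """
    n = len(dist)
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Rank points by decreasing density; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n   # distance to the nearest point ranked as denser
    parent = [-1] * n   # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour; use its largest distance.
            delta[i] = max(dist[i])
            continue
        parent[i] = min(order[:rank], key=lambda j: dist[i][j])
        delta[i] = dist[i][parent[i]]

    labels = [-1] * n
    next_label = 0
    for rank, i in enumerate(order):
        # A centre has both high density and a large distance to any denser point.
        if rank == 0 or (rho[i] >= rho_min and delta[i] >= delta_min):
            labels[i] = next_label
            next_label += 1
        else:
            # Non-centres inherit the label of their nearest denser point,
            # which was already labelled because it comes earlier in `order`.
            labels[i] = labels[parent[i]]
    return labels


# Toy data: two well-separated groups on a line (hypothetical values).
points = [0, 1, 2, 50, 51, 52]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(density_peak_clustering(toy_dist, d_c=1.5, rho_min=2, delta_min=10))
# [0, 0, 0, 1, 1, 1]
```

In the toy run, the middle point of each group has the highest local density and becomes a cluster centre, and the remaining points inherit the label of their nearest denser neighbour.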

CONCLUSIONS

The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
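One simple way to quantify overlap between an automatic clustering and a reference classification is to report, for each cluster, its most common reference label and the fraction of members carrying it. The sketch below is a deliberately simplified illustration that assumes one family label per sequence, whereas the study compares clusters with single- and multi-domain family architectures. The function name `dominant_family_per_cluster` and the labels `FAM_A`, `FAM_B` and `FAM_C` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def dominant_family_per_cluster(
    cluster_ids: List[int], family_ids: List[str]
) -> Dict[int, Tuple[str, float]]:
    """Map each cluster to (most common family label, fraction of members with it)."""
    members: Dict[int, List[str]] = defaultdict(list)
    for cluster, family in zip(cluster_ids, family_ids):
        members[cluster].append(family)

    summary: Dict[int, Tuple[str, float]] = {}
    for cluster, families in members.items():
        family, count = Counter(families).most_common(1)[0]
        summary[cluster] = (family, count / len(families))
    return summary


# Hypothetical cluster assignments and reference labels for six sequences.
clusters = [0, 0, 0, 1, 1, 1]
families = ["FAM_A", "FAM_A", "FAM_B", "FAM_C", "FAM_C", "FAM_C"]
for cluster, (family, fraction) in sorted(dominant_family_per_cluster(clusters, families).items()):
    print(cluster, family, round(fraction, 2))
```

A fraction close to 1 indicates a cluster that agrees with a single reference family, while lower values flag clusters worth inspecting by hand.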

Similar articles

1
Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.
BMC Bioinformatics. 2021 Mar 12;22(1):121. doi: 10.1186/s12859-021-04013-x.
2
DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets.
PLoS Comput Biol. 2022 Oct 19;18(10):e1010610. doi: 10.1371/journal.pcbi.1010610. eCollection 2022 Oct.
3
Pfam: a comprehensive database of protein domain families based on seed alignments.
Proteins. 1997 Jul;28(3):405-20. doi: 10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l.
4
SUPFAM--a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes.
Nucleic Acids Res. 2002 Jan 1;30(1):289-93. doi: 10.1093/nar/30.1.289.
5
Exhaustive enumeration of protein domain families.
J Mol Biol. 2003 May 2;328(3):749-67. doi: 10.1016/s0022-2836(03)00269-9.
6
Clustering the annotation space of proteins.
BMC Bioinformatics. 2005 Feb 9;6:24. doi: 10.1186/1471-2105-6-24.
7
EVEREST: automatic identification and classification of protein domains in all protein sequences.
BMC Bioinformatics. 2006 Jun 2;7:277. doi: 10.1186/1471-2105-7-277.
8
The Pfam protein families database: towards a more sustainable future.
Nucleic Acids Res. 2016 Jan 4;44(D1):D279-85. doi: 10.1093/nar/gkv1344. Epub 2015 Dec 15.
9
Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.
Bioinformatics. 2012 Nov 1;28(21):2763-72. doi: 10.1093/bioinformatics/bts533. Epub 2012 Aug 31.
10
Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing.
BMC Bioinformatics. 2018 Mar 5;19(1):83. doi: 10.1186/s12859-018-2080-y.

Cited by

1
Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering.
Sci Data. 2024 Jun 1;11(1):568. doi: 10.1038/s41597-024-03131-4.
2
DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets.
PLoS Comput Biol. 2022 Oct 19;18(10):e1010610. doi: 10.1371/journal.pcbi.1010610. eCollection 2022 Oct.
