Improved global protein homolog detection with major gains in function identification.

Affiliations

Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011.

Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011.

Publication information

Proc Natl Acad Sci U S A. 2023 Feb 28;120(9):e2211823120. doi: 10.1073/pnas.2211823120. Epub 2023 Feb 24.

DOI:10.1073/pnas.2211823120
PMID:36827259
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9992864/
Abstract

There are several hundred million protein sequences, but the relationships among them are not fully available from existing homolog detection methods. There is an essential need for an improved method to push homolog detection to lower levels of sequence identity. The method used here relies on a language model to represent proteins numerically in a matrix (an embedding) and uses discrete cosine transforms to compress the data to extract the most essential part, significantly reducing the data size. This PRotein Ortholog Search Tool (PROST) is significantly faster with linear runtimes, and most importantly, computes the distances between pairs of protein sequences to yield homologs at significantly lower levels of sequence identity than previously. The extent of allosteric effects in proteins points out the importance of global aspects of structure and sequence. PROST excels at global homology detection but not at detecting local homologs. Results are validated by strong similarities between the corresponding pairs of structures. The number of remote homologs detected increased significantly and pushes the effective sequence matches more deeply into the twilight zone. Human protein sequences presently having no assigned function now find significant numbers of putative homologs for 93% of cases and structurally verified assigned functions for 76.4% of these cases. The data compression enables massive searches for homologs with short search times while yielding significant gains in the numbers of remote homologs detected. The method is sufficiently efficient to permit whole-genome/proteome comparisons. The PROST web server is accessible at https://mesihk.github.io/prost.
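The compression step described in the abstract can be illustrated with a minimal sketch. This is not the PROST implementation: the random matrices stand in for language-model embeddings, and the function names (`dct2_matrix`, `compress`, `distance`), the size of the retained coefficient block, and the mean absolute difference used as a distance are all assumptions chosen for illustration. The sketch shows only the general idea that a 2D discrete cosine transform followed by truncation to the low-frequency coefficients turns a variable-length embedding matrix into a short fixed-length vector, so that any two proteins can be compared with one cheap vector operation.

```python
import numpy as np


def dct2_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)  # scale the constant row for orthonormality
    return basis * np.sqrt(2.0 / n)


def compress(embedding: np.ndarray, rows: int = 5, cols: int = 44) -> np.ndarray:
    """Compress a (sequence_length, embedding_dim) matrix to rows * cols numbers.

    Applies a 2D DCT and keeps only the low-frequency rows x cols corner.
    The block size is a hypothetical choice, not a value taken from the paper.
    """
    length, dim = embedding.shape
    coeffs = dct2_matrix(length) @ embedding @ dct2_matrix(dim).T
    block = np.zeros((rows, cols))
    r, c = min(rows, length), min(cols, dim)
    block[:r, :c] = coeffs[:r, :c]  # zero-pad if the input is smaller than the block
    return block.ravel()


def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference between two compressed vectors (assumed metric)."""
    return float(np.abs(a - b).mean())


# Stand-ins for per-residue embeddings of two proteins of different lengths.
rng = np.random.default_rng(0)
protein_a = rng.normal(size=(120, 64))
protein_b = rng.normal(size=(300, 64))

va, vb = compress(protein_a), compress(protein_b)
print(va.shape, vb.shape)    # both vectors have the same fixed length
print(distance(va, vb))      # one number per protein pair
```

Because each protein is reduced to a vector of fixed size independent of its length, comparing a query against a database costs one vector operation per entry, which is consistent with the linear runtime the abstract reports.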

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79d2/9992864/9593d24183e1/pnas.2211823120fig04.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79d2/9992864/fd9c9e6d4c6f/pnas.2211823120fig01.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79d2/9992864/b45fa72037fc/pnas.2211823120fig02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/79d2/9992864/530c1461a4e7/pnas.2211823120fig03.jpg
