Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.

Authors

Asim Muhammad Nabeel, Asif Tayyaba, Hassan Faiza, Dengel Andreas

Affiliations

German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany.

Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany.

Publication information

Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.

DOI: 10.1093/database/baaf027
PMID: 40448683
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12125710/

Abstract

Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps forecast disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to transition from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To address the need for a comprehensive review that presents a holistic view of this wide array of tasks and applications, this manuscript makes several contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of these 63 tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape encompassing 627 benchmark datasets across the 63 tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by summarizing current state-of-the-art performance across the 63 protein sequence analysis tasks.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/4162ad6864d1/baaf027f8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/b11fed2e751f/baaf027f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/e757d716a6ff/baaf027f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/4e8884f72f63/baaf027f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/f3d63cb0b71d/baaf027f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/ca8fc3021f7f/baaf027f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/aaf8d93d094c/baaf027f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/05da/12125710/033082879274/baaf027f7.jpg
