Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models.

Authors

Asim Muhammad Nabeel, Asif Tayyaba, Hassan Faiza, Dengel Andreas

Affiliations

German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany.

Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany.

Publication information

Database (Oxford). 2025 May 30;2025. doi: 10.1093/database/baaf027.

DOI:10.1093/database/baaf027

PMID:40448683

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12125710/

Abstract

Protein sequence analysis examines the order of amino acids within protein sequences to uncover a wealth of knowledge about biological processes and genetic disorders. It helps forecast disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to move from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both. Various review articles have been written to bridge the gap between the two fields, but they focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To address the need for a comprehensive review that presents a holistic view of this wide array of tasks and applications, this manuscript makes several contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of these 63 tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape encompassing 627 benchmark datasets across the 63 tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across the 63 tasks.

