Biological Sequence Representation Methods and Recent Advances: A Review.

Authors

Zhang Hongwei, Shi Yan, Wang Yapeng, Yang Xu, Li Kefeng, Im Sio-Kei, Han Yu

Affiliations

Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Publication information

Biology (Basel). 2025 Aug 27;14(9):1137. doi: 10.3390/biology14091137.

DOI:10.3390/biology14091137
PMID:41007283
Abstract

Biological-sequence representation methods are pivotal for advancing machine learning in computational biology, transforming nucleotide and protein sequences into formats that enhance predictive modeling and downstream task performance. This review categorizes these methods into three developmental stages: computational-based, word embedding-based, and large language model (LLM)-based, detailing their principles, applications, and limitations. Computational-based methods, such as k-mer counting and position-specific scoring matrices (PSSM), extract statistical and evolutionary patterns to support tasks like motif discovery and protein-protein interaction prediction. Word embedding-based approaches, including Word2Vec and GloVe, capture contextual relationships, enabling robust sequence classification and regulatory element identification. Advanced LLM-based methods, leveraging Transformer architectures like ESM3 and RNAErnie, model long-range dependencies for RNA structure prediction and cross-modal analysis, achieving superior accuracy. However, challenges persist, including computational complexity, sensitivity to data quality, and limited interpretability of high-dimensional embeddings. Future directions prioritize integrating multimodal data (e.g., sequences, structures, and functional annotations), employing sparse attention mechanisms to enhance efficiency, and leveraging explainable AI to bridge embeddings with biological insights. These advancements promise transformative applications in drug discovery, disease prediction, and genomics, empowering computational biology with robust, interpretable tools.
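
As a concrete illustration of the simplest family the abstract names, the following is a minimal sketch of k-mer counting for a nucleotide sequence. It is not taken from the review: the function names `kmer_counts` and `kmer_vector`, the default `ACGT` alphabet, and the choice to normalize counts into frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(seq: str, k: int, alphabet: str = "ACGT") -> list[float]:
    """Return normalized k-mer frequencies in a fixed vocabulary order.

    The vocabulary is every length-k string over `alphabet`, enumerated
    with itertools.product, so the vector always has len(alphabet)**k
    entries. k-mers containing symbols outside the alphabet (e.g. N)
    are ignored (an assumption of this sketch).
    """
    counts = kmer_counts(seq, k)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total if total else 0.0 for w in vocab]


# Hypothetical toy sequence, used only to exercise the functions.
example = kmer_vector("ACGTACGT", 2)
```

The fixed vocabulary order is what makes the output usable as a feature vector: two sequences of different lengths map to vectors of the same dimension, which a downstream classifier can consume directly. The dimension grows as `len(alphabet)**k`, which is one reason such representations become sparse and costly for large k.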
