Zhang Hongwei, Shi Yan, Wang Yapeng, Yang Xu, Li Kefeng, Im Sio-Kei, Han Yu
Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.
Biology (Basel). 2025 Aug 27;14(9):1137. doi: 10.3390/biology14091137.
Biological-sequence representation methods are pivotal for advancing machine learning in computational biology: they transform nucleotide and protein sequences into formats that improve predictive modeling and downstream task performance. This review categorizes these methods into three developmental stages: computational-based, word embedding-based, and large language model (LLM)-based. For each stage it details the principles, applications, and limitations. Computational-based methods, such as k-mer counting and position-specific scoring matrices (PSSM), extract statistical and evolutionary patterns to support tasks like motif discovery and protein-protein interaction prediction. Word embedding-based approaches, including Word2Vec and GloVe, capture contextual relationships between sequence fragments, supporting sequence classification and regulatory element identification. LLM-based methods, built on Transformer architectures such as ESM3 and RNAErnie, model long-range dependencies for tasks such as RNA structure prediction and cross-modal analysis, thereby achieving higher accuracy. However, challenges persist, including computational complexity, sensitivity to data quality, and the limited interpretability of high-dimensional embeddings. Future directions prioritize integrating multimodal data (e.g., sequences, structures, and functional annotations), employing sparse attention mechanisms to improve efficiency, and using explainable AI to connect embeddings with biological insights. These advances are expected to benefit drug discovery, disease prediction, and genomics by giving computational biology robust, interpretable tools.
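As an illustration of the first stage, k-mer counting can be sketched in a few lines of Python. This is a minimal sketch and not code from the review: the function name `kmer_counts`, the default `k=3`, and the DNA alphabet `ACGT` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Return (vocabulary, count vector) for all overlapping k-mers.

    Hypothetical helper for illustration. The vocabulary is every
    possible k-mer over `alphabet` in lexicographic order, so the
    vector has a fixed length of len(alphabet) ** k for any input.
    """
    seq = sequence.upper()
    # Slide a window of width k across the sequence, one position at a time.
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # k-mers containing symbols outside the alphabet (e.g. "N") are
    # simply not represented in the output vector.
    vector = [observed.get(kmer, 0) for kmer in vocabulary]
    return vocabulary, vector


vocabulary, vector = kmer_counts("ACGTACGT", k=2)
counts = dict(zip(vocabulary, vector))
print(counts["AC"], counts["CG"], counts["GT"], counts["TA"])
```

The resulting fixed-length vector can be passed to a conventional classifier. Its length grows as len(alphabet) ** k, which is one reason larger values of k quickly become expensive.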