Zhang Zixun, Zhou Yuzhe, Zheng Jiayou, Feng Chunmei, Cui Shuguang, Wang Sheng, Li Zhen
FNii-Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, Guangdong, China; School of Science and Engineering, the Chinese University of Hong Kong (Shenzhen), 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, Guangdong, China.
Institute of High Performance Computing, A*STAR, 1 Fusionopolis Way #16-16 Connexis, Singapore, 138632, Singapore.
Comput Biol Med. 2025 Sep;195:110607. doi: 10.1016/j.compbiomed.2025.110607. Epub 2025 Jun 30.
Large-scale Protein Language Models (PLMs), such as the Evolutionary Scale Modeling (ESM) family, have significantly advanced our understanding of protein structure and function. These models have shown immense potential in biomedical applications, including drug discovery, protein design, and understanding disease mechanisms at the molecular level. However, PLMs are typically pre-trained on residue sequences alone, with limited incorporation of structural information, leaving room for further enhancement. In this paper, we propose Structure Information Injecting Tuning (SI-Tuning), a parameter-efficient fine-tuning method that integrates structural information into PLMs. SI-Tuning keeps the original model parameters frozen while optimizing task-specific vectors for the input embeddings and attention maps. Structural features, including dihedral angles and distance maps, are used to derive these vectors, injecting structural information that improves model performance on downstream tasks. Extensive experiments on the 650M-parameter ESM-2 demonstrate the effectiveness of SI-Tuning across multiple downstream tasks. Specifically, SI-Tuning achieves an accuracy of 93.95% on DeepLoc binary classification and 76.05% on Metal Ion Binding, outperforming SaProt, a large-scale PLM pre-trained with structural modeling. SI-Tuning effectively enhances the performance of PLMs by incorporating structural information in a parameter-efficient manner. Our method not only improves downstream task performance but also offers significant computational efficiency, making it a valuable strategy for applying large-scale PLMs to a broad range of biomedical downstream applications. Code is available at https://github.com/Nocturne0256/SI-tuning.
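As a rough illustration of the injection mechanism the abstract describes, the minimal PyTorch sketch below (not the authors' implementation; see https://github.com/Nocturne0256/SI-tuning for the real code) projects per-residue dihedral angles into an additive input-embedding offset and a pairwise distance map into a per-head attention bias, with only the small injector module trained while the PLM backbone stays frozen. The module names, the six-dimensional dihedral encoding, and the backbone interface are all assumptions made for illustration.

```python
# A minimal sketch of the SI-Tuning idea: derive trainable bias vectors
# from structural features and add them to a frozen PLM's input
# embeddings and attention maps. Illustrative only; names and feature
# dimensions are assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class StructureInjector(nn.Module):
    """Maps structural features to (a) an additive input-embedding
    offset and (b) an additive per-head attention bias, the two
    injection points described in the abstract. Only this module is
    optimized; the PLM backbone parameters stay frozen."""

    def __init__(self, d_model: int, n_heads: int, n_dihedrals: int = 6):
        super().__init__()
        # Per-residue dihedral angles (e.g. sin/cos of phi, psi, omega,
        # an assumed 6-dim encoding) -> embedding-space offset.
        self.embed_proj = nn.Linear(n_dihedrals, d_model)
        # Pairwise residue distances -> one bias scalar per attention head.
        self.attn_proj = nn.Linear(1, n_heads)

    def forward(self, dihedrals: torch.Tensor, dist_map: torch.Tensor):
        # dihedrals: (B, L, n_dihedrals); dist_map: (B, L, L)
        embed_bias = self.embed_proj(dihedrals)             # (B, L, d_model)
        attn_bias = self.attn_proj(dist_map.unsqueeze(-1))  # (B, L, L, H)
        attn_bias = attn_bias.permute(0, 3, 1, 2)           # (B, H, L, L)
        return embed_bias, attn_bias

# Toy usage with random features. D=1280 and H=20 match 650M ESM-2's
# hidden size and head count; the backbone hookup is sketched in comments.
B, L, D, H = 2, 100, 1280, 20
injector = StructureInjector(d_model=D, n_heads=H)
dihedrals = torch.randn(B, L, 6)
coords = torch.randn(B, L, 3)            # stand-in for CA coordinates
dist_map = torch.cdist(coords, coords)   # (B, L, L) pairwise distances
embed_bias, attn_bias = injector(dihedrals, dist_map)
# token_embeddings = frozen_plm.embed(tokens) + embed_bias
# attn_logits = attn_logits + attn_bias   (pre-softmax, in each layer)
```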