Peng Zhangzhi, Schussheim Benjamin, Chatterjee Pranam
Department of Biomedical Engineering, Duke University.
Department of Computer Science, Duke University.
bioRxiv. 2024 Feb 29:2024.02.28.581983. doi: 10.1101/2024.02.28.581983.
Proteins serve as the workhorses of living organisms, orchestrating a wide array of vital functions. Post-translational modifications (PTMs) of their amino acids greatly influence the structural and functional diversity of different protein types and uphold proteostasis, allowing cells to swiftly respond to environmental changes and intricately regulate complex biological processes. To this point, efforts to model the complex features of proteins have involved the training of large and expressive protein language models (pLMs) such as ESM-2 and ProtT5, which accurately encode structural, functional, and physicochemical properties of input protein sequences. However, the over 200 million sequences that these pLMs were trained on merely scratch the surface of proteomic diversity, as they neither input nor account for the effects of PTMs. In this work, we fill this major gap in protein sequence modeling by introducing PTM tokens into the pLM training regime. We then leverage recent advancements in structured state space models (SSMs), specifically Mamba, which utilizes efficient hardware-aware primitives to overcome the quadratic time complexities of Transformers. After adding a comprehensive set of PTM tokens to the model vocabulary, we train bidirectional Mamba blocks whose outputs are fused with state-of-the-art ESM-2 embeddings via a novel gating mechanism. We demonstrate that our resultant PTM-aware pLM, PTM-Mamba, improves upon ESM-2's performance on various PTM-specific tasks. PTM-Mamba is the first and only pLM that can uniquely input and represent both wild-type and PTM sequences, motivating downstream modeling and design applications specific to post-translationally modified proteins. To facilitate PTM-aware protein language modeling applications, we have made our model available at: https://huggingface.co/ChatterjeeLab/PTM-Mamba.
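The abstract describes the architecture only at a high level: bidirectional Mamba blocks run over a PTM-token-augmented sequence, and their outputs are combined with ESM-2 embeddings through a learned gate. The sketch below is a minimal, hypothetical PyTorch illustration of that fusion idea, not the authors' implementation; a GRU stands in for the Mamba SSM kernel so the example stays self-contained, and all module and variable names (GatedFusion, BidirectionalBlock, h_esm) are invented for illustration.

```python
# Hypothetical sketch of gated fusion between a bidirectional sequence model
# and precomputed ESM-2 embeddings (assumption: both share hidden size d_model).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse sequence-model hidden states with ESM-2 embeddings via a learned gate."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_seq: torch.Tensor, h_esm: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per position and channel, how much of each
        # representation to keep in the fused output.
        g = torch.sigmoid(self.gate(torch.cat([h_seq, h_esm], dim=-1)))
        return g * h_seq + (1.0 - g) * h_esm

class BidirectionalBlock(nn.Module):
    """Run a sequence model forward and backward and sum the two passes.
    A GRU is used here purely as a stand-in; the paper uses Mamba blocks."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_f, _ = self.fwd(x)
        out_b, _ = self.bwd(torch.flip(x, dims=[1]))
        return out_f + torch.flip(out_b, dims=[1])

# Toy usage: batch of 2 sequences, length 10, hidden size 64.
d = 64
x_tokens = torch.randn(2, 10, d)   # embeddings of PTM-token-augmented sequence
h_esm = torch.randn(2, 10, d)      # precomputed ESM-2 embeddings (assumed aligned)
h = BidirectionalBlock(d)(x_tokens)
fused = GatedFusion(d)(h, h_esm)   # PTM-aware representation
print(fused.shape)                 # torch.Size([2, 10, 64])
```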