Vincoff Sophia, Goel Shrey, Kholina Kseniia, Pulugurta Rishab, Vure Pranay, Chatterjee Pranam
Department of Biomedical Engineering, Duke University, Durham, NC, USA.
Department of Computer Science, Duke University, Durham, NC, USA.
Nat Commun. 2025 Feb 7;16(1):1436. doi: 10.1038/s41467-025-56745-6.
Fusion oncoproteins, a class of chimeric proteins arising from chromosomal translocations, are major drivers of various pediatric cancers. These proteins are intrinsically disordered and lack druggable pockets, making them highly challenging therapeutic targets for both small molecule-based and structure-based approaches. Protein language models (pLMs) have recently emerged as powerful tools for capturing physicochemical and functional protein features but have yet to be trained on fusion oncoprotein sequences. We introduce FusOn-pLM, a fine-tuned pLM trained on a newly curated, comprehensive set of fusion oncoprotein sequences, FusOn-DB. Employing a unique cosine-scheduled masked language modeling strategy, FusOn-pLM dynamically adjusts masking rates (15%-40%) to optimize feature extraction and representation quality, surpassing baseline embeddings in fusion-specific tasks, including localization, puncta formation, and disorder prediction. FusOn-pLM uniquely predicts drug-resistant mutations, providing insights for therapeutic design that anticipates resistance mechanisms. In total, FusOn-pLM provides biologically relevant representations for advancing therapeutic discovery in fusion-driven cancers.
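The cosine-scheduled masking described above can be illustrated with a minimal sketch. The abstract gives only the 15%-40% range, so the rest is assumed: the direction of the schedule (rising from 15% at the start of training to 40% at the end), the half-cosine form, the function names `masking_rate` and `mask_sequence`, and the toy sequence are all hypothetical and are not taken from the FusOn-pLM code.

```python
import math
import random

# Range stated in the abstract; everything else below is an assumption.
MIN_RATE = 0.15
MAX_RATE = 0.40


def masking_rate(step, total_steps, low=MIN_RATE, high=MAX_RATE):
    """Half-cosine ramp from `low` at step 0 to `high` at the final step.

    Assumption: the rate increases over training; the abstract does not
    state the direction or the exact formula.
    """
    if total_steps <= 1:
        return low
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return low + 0.5 * (high - low) * (1.0 - math.cos(math.pi * progress))


def mask_sequence(sequence, rate, rng, mask_token="<mask>"):
    """Replace a random fraction `rate` of residues with `mask_token`.

    Returns the masked token list and the sorted masked positions.
    """
    tokens = list(sequence)
    if not tokens:
        return tokens, []
    n_mask = min(len(tokens), max(1, round(rate * len(tokens))))
    positions = rng.sample(range(len(tokens)), n_mask)
    for i in positions:
        tokens[i] = mask_token
    return tokens, sorted(positions)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_sequence = "MKTAYIAKQRQISFVKSHFS"  # hypothetical 20-residue example
    total_steps = 5
    for step in range(total_steps):
        rate = masking_rate(step, total_steps)
        masked, positions = mask_sequence(toy_sequence, rate, rng)
        print(f"step {step}: rate={rate:.3f}, masked {len(positions)} residues")
```

In a real masked language modeling setup, the scheduled rate would be passed to whatever component selects positions to mask in each batch; the per-sequence helper here only shows how the number of masked residues grows as the rate moves through the range.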