Elkin Magdalyn E, Zhu Xingquan
Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
Commun Biol. 2025 Jan 21;8(1):98. doi: 10.1038/s42003-024-07262-7.
Predicting novel mutations has a long-lasting impact on life science research. Traditionally, this problem is addressed through wet-lab experiments, which are often expensive and time-consuming. Recent advances in neural language models have produced striking results in modeling and deciphering sequences. In this paper, we propose Deep Novel Mutation Search (DNMS), a method that uses deep neural networks to model protein sequences for mutation prediction. We take the SARS-CoV-2 spike protein as the target and use a protein language model to predict novel mutations. Unlike existing research, which is often limited to mutating the reference sequence for prediction, we propose a parent-child mutation prediction paradigm in which a parent sequence is modeled for mutation prediction. Because mutations change the context of the underlying sequence, DNMS models three aspects of a protein sequence: semantic changes, grammatical changes, and attention changes, which capture, respectively, shifts in semantics, grammatical coherence, and amino-acid interactions in latent space. A ranking approach combines all three aspects to capture mutations that demonstrate evolving traits, in accordance with real-world SARS-CoV-2 spike protein sequence evolution. DNMS can be adopted for an early-warning variant detection system, raising public health awareness of future SARS-CoV-2 mutations.
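The abstract does not say how the three aspects are combined beyond "a ranking approach", so the following is only a minimal sketch of one common option, rank-sum aggregation, and not necessarily the DNMS procedure. The function names, the placeholder mutation labels, the numeric scores, and the convention that a higher score means a more promising candidate are all assumptions made for illustration.

```python
from typing import Dict, List


def rank_descending(scores: Dict[str, float]) -> Dict[str, int]:
    """Return 1-based ranks, with the highest score ranked 1.

    Ties are broken by insertion order because sorted() is stable.
    """
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}


def combine_ranks(
    semantic: Dict[str, float],
    grammar: Dict[str, float],
    attention: Dict[str, float],
) -> List[str]:
    """Order candidate mutations by the sum of their three per-aspect ranks.

    Assumption: each dict maps the same candidate labels to a score where
    higher is better. A lower rank sum places a candidate earlier.
    """
    if not (set(semantic) == set(grammar) == set(attention)):
        raise ValueError("all three score tables must cover the same candidates")
    ranks = [rank_descending(s) for s in (semantic, grammar, attention)]
    total = {m: sum(r[m] for r in ranks) for m in semantic}
    # Break equal rank sums alphabetically so the output is deterministic.
    return sorted(total, key=lambda m: (total[m], m))


# Hypothetical scores for three placeholder candidates (not real results).
semantic = {"mut_a": 0.9, "mut_b": 0.4, "mut_c": 0.1}
grammar = {"mut_a": 0.2, "mut_b": 0.8, "mut_c": 0.5}
attention = {"mut_a": 0.6, "mut_b": 0.9, "mut_c": 0.3}

print(combine_ranks(semantic, grammar, attention))
```

Summing ranks rather than raw scores keeps the three aspects comparable even when their scores live on different scales, which is the usual motivation for rank-based combination.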