Department of Electrical and Computer Engineering, Texas Tech University, TX, USA.
Department of Statistics, University of Nebraska - Lincoln, NB, USA.
Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac128.
Predicting protein properties from amino acid sequences is an important problem in biology and pharmacology. Protein-protein interactions among the SARS-CoV-2 spike protein, human receptors and antibodies are key determinants of the potency of this virus and its ability to evade the human immune response. As a rapidly evolving virus, SARS-CoV-2 has already developed into many variants that differ considerably in virulence. Utilizing the proteomic data of SARS-CoV-2 to predict its viral characteristics will, therefore, greatly aid in disease control and prevention. In this paper, we review and compare recent successful prediction methods based on long short-term memory (LSTM), transformer and convolutional neural network (CNN) models and a similarity-based topological regression (TR) model, and offer recommendations about the appropriate predictive methodology depending on the similarity between training and test datasets. We compare the effectiveness of these models in predicting the binding affinity and expression of SARS-CoV-2 spike protein sequences. We also explore how effective these predictive methods are when they are trained on laboratory-created data and tasked with predicting the binding affinity of in-the-wild SARS-CoV-2 spike protein sequences obtained from the GISAID datasets. We observe that TR is the better method when the sample size is small and the test protein sequences are sufficiently similar to the training sequences. However, when the training sample size is sufficiently large and prediction requires extrapolation, the LSTM embedding and CNN-based predictive models show superior performance.