Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
Department of Virology, School of Medicine, Iran University of Medical Sciences, Tehran, Iran.
Sci Rep. 2023 Sep 11;13(1):14944. doi: 10.1038/s41598-023-42089-y.
Influenza virus hemagglutinin plays an important role in the attachment of the virus to host cells. Hemagglutinin is encoded by one of the genetic regions of the virus with a high potential for mutation. Because predicting mutations is important for producing effective, low-cost vaccines, approaches to this problem have recently gained significant attention. Such approaches train predictive models on a historical record of mutations. However, the imbalance between mutated and preserved proteins is a major challenge for developing these models and needs to be addressed. Here, we propose to tackle this challenge through anomaly detection (AD). AD is a well-established field in machine learning (ML) that tries to distinguish unseen anomalies from normal patterns using only normal training samples. By treating mutations as anomalous behavior, we can benefit from the rich set of solutions that have recently emerged in this field. Such methods also fit the problem setup, in which the number of unmutated training samples far exceeds the number of mutated ones. Motivated by this formulation, our method tries to find a compact representation for unmutated samples while forcing anomalies to be separated from the normal ones. This encourages the model to learn, as far as possible, a representation shared among the normal training samples, which makes mutated samples easier to distinguish from unmutated ones at test time. We conduct a large number of experiments on four publicly available datasets, comprising three different hemagglutinin protein datasets and one SARS-CoV-2 dataset, and show the effectiveness of our method using different standard criteria.
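The idea of a compact representation for normal samples can be illustrated with a minimal sketch. This is not the authors' method: it assumes each protein sample has already been encoded as a fixed-length feature vector, uses synthetic data in place of real sequences, and replaces the learned representation with a simple centroid of the unmutated training samples. It also uses only normal samples for fitting and does not include any term that pushes anomalies away. The names `fit_center` and `anomaly_score` are hypothetical.

```python
import numpy as np


def fit_center(normal_features: np.ndarray) -> np.ndarray:
    """Return the mean of the normal (unmutated) feature vectors."""
    return normal_features.mean(axis=0)


def anomaly_score(features: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance to the center; larger means more anomalous."""
    return ((features - center) ** 2).sum(axis=1)


# Synthetic stand-ins for encoded samples (assumption: 8-dimensional features).
rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 0.1, size=(200, 8))   # unmutated, training
normal_test = rng.normal(0.0, 0.1, size=(20, 8))     # unmutated, test
mutated_test = rng.normal(1.0, 0.1, size=(20, 8))    # mutated, test only

# Fit on normal samples only, as in one-class anomaly detection.
center = fit_center(normal_train)

# Flag anything farther from the center than 95% of the training samples.
threshold = np.quantile(anomaly_score(normal_train, center), 0.95)

normal_scores = anomaly_score(normal_test, center)
mutated_scores = anomaly_score(mutated_test, center)
flagged_as_mutated = mutated_scores > threshold
```

The tighter the normal samples cluster around the center, the lower the threshold can be set, which is why a compact representation of unmutated samples makes mutated ones easier to detect at test time.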