PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences.

Authors

Ali Sarwan, Bello Babatunde, Chourasia Prakash, Punathil Ria Thazhe, Zhou Yijing, Patterson Murray

Affiliation

Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA.

Publication information

Biology (Basel). 2022 Mar 9;11(3):418. doi: 10.3390/biology11030418.

Abstract

The study of host specificity is closely connected to the question of how SARS-CoV-2, the virus that led to the COVID-19 pandemic, originated in humans, which remains an important open question. It has been speculated that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. Studying the different hosts that can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts, including birds, bats, camels, swine, humans, and weasels. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations and to use them in the context of host classification. The results on real-world data show that, when using PWM2Vec, machine learning classifiers perform comparably to the baseline models in terms of predictive performance and runtime; in some cases, the performance is better. We also measure the importance of different amino acids using information gain, to show which amino acids are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
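
The abstract does not spell out how a PWM is turned into a feature vector, so the following is only a minimal sketch of one plausible construction, not the authors' implementation. It assumes that a log-odds PWM is built from the overlapping k-mers of a single sequence and that each k-mer is then scored against that PWM, giving one number per k-mer. The function names (`kmers`, `position_weight_matrix`, `pwm_embedding`), the choice of k = 3, the pseudocount of 1, the uniform background of 1/20, and the toy sequences are all assumptions made for illustration. Under these assumptions, sequences of equal length (for example, aligned spike sequences) yield vectors of equal length.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def position_weight_matrix(kmer_list, k, pseudocount=1.0, background=1.0 / 20):
    """Build a log-odds PWM (dict: residue -> list of k weights) from k-mers.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    counts = {aa: [pseudocount] * k for aa in AMINO_ACIDS}
    for kmer in kmer_list:
        for pos, aa in enumerate(kmer):
            if aa in counts:  # skip ambiguous residues such as 'X'
                counts[aa][pos] += 1.0
    pwm = {aa: [0.0] * k for aa in AMINO_ACIDS}
    for pos in range(k):
        column_total = sum(counts[aa][pos] for aa in AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            prob = counts[aa][pos] / column_total
            pwm[aa][pos] = math.log2(prob / background)
    return pwm


def pwm_embedding(seq, k=3):
    """Score each k-mer of seq against a PWM built from that same sequence."""
    kmer_list = kmers(seq, k)
    pwm = position_weight_matrix(kmer_list, k)
    return [
        sum(pwm[aa][pos] for pos, aa in enumerate(kmer) if aa in pwm)
        for kmer in kmer_list
    ]


if __name__ == "__main__":
    # Made-up toy strings, not real spike sequences.
    toy_sequences = ["MFVFLVLLPLVSSQ", "MFIFLLFLTLTSGS"]
    for s in toy_sequences:
        vec = pwm_embedding(s, k=3)
        print(len(vec), [round(x, 2) for x in vec[:4]])
```

A sequence of length n produces n - k + 1 scores, so the resulting vectors could be passed directly to a standard classifier once all input sequences share the same length.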

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/760e/8945605/f1a33e566ffc/biology-11-00418-g001.jpg
