Chaturvedi Anusha, Borkar Kushal, Priyakumar U Deva, Vinod P K
International Institute of Information Technology, Hyderabad, Telangana, 500032, India.
Heliyon. 2023 Feb;9(2):e13646. doi: 10.1016/j.heliyon.2023.e13646. Epub 2023 Feb 11.
Coronavirus, a zoonotic virus capable of transmitting infection from animals to humans, recently caused a pandemic. In such circumstances, it is essential to understand the virus's origin. In this study, we present a novel machine-learning pipeline for host prediction in the family Coronaviridae. We leverage the complete viral genome as well as protein-level sequences (spike, membrane, and nucleocapsid proteins). Compared with current state-of-the-art approaches, the random forest model attained a high accuracy of 99.91% and a recall of 0.98 on genome sequences. In addition to spike protein sequences, our study shows that membrane and nucleocapsid protein sequences can be used to predict the host of a virus. We also identified important sites in the viral sequences that help distinguish between host classes. The host prediction pipeline will serve as a valuable tool for taking effective measures to control the transmission of future viruses.
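As an illustration of how a genome sequence might be turned into fixed-length input for a classifier such as a random forest, the sketch below computes normalized k-mer frequencies. The abstract does not state which sequence encoding the pipeline uses, so the k-mer representation, the function name kmer_features, and the default k = 3 are assumptions made for this example only, not the authors' method.

```python
from collections import Counter
from itertools import product
from typing import List

ALPHABET = "ACGT"  # assumption: nucleotide input; protein input would need a 20-letter alphabet


def kmer_features(sequence: str, k: int = 3) -> List[float]:
    """Return normalized k-mer frequencies in fixed lexicographic order.

    Hypothetical helper for illustration; the encoding used in the
    study is not described in the abstract.
    """
    seq = sequence.upper()
    # Enumerate all 4**k possible k-mers so every sequence maps to the same vector length.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (for example N) are not in `kmers` and are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]
```

Each genome then becomes a vector of length 4^k (64 for k = 3), which could be passed, together with host labels, to any standard classifier implementation.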