A computational method for predicting nucleocapsid protein in retroviruses.

Author information

Cardiovascular Department, The First Affiliated Hospital of Xi'an Jiaotong University, No. 277 W. Yanta Road, Xi'an, 710061, Shaanxi, People's Republic of China.

School of Electronics & Control Engineering, Chang'an University, Middle Section of Nan Er Huan, Xi'an, 710064, Shaanxi, People's Republic of China.

Publication information

Sci Rep. 2022 Jan 11;12(1):524. doi: 10.1038/s41598-021-03182-2.

DOI:10.1038/s41598-021-03182-2

PMID:35017554

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8752852/

Abstract

The nucleocapsid protein (NC), part of the group-specific antigen (gag) of retroviruses, is essential to the interactions of most retroviral gag proteins with RNA. A computational method for predicting NCs would benefit subsequent structural analysis and functional studies. However, no computational method for predicting the exact locations of NCs in retroviruses has yet been proposed, and the wide variation in NC length adds to the difficulty. This paper proposes such a method. All available retrovirus sequences with NC annotations were collected from NCBI. Models based on random forest (RF) and weighted support vector machine (WSVM) were built to predict the initiation and termination sites of NCs. Factor analysis scales of generalized amino acid information, together with a position weight matrix, were used to generate the feature space. Homology-based gene prediction methods were also compared and integrated to improve predictive performance. The predicted candidate initiation and termination sites were then combined and screened according to their intervals, decision values and alignment scores. All available gag sequences without NC annotations were scanned with the model to detect putative NCs. Under fivefold cross-validation, the geometric means of sensitivity and specificity for predicting initiation and termination sites are 0.9900 and 0.9548, respectively. The model combining WSVM, RF and simple alignment predicted 90.91% of the collected NC-annotated retrovirus sequences entirely correctly, and this composite model outperforms each individual model. The model detected 235 putative NCs in unannotated gags. The method performs well on NC recognition and could be extended to other gene prediction problems, especially those whose training samples vary widely in length.
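The performance figure quoted in the abstract, the geometric mean of sensitivity and specificity, can be illustrated with a minimal sketch. The function name `g_mean` and the confusion-matrix counts below are hypothetical and are not taken from the paper; they only show how the metric is computed.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion counts.

    Assumes at least one positive and one negative sample, so that
    neither denominator is zero.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts, for illustration only (not data from the paper).
score = g_mean(tp=90, fn=10, tn=160, fp=40)
print(f"G-mean: {score:.4f}")
```

Unlike plain accuracy, this metric collapses toward zero when either class is poorly recognized, which is likely why it suits site prediction, where true sites are far rarer than non-sites.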
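The abstract names a position weight matrix as one source of features for the site classifiers. As a rough illustration of that general technique, the sketch below builds a log-odds matrix from equal-length sequence windows and scores a candidate window against it. The window length, the pseudocount, the uniform background frequency and the function names `build_pwm` and `score_window` are all assumptions made for this example; the abstract does not specify how the authors constructed their matrix.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(windows, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length windows.

    Assumes a uniform background over the 20 standard amino acids
    (an illustrative choice, not the paper's stated setup).
    """
    length = len(windows[0])
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        # Start every residue at the pseudocount to avoid log(0).
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for window in windows:
            if window[pos] in counts:
                counts[window[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {aa: math.log2((c / total) / background) for aa, c in counts.items()}
        )
    return pwm


def score_window(pwm, window):
    """Sum per-position log-odds; unknown residues contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pwm, window))


# Hypothetical windows around annotated sites (not real NC data).
training_windows = ["MAKR", "MAKK", "MGKR"]
pwm = build_pwm(training_windows)
print(score_window(pwm, "MAKR") > score_window(pwm, "WWWW"))
```

In a pipeline like the one the abstract outlines, such scores would be one group of inputs alongside other descriptors, rather than a predictor on their own.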