Ma Yucheng, Liu Ruiling, Lv Hongqiang, Han Jiuqiang, Zhong Dexing, Zhang Xinman
School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China.
PLoS One. 2017 May 4;12(5):e0176909. doi: 10.1371/journal.pone.0176909. eCollection 2017.
Human endogenous retroviruses (HERVs) encode active retroviral proteins, which may be involved in the progression of cancer and other diseases. Matrix protein (MA), encoded in the group-specific antigen gene (gag) of retroviruses, is associated with the virus envelope glycoproteins in most mammalian retroviruses and may be involved in virus particle assembly, transport and budding. However, the number of annotated MAs in ERVs remains low, and no computational method has yet been proposed to predict the exact start and end coordinates of MAs within gag genes. In this paper, a computational method to identify MAs in ERVs is proposed. A divide-and-conquer technique was designed and applied to the conventional prediction model to obtain better results on gene sequences of varying lengths. Initiation sites and termination sites were predicted separately and then combined according to their intervals. Three algorithms were applied and compared: weighted support vector machine (WSVM), weighted extreme learning machine (WELM) and random forest (RF). Under 5-fold cross-validation, the random forest models achieved G-mean (geometric mean of sensitivity and specificity) values of 0.9869 for initiation sites and 0.9755 for termination sites, the highest among the algorithms applied. Our prediction models combine the RF and WSVM algorithms to achieve the best prediction results. Of all the collected ERV sequences with complete MAs (125 in total), 98.4% were predicted exactly by the models. The model was used to scan 94,671 HERV sequences from 118 families, and 104 new putative MAs were predicted in human chromosomes. The distributions of the putative MAs and the optimization of model parameters were also analyzed. The prediction method was also extended to other retroviruses, with satisfactory results.
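The G-mean reported above can be illustrated with a minimal Python sketch. The function name `g_mean` and the toy labels are assumptions made for illustration only; they are not taken from the paper, and the sketch does not reproduce the authors' models or data. It only shows how the geometric mean of sensitivity and specificity is computed from binary predictions (1 = site, 0 = non-site).

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels.

    Labels are assumed to be 1 (positive, e.g. a true site) or 0 (negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Guard against empty classes so the function never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 2 positives, 8 negatives, one positive missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, G-mean = {g_mean(y_true, y_pred):.4f}")
```

In this toy case, plain accuracy is 0.90 even though half of the positives are missed, while the G-mean falls to about 0.71. That sensitivity to the minority class is a common reason to prefer G-mean when positives are rare, which is presumably the setting here, since true sites are likely far fewer than non-site positions.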