Ebrahimi Mansour, Aghagolzadeh Parisa, Shamabadi Narges, Tahmasebi Ahmad, Alsharifi Mohammed, Adelson David L, Hemmatzadeh Farhid, Ebrahimie Esmaeil
Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran.
Department of Nephrology, Hypertension, and Clinical Pharmacology, University of Bern, Bern, Switzerland.
PLoS One. 2014 May 8;9(5):e96984. doi: 10.1371/journal.pone.0096984. eCollection 2014.
The evolution of the influenza A virus to increase its host range is a major concern worldwide. The molecular mechanisms of increasing host range are largely unknown. Influenza surface proteins play determining roles in reorganization of host-sialic acid receptors and host range. In an attempt to uncover the physicochemical attributes which govern HA subtyping, we performed a large-scale functional analysis of over 7000 sequences of 16 different HA subtypes. A large number (896) of physicochemical protein characteristics were calculated for each HA sequence. Then, 10 different attribute weighting algorithms were used to find the key characteristics distinguishing HA subtypes. Furthermore, to discover machine learning models which can predict HA subtypes, various Decision Tree, Support Vector Machine, Naïve Bayes, and Neural Network models were trained on the calculated protein characteristics dataset as well as on 10 trimmed datasets generated by the attribute weighting algorithms. The prediction accuracies of the machine learning methods were evaluated by 10-fold cross-validation. The results highlighted the frequency of Gln (selected by 80% of attribute weighting algorithms), percentage/frequency of Tyr, percentage of Cys, and frequencies of Trp and Glu (selected by 70% of attribute weighting algorithms) as the key features that are associated with HA subtyping. The Random Forest tree induction algorithm and the RBF kernel function of SVM (scaled by grid search) showed a high accuracy of 98% in clustering and predicting HA subtypes based on protein attributes. Decision tree models were successful in monitoring the short mutation/reassortment paths by which the influenza virus can gain the key protein structure of another HA subtype and increase its host range in a short period of time with less energy consumption.
Extracting and mining a large number of amino acid attributes of HA subtypes of influenza A virus through supervised algorithms represents a new avenue for understanding and predicting the possible future structure of influenza pandemics.
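As a rough illustration of two steps named in the abstract, the sketch below computes composition features for the highlighted residues (Gln, Tyr, Cys, Trp, Glu) and builds index splits for 10-fold cross-validation. It is not the authors' pipeline: the function names, the reading of "frequency" as a raw residue count, the unshuffled fold layout, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter

# One-letter codes for the residues highlighted in the abstract.
KEY_RESIDUES = {"Gln": "Q", "Tyr": "Y", "Cys": "C", "Trp": "W", "Glu": "E"}


def composition_features(sequence):
    """Return count and percentage of each key residue in a protein sequence.

    Assumption: "frequency" is taken to mean the raw residue count.
    """
    seq = sequence.upper()
    counts = Counter(seq)
    length = len(seq)
    features = {}
    for name, code in KEY_RESIDUES.items():
        features[f"count_{name}"] = counts[code]
        features[f"percent_{name}"] = 100.0 * counts[code] / length if length else 0.0
    return features


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Folds are contiguous and unshuffled (a simplification); the first
    n_samples % k folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


if __name__ == "__main__":
    toy_sequence = "MKQYCWEQ"  # hypothetical fragment, not a real HA sequence
    print(composition_features(toy_sequence))
    for train_idx, test_idx in k_fold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
```

In a fuller workflow, a feature dictionary like this would be computed for every HA sequence, and each train/test split would be used to fit a classifier and score it on the held-out fold.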