Nath Abhigyan, Subbiah Karthikeyan
Department of Computer Science, Banaras Hindu University, Varanasi, India.
3 Biotech. 2016 Jun;6(1):93. doi: 10.1007/s13205-016-0410-1. Epub 2016 Mar 21.
To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity with one another, which hampers their annotation by sequence similarity-based search methods. Machine learning-based methods are a suitable alternative, but their performance is affected by factors such as class imbalance, incomplete learning, and the selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem: finding the optimal class distribution to enhance prediction accuracy for RNA silencing suppressors. The optimal class distribution was obtained by applying different resampling techniques at varying degrees of class distribution, ranging from the natural distribution to the ideal, i.e., equal, distribution. The experimental results indicate that the optimal class distribution plays an important role in achieving near-perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm. We achieved a sensitivity of 98.5 %, a specificity of 92.6 %, and an overall accuracy of 95.3 % under tenfold cross-validation, and further validated the result with a leave-one-out cross-validation test. We also observed that machine learning models trained on training sets oversampled with the synthetic minority oversampling technique (SMOTE) performed relatively better than models trained on either randomly undersampled or imbalanced training sets. Further, we characterized the important discriminatory sequence features of RNA silencing suppressors that distinguish this group of proteins from other protein families.
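The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional feature vectors, the neighbour count `k`, and the confusion-matrix counts in the demo are all assumptions made for illustration. The sketch grows the minority class with a SMOTE-style interpolation until a chosen class distribution is reached, and computes sensitivity, specificity, and accuracy from confusion-matrix counts.

```python
import math
import random


def target_minority_count(n_majority, minority_fraction):
    """Minority size needed so that minority / (minority + majority)
    equals `minority_fraction` (0.5 is the equal distribution)."""
    return round(n_majority * minority_fraction / (1.0 - minority_fraction))


def smote_like_oversample(minority, target_count, k=3, seed=0):
    """Grow `minority` to `target_count` points, SMOTE-style.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples (hypothetical helper).
    """
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(minority) + synthetic


def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    """Standard definitions from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Toy data only: 3 minority points against 12 majority points.
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    n_majority = 12
    # Sweep from the natural distribution (20 % minority) to equal (50 %).
    for fraction in (0.2, 0.3, 0.4, 0.5):
        wanted = target_minority_count(n_majority, fraction)
        resampled = smote_like_oversample(minority, wanted)
        print(fraction, len(resampled))
    # Hypothetical counts, not results from the paper.
    print(sensitivity_specificity_accuracy(tp=9, fn=1, tn=8, fp=2))
```

In a study of this kind, a classifier would be trained and cross-validated at each class distribution in the sweep, and the distribution with the best validation metrics retained. The classifier itself is omitted here to keep the sketch dependency-free.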