Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors.

Authors

Nath Abhigyan, Subbiah Karthikeyan

Affiliation

Department of Computer Science, Banaras Hindu University, Varanasi, India.

Publication information

3 Biotech. 2016 Jun;6(1):93. doi: 10.1007/s13205-016-0410-1. Epub 2016 Mar 21.

DOI:10.1007/s13205-016-0410-1

PMID:28330163

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4801844/

Abstract

To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These proteins share very low sequence and structural similarity with one another, which hampers their annotation by sequence similarity-based search methods. Machine learning-based methods are a suitable alternative, but their performance is affected by factors such as class imbalance, incomplete learning, and selection of inappropriate features. In this paper, we propose a novel approach to the class imbalance problem: finding the optimal class distribution to enhance prediction accuracy for RNA silencing suppressors. The optimal class distribution was obtained by applying different resampling techniques at varying degrees of class distribution, ranging from the natural distribution to the ideal (equal) distribution. The experimental results indicate that an optimal class distribution plays an important role in achieving near-perfect learning. The best prediction results were obtained with the Sequential Minimal Optimization (SMO) learning algorithm: a sensitivity of 98.5%, a specificity of 92.6%, and an overall accuracy of 95.3% under tenfold cross-validation, further validated with a leave-one-out cross-validation test. Machine learning models trained on training sets oversampled with the synthetic minority oversampling technique (SMOTE) also performed relatively better than those trained on either randomly undersampled or imbalanced training sets. Finally, we characterized the important discriminatory sequence features of RNA silencing suppressors that distinguish these proteins from other protein families.
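The resampling idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the function names (`synthetic_count`, `smote_like`, `sensitivity_specificity_accuracy`), the toy two-dimensional points, and the ratio grid are all assumptions made for illustration, and no SMO classifier or cross-validation is included. The sketch only shows how a minority class could be oversampled SMOTE-style to a chosen minority-to-majority ratio between the natural and the equal distribution, and how the three reported metrics are defined from confusion-matrix counts.

```python
import math
import random


def synthetic_count(n_minority, n_majority, target_ratio):
    """Synthetic minority samples needed so that minority/majority ~= target_ratio."""
    wanted = round(target_ratio * n_majority)
    return max(0, wanted - n_minority)


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def sensitivity_specificity_accuracy(tp, fn, tn, fp):
    """Standard definitions from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Toy data (hypothetical): 4 minority points against 20 majority points,
# i.e. a natural minority/majority ratio of 0.2.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n_majority = 20

# Sweep from the natural distribution (0.2) to the equal distribution (1.0).
for ratio in (0.2, 0.4, 0.6, 0.8, 1.0):
    n_new = synthetic_count(len(minority), n_majority, ratio)
    resampled = minority + smote_like(minority, n_new, k=3)
    print(f"ratio={ratio:.1f} minority size after oversampling={len(resampled)}")
```

In a real experiment, each resampled training set would be used to fit a classifier inside the cross-validation folds, and the ratio giving the best validation metrics would be kept. Oversampling should be applied to training folds only, so that synthetic points never leak into the evaluation data.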

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4666/4801844/fbee6f665ba4/13205_2016_410_Fig1_HTML.jpg

Similar articles

Enhanced Prediction and Characterization of CDK Inhibitors Using Optimal Class Distribution.

Interdiscip Sci. 2017 Jun;9(2):292-303. doi: 10.1007/s12539-016-0151-1. Epub 2016 Feb 15.

Unsupervised learning assisted robust prediction of bioluminescent proteins.

Comput Biol Med. 2016 Jan 1;68:27-36. doi: 10.1016/j.compbiomed.2015.10.013. Epub 2015 Nov 10.

Maximizing lipocalin prediction through balanced and diversified training set and decision fusion.

Comput Biol Chem. 2015 Dec;59 Pt A:101-10. doi: 10.1016/j.compbiolchem.2015.09.011. Epub 2015 Sep 28.

Supervised learning classification models for prediction of plant virus encoded RNA silencing suppressors.

PLoS One. 2014 May 14;9(5):e97446. doi: 10.1371/journal.pone.0097446. eCollection 2014.

Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.

Improvement of P300-Based Brain-Computer Interfaces for Home Appliances Control by Data Balancing Techniques.

Sensors (Basel). 2020 Sep 29;20(19):5576. doi: 10.3390/s20195576.

Efficient treatment of outliers and class imbalance for diabetes prediction.

Artif Intell Med. 2020 Apr;104:101815. doi: 10.1016/j.artmed.2020.101815. Epub 2020 Feb 10.

Identification of human drug targets using machine-learning algorithms.

Comput Biol Med. 2015 Jan;56:175-81. doi: 10.1016/j.compbiomed.2014.11.008. Epub 2014 Nov 20.

Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines.

IEEE Trans Neural Netw Learn Syst. 2018 Sep;29(9):4065-4076. doi: 10.1109/TNNLS.2017.2751612. Epub 2017 Oct 10.

Cited by

Application of machine learning in understanding plant virus pathogenesis: trends and perspectives on emergence, diagnosis, host-virus interplay and management.

Virol J. 2022 Mar 9;19(1):42. doi: 10.1186/s12985-022-01767-5.

Machine Learning Assisted Prediction of Prognostic Biomarkers Associated With COVID-19, Using Clinical and Proteomics Data.

Front Genet. 2021 May 20;12:636441. doi: 10.3389/fgene.2021.636441. eCollection 2021.

Application of machine learning for diagnostic prediction of root caries.

Gerodontology. 2019 Dec;36(4):395-404. doi: 10.1111/ger.12432. Epub 2019 Jul 5.
