Influence of Varying Training Set Composition and Size on Support Vector Machine-Based Prediction of Active Compounds.

Author information

Rodríguez-Pérez Raquel, Vogt Martin, Bajorath Jürgen

Affiliation

Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany.

Publication information

J Chem Inf Model. 2017 Apr 24;57(4):710-716. doi: 10.1021/acs.jcim.7b00088. Epub 2017 Apr 10.

DOI:10.1021/acs.jcim.7b00088

PMID:28376613

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5417594/

Abstract

Support vector machine (SVM) modeling is one of the most popular machine learning approaches in chemoinformatics and drug design. The influence of training set composition and size on predictions currently is an underinvestigated issue in SVM modeling. In this study, we have derived SVM classification and ranking models for a variety of compound activity classes under systematic variation of the number of positive and negative training examples. With increasing numbers of negative training compounds, SVM classification calculations became increasingly accurate and stable. However, this was only the case if a required threshold of positive training examples was also reached. In addition, consideration of class weights and optimization of cost factors substantially aided in balancing the calculations for increasing numbers of negative training examples. Taken together, the results of our analysis have practical implications for SVM learning and the prediction of active compounds. For all compound classes under study, top recall performance and independence of compound recall of training set composition was achieved when 250-500 active and 500-1000 randomly selected inactive training instances were used. However, as long as ∼50 known active compounds were available for training, increasing numbers of 500-1000 randomly selected negative training examples significantly improved model performance and gave very similar results for different training sets.
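The abstract notes that class weights helped balance the calculations as the number of negative training examples grew. As a rough illustration only, the sketch below computes inverse-frequency class weights for a few training set compositions taken from the ranges quoted above. The weighting formula is an assumption (it is the common "balanced" heuristic, n_total / (n_classes * n_class)); the abstract does not state which weighting or cost-factor optimization the authors used, and the function name is hypothetical.

```python
def balanced_class_weights(n_pos: int, n_neg: int) -> dict[str, float]:
    """Inverse-frequency weights: w_c = n_total / (n_classes * n_c).

    Assumed heuristic for illustration; not taken from the paper.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one training example")
    n_total = n_pos + n_neg
    return {
        "active": n_total / (2 * n_pos),
        "inactive": n_total / (2 * n_neg),
    }


# Compositions drawn from the ranges mentioned in the abstract.
for n_pos, n_neg in [(50, 500), (50, 1000), (250, 500), (500, 1000)]:
    w = balanced_class_weights(n_pos, n_neg)
    ratio = w["active"] / w["inactive"]
    print(
        f"{n_pos:>3} actives, {n_neg:>4} inactives: "
        f"w_active={w['active']:.3f}, w_inactive={w['inactive']:.3f}, "
        f"ratio={ratio:.1f}"
    )
```

Under this heuristic the ratio of the two weights equals the ratio of inactive to active training examples, so misclassifying an active compound costs proportionally more as negatives are added.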
