Rodríguez-Pérez Raquel, Vogt Martin, Bajorath Jürgen
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany.
J Chem Inf Model. 2017 Apr 24;57(4):710-716. doi: 10.1021/acs.jcim.7b00088. Epub 2017 Apr 10.
Support vector machine (SVM) modeling is one of the most popular machine learning approaches in chemoinformatics and drug design. However, the influence of training set composition and size on predictions remains underinvestigated in SVM modeling. In this study, we derived SVM classification and ranking models for a variety of compound activity classes while systematically varying the number of positive and negative training examples. With increasing numbers of negative training compounds, SVM classification calculations became increasingly accurate and stable, but only if a required threshold of positive training examples was also reached. In addition, consideration of class weights and optimization of cost factors substantially aided in balancing the calculations for increasing numbers of negative training examples. Taken together, the results of our analysis have practical implications for SVM learning and the prediction of active compounds. For all compound classes under study, top recall performance and independence of compound recall from training set composition were achieved when 250-500 active and 500-1000 randomly selected inactive training instances were used. However, as long as ∼50 known active compounds were available for training, increasing the number of randomly selected negative training examples to 500-1000 significantly improved model performance and gave very similar results for different training sets.
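The systematic variation of training set composition described above can be illustrated with a minimal sketch. It is not the authors' protocol: the compound identifiers, the helper names `balanced_class_weights` and `sample_training_set`, the grid values other than the 50, 250-500, and 500-1000 figures quoted in the abstract, and the inverse-frequency weighting rule are all assumptions made for illustration. The abstract does not say how class weights or cost factors were actually chosen.

```python
import random
from itertools import product


def balanced_class_weights(n_pos, n_neg):
    """Weight each class inversely to its frequency: w_c = n / (2 * n_c).

    This is a common heuristic, assumed here for illustration only.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one training example")
    total = n_pos + n_neg
    return {1: total / (2 * n_pos), 0: total / (2 * n_neg)}


def sample_training_set(actives, inactives, n_pos, n_neg, seed):
    """Draw n_pos actives and n_neg inactives at random, without replacement."""
    rng = random.Random(seed)
    pos = rng.sample(actives, n_pos)
    neg = rng.sample(inactives, n_neg)
    # Label 1 marks active (positive), label 0 marks inactive (negative).
    return pos + neg, [1] * n_pos + [0] * n_neg


# Hypothetical compound identifiers standing in for real activity data.
actives = [f"active_{i}" for i in range(500)]
inactives = [f"inactive_{i}" for i in range(1000)]

# Illustrative grid of training set compositions (assumed values).
pos_sizes = [50, 250, 500]
neg_sizes = [500, 1000]

for n_pos, n_neg in product(pos_sizes, neg_sizes):
    compounds, labels = sample_training_set(
        actives, inactives, n_pos, n_neg, seed=0
    )
    weights = balanced_class_weights(n_pos, n_neg)
    print(n_pos, n_neg, len(compounds), weights)
```

With 50 actives and 500 inactives, this rule weights each active ten times as heavily as each inactive, which is one simple way to keep a growing negative set from dominating training. A dictionary of this shape could be passed, for example, to the `class_weight` parameter of scikit-learn's `SVC`, with the cost factor `C` tuned separately.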