LIMES Program Unit, Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany.
J Chem Inf Model. 2013 Jul 22;53(7):1595-601. doi: 10.1021/ci4002712. Epub 2013 Jul 3.
The choice of negative training data for machine learning is a little-explored issue in chemoinformatics. In this study, the influence of alternative sets of negative training data and different background databases on support vector machine (SVM) modeling and virtual screening has been investigated. Target-directed SVM models have been derived on the basis of differently composed training sets containing confirmed inactive molecules or randomly selected database compounds as negative training instances. These models were then applied to search background databases consisting of biological screening data or randomly assembled compounds for available hits. Negative training data were found to systematically influence compound recall in virtual screening. In addition, different background databases had a strong influence on the search results. Our findings also indicated that typical benchmark settings lead to an overestimation of SVM-based virtual screening performance compared to search conditions that are more relevant for practical applications.
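The abstract does not say how compound recall was calculated. As a minimal sketch, assume recall is the fraction of known actives in the background database that appear among the k top-ranked compounds. The function name `recall_at_k`, the toy activity labels, and both score lists are hypothetical and are not data from the study.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[float], is_active: Sequence[bool], k: int) -> float:
    """Fraction of all database actives found among the k top-scored compounds."""
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have the same length")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("the background database contains no actives")
    # Rank database compounds by decreasing model score
    # (for an SVM this could be the signed distance from the hyperplane).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_active[i])
    return hits / total_actives


# Hypothetical background database: 8 compounds, 3 of them known actives.
is_active = [True, False, True, False, False, True, False, False]

# Made-up scores from two models assumed to differ only in their
# negative training data; the values illustrate the calculation only.
scores_model_a = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.0, 0.4]
scores_model_b = [0.9, 0.6, 0.2, 0.8, 0.3, 0.7, 0.1, 0.4]

recall_a = recall_at_k(scores_model_a, is_active, k=3)
recall_b = recall_at_k(scores_model_b, is_active, k=3)
print(f"model A recall@3 = {recall_a:.2f}, model B recall@3 = {recall_b:.2f}")
```

Scoring the same background database with each model and comparing recall at a fixed k is one simple way to see how the negative training set changes the ranking. Repeating the comparison on a second background database would probe the database effect described above.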