Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland.
Department of Technology and Biotechnology of Drugs, Jagiellonian University Medical College, 9 Medyczna Street, 30-688 Kraków, Poland.
J Chem Inf Model. 2019 Dec 23;59(12):4974-4992. doi: 10.1021/acs.jcim.9b00689. Epub 2019 Nov 22.
New computational approaches for virtual screening applications are constantly being developed. However, before a particular tool is used to search for new active compounds, its effectiveness in that type of task must be examined. In this study, we conducted a detailed analysis of various aspects of preparing data sets for such an evaluation. We propose a protocol for fetching data from the ChEMBL database, examine various compound representations for possible bias resulting from the way they are generated, and define a new metric for comparing the structural similarity of compounds that is in line with chemical intuition. The newly developed metric is also used to evaluate various approaches to dividing a data set into training and test parts, which are themselves examined in detail as a possible source of bias in the results. Finally, machine learning methods are applied in cross-validation studies of the data sets constructed within the paper, which constitute benchmarks for assessing computational methods developed for virtual screening tasks. Additionally, analogous data sets were prepared for class A G protein-coupled receptors (the 100 targets with the highest number of records). They are available at http://gmum.net/benchmarks/, together with a script enabling reproduction of all results, available at https://github.com/lesniak43/ananas.
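The point about train/test division as a source of bias can be illustrated with a minimal sketch. It is not the authors' protocol or their new similarity metric (those are in the repository linked above). It assumes, purely for illustration, that compounds are represented as sets of "on" fingerprint bits, that plain Tanimoto similarity stands in for the structural similarity measure, and that a greedy leader clustering stands in for a structure-aware split. All function names and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # indices of the "on" bits of a binary fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity of two bit sets; 1.0 for two empty sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def leader_clusters(fps: Sequence[Fingerprint], threshold: float) -> List[List[int]]:
    """Greedy leader clustering: a compound joins the first cluster whose
    leader is at least `threshold` similar, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for i, fp in enumerate(fps):
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def cluster_split(
    fps: Sequence[Fingerprint], threshold: float, test_fraction: float
) -> Tuple[List[int], List[int]]:
    """Move whole clusters (smallest first) into the test set until it holds
    at least the requested number of compounds; the rest is the training set."""
    target = round(test_fraction * len(fps))
    train: List[int] = []
    test: List[int] = []
    for cluster in sorted(leader_clusters(fps, threshold), key=len):
        (test if len(test) < target else train).extend(cluster)
    return train, test


def mean_nearest_train_similarity(
    fps: Sequence[Fingerprint], train: Sequence[int], test: Sequence[int]
) -> float:
    """Average, over test compounds, of the similarity to the closest training compound."""
    return sum(max(tanimoto(fps[t], fps[r]) for r in train) for t in test) / len(test)


# Toy data (hypothetical): three families of near-identical "compounds".
fps = [
    frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 2, 3, 6}),
    frozenset({10, 11, 12, 13}), frozenset({10, 11, 12, 14}),
    frozenset({20, 21, 22, 23}),
]

clusters = leader_clusters(fps, threshold=0.5)
train_idx, test_idx = cluster_split(fps, threshold=0.5, test_fraction=0.5)
cluster_sim = mean_nearest_train_similarity(fps, train_idx, test_idx)

# A split that ignores structure and separates close analogues.
naive_sim = mean_nearest_train_similarity(fps, train=[0, 2, 3], test=[1, 4, 5])

print(f"cluster-based split: {cluster_sim:.2f}, naive split: {naive_sim:.2f}")
```

On this toy data the structure-unaware split leaves test compounds with close analogues in the training set, while the cluster-based split does not. A model evaluated on the former can look better than it would on genuinely new chemotypes, which is the kind of result bias the abstract refers to.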