Xia Jie, Tilahun Ermias Lemma, Reid Terry-Elinor, Zhang Liangren, Wang Xiang Simon
State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, PR China; Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA.
Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA.
Methods. 2015 Jan;71:146-57. doi: 10.1016/j.ymeth.2014.11.015. Epub 2014 Dec 3.
Retrospective small-scale virtual screening (VS) on benchmarking data sets is widely used to estimate the ligand enrichment that VS approaches will achieve in prospective (i.e., real-world) campaigns. However, intrinsic differences between benchmarking sets and real screening libraries can bias this assessment. Herein, we summarize the history of benchmarking methods and data sets and highlight three main types of bias found in benchmarking sets: "analogue bias", "artificial enrichment" and "false negative". In addition, we introduce our recent algorithm for building maximum-unbiased benchmarking sets applicable to both ligand-based and structure-based VS approaches, and its application to three important human histone deacetylase (HDAC) isoforms: HDAC1, HDAC6 and HDAC8. Leave-one-out cross-validation (LOO CV) demonstrates that the benchmarking sets built by our algorithm are maximum-unbiased as measured by property matching, ROC curves and AUCs.
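As an illustration of the AUC metric mentioned above, the following minimal sketch computes the area under the ROC curve for a ranked list of actives and decoys. It is not the authors' implementation: the function name `roc_auc` and the example scores are hypothetical, and the sketch assumes that a higher score means a compound is predicted to be more likely active. It uses the standard rank-based identity that the AUC equals the probability that a randomly chosen active outscores a randomly chosen decoy.

```python
def roc_auc(active_scores, decoy_scores):
    """Return the ROC AUC for two lists of screening scores.

    Assumption: higher score = predicted more likely to be active.
    The AUC is computed as the fraction of (active, decoy) pairs in
    which the active is ranked above the decoy; ties count as half.
    """
    if not active_scores or not decoy_scores:
        raise ValueError("both actives and decoys are required")

    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical scores, for illustration only (not data from the paper).
actives = [0.9, 0.8, 0.4]
decoys = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(actives, decoys)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to a ranking no better than random, and an AUC of 1.0 to perfect separation of actives from decoys. The pairwise loop is quadratic in the number of compounds, which is acceptable for a small benchmarking set; a sort-based rank computation would be preferable for large libraries.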