Brain and Mind Centre, The Lambert Initiative for Cannabinoid Therapeutics, The University of Sydney, Sydney, New South Wales 2006, Australia.
J Chem Inf Model. 2020 Oct 26;60(10):4536-4545. doi: 10.1021/acs.jcim.0c00469. Epub 2020 Oct 7.
Ligand-based virtual screening is a useful tool for drug and probe discovery due to its high accessibility and scalability. The recent identification of bias in many data sets that were used in performance evaluation, quantified by the asymmetric validation embedding (AVE) score, has prompted the reanalysis of models to determine which performs best. Based on the understanding that ligand data are made up of blocks of highly correlated instances, we introduce a technique that quickly generates splits with AVE distributed close to zero using a combination of clustering and removal of the most biased data. We used our technique to compare the performance of the Morgan and CATS fingerprints and show that, after debiasing, the implementation of the CATS fingerprint performs significantly better. The code to replicate these results and perform low-bias splits is available at https://github.com/ljmartin/fp_low_ave.
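As a rough illustration of the bias measure mentioned above, the following sketch computes an AVE-style score for one train/validation split of binary fingerprints. It assumes the commonly cited form of the score, (AA - AI) + (II - IA), where each term is the fraction of validation molecules whose nearest training neighbour lies within a distance threshold, averaged over thresholds. The function names, the Jaccard distance, the inclusive comparison, and the 101-point threshold grid are illustrative assumptions and are not taken from the fp_low_ave repository. The clustering and data-removal steps of the technique are not shown.

```python
import numpy as np


def jaccard_distances(a, b):
    """Pairwise Jaccard (1 - Tanimoto) distances between rows of two binary arrays."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    # Broadcasting builds an (n_a, n_b, n_bits) array: fine for a sketch,
    # too memory-hungry for large data sets.
    inter = (a[:, None, :] & b[None, :, :]).sum(axis=2)
    union = (a[:, None, :] | b[None, :, :]).sum(axis=2)
    # Assumption: two all-zero fingerprints count as identical (similarity 1).
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim


def nn_coverage(valid, train, thresholds):
    """Mean over thresholds of the fraction of validation rows whose
    nearest training row is within that distance."""
    nearest = jaccard_distances(valid, train).min(axis=1)
    return float(np.mean([(nearest <= d).mean() for d in thresholds]))


def ave_bias(train_act, train_inact, valid_act, valid_inact, thresholds=None):
    """AVE-style bias: (AA - AI) + (II - IA). Values near zero suggest low bias."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # assumed threshold grid
    aa = nn_coverage(valid_act, train_act, thresholds)
    ai = nn_coverage(valid_act, train_inact, thresholds)
    ii = nn_coverage(valid_inact, train_inact, thresholds)
    ia = nn_coverage(valid_inact, train_act, thresholds)
    return (aa - ai) + (ii - ia)


if __name__ == "__main__":
    # Toy fingerprints (hypothetical data): actives and inactives share no bits.
    actives = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
    inactives = np.array([[0, 0, 1, 1], [0, 0, 0, 1]])
    # Validation set duplicates the training set, so the split is strongly biased.
    print(ave_bias(actives, inactives, actives, inactives))
```

A split in which validation actives sit close to training actives and far from training inactives (and likewise for inactives) gives a large positive score, which is the situation a low-bias splitting procedure tries to avoid.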