Swamidass S Joshua, Azencott Chloé-Agathe, Lin Ting-Wan, Gramajo Hugo, Tsai Shiou-Chuan, Baldi Pierre
School of Information and Computer Sciences, Institute for Genomics and Bioinformatics, University of California, Irvine, Irvine, California 92697-3435, USA.
J Chem Inf Model. 2009 Apr;49(4):756-66. doi: 10.1021/ci8004379.
Given activity training data from high-throughput screening (HTS) experiments, virtual high-throughput screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by nonlinearly combining the influences of a chemical's neighbors in the training set. Influences are decomposed, also nonlinearly, into a relevance component and a vote component. The IRV is benchmarked using the data and rules of two large, open competitions, and its performance is compared with that of the other participating methods, as well as with that of an in-house support vector machine (SVM) method. On these benchmark data sets, the IRV achieves state-of-the-art results: comparable to the SVM in one case, and significantly better than the SVM in the other, where it retrieves three times as many actives as the SVM in the top 1% of its prediction-sorted list. The IRV presents several other important advantages over SVMs and other methods: (1) the output predictions have probabilistic semantics; (2) the underlying inferences are interpretable; (3) the training time is very short, on the order of minutes even for very large data sets; (4) the risk of overfitting is minimal, due to the small number of free parameters; and (5) additional information can easily be incorporated into the IRV architecture. Combined with its performance, these qualities make the IRV particularly well suited for vHTS.
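The abstract describes the IRV architecture only in words and does not give its functional form. The following Python sketch illustrates one plausible reading under stated assumptions: fingerprints are sets of on-bits compared by Tanimoto similarity, the relevance of each neighbor is a logistic function of its similarity and rank, the vote is a class-dependent weight, the influence is their product, and the output is a logistic function of the summed influences. All function names, parameter names, and weight values here are hypothetical; in practice the weights would be learned from training data rather than set by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def irv_predict(query, training, params, k=3):
    """Return (probability of activity, per-neighbor influences).

    training: list of (fingerprint_set, label) pairs, label 1 = active.
    params:   dict of hypothetical weights (assumed, not from the abstract).
    """
    # Step 1: find the k nearest neighbors of the query in the training set.
    neighbors = sorted(
        training, key=lambda item: tanimoto(query, item[0]), reverse=True
    )[:k]

    total = params["bias"]
    influences = []
    for rank, (fingerprint, label) in enumerate(neighbors, start=1):
        similarity = tanimoto(query, fingerprint)
        # Relevance: how much this neighbor matters (assumed logistic form).
        relevance = sigmoid(
            params["w_rel_bias"]
            + params["w_sim"] * similarity
            + params["w_rank"] * rank
        )
        # Vote: direction of the neighbor's contribution, set by its class.
        vote = params["vote_active"] if label == 1 else params["vote_inactive"]
        # Influence: nonlinear (multiplicative) combination of the two.
        influence = relevance * vote
        influences.append(influence)
        total += influence

    # Output: nonlinear combination of influences, read as a probability.
    return sigmoid(total), influences


# Toy data and hand-picked weights, for illustration only.
params = {
    "bias": 0.0,
    "w_rel_bias": 0.0,
    "w_sim": 4.0,
    "w_rank": -0.5,
    "vote_active": 2.0,
    "vote_inactive": -2.0,
}
training = [
    ({1, 2, 3}, 1),
    ({1, 2, 4}, 1),
    ({7, 8, 9}, 0),
    ({7, 8, 10}, 0),
]

probability, influences = irv_predict({1, 2, 3, 4}, training, params, k=2)
print(probability, influences)
```

Because the prediction is a sum of per-neighbor influences passed through a single logistic function, each returned influence shows which training compound pushed the prediction toward active or inactive. This is one way to picture the interpretability and the small parameter count claimed in the abstract.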