Ma X H, Wang R, Yang S Y, Li Z R, Xue Y, Wei Y C, Low B C, Chen Y Z
Centre for Computational Science and Engineering, National University of Singapore, Singapore.
J Chem Inf Model. 2008 Jun;48(6):1227-37. doi: 10.1021/ci800022e. Epub 2008 Jun 6.
The virtual screening performance of support vector machines (SVM) depends on the diversity of the active and inactive compounds used for training. While diverse inactive compounds can be routinely generated, the number and diversity of known actives are typically low. We evaluated the performance of SVM trained on sparsely distributed actives in six MDDR biological target classes, each with a large number of known actives (983-1645) and of high, intermediate, or low structural diversity (muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins, and renin inhibitors). SVM trained on regularly sparse data sets of 100 actives showed improved yields at substantially reduced false-hit rates compared with those of published studies and those of a Tanimoto-based similarity searching method using the same data sets and molecular descriptors. SVM trained on very sparse data sets of 40 actives (2.4%-4.1% of the known actives) predicted 17.5%-39.5%, 23.0%-48.1%, and 70.2%-92.4% of the remaining 943-1605 actives in the high, intermediate, and low diversity classes, respectively, 13.8%-68.7% of which are outside the training compound families. SVM predicted 99.97% of the 9.997 million PubChem compounds and 97.1% of the 167,000 remaining MDDR compounds as inactive, and 2.6%-8.3% of the 19,495-38,483 MDDR compounds similar to the known actives as active. These results suggest that SVM has substantial capability to identify novel active compounds from sparse active data sets at low false-hit rates.
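The abstract does not spell out how the Tanimoto baseline or the training fractions were computed. As a minimal sketch only, the Python below uses the general Tanimoto coefficient for real-valued vectors, T = a·b / (|a|² + |b|² − a·b), which reduces to the familiar set form for binary fingerprints, and checks that 40 training actives out of 983-1645 known actives corresponds to the quoted 2.4%-4.1%. The function names, the toy vectors, and the convention of returning 0.0 for two all-zero vectors are assumptions made for illustration, not details taken from the study.

```python
from typing import Sequence


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Tanimoto coefficient of two equal-length descriptor vectors.

    For 0/1 fingerprints this equals |A and B| / |A or B|.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return dot / denom if denom else 0.0


def training_fraction(n_train: int, n_known: int) -> float:
    """Percentage of the known actives that were used for training."""
    return 100.0 * n_train / n_known


# Toy binary fingerprints (hypothetical, not from MDDR).
query = [1, 0, 1, 1, 0]
candidate = [1, 1, 1, 0, 0]
print(tanimoto(query, candidate))  # 2 shared bits / 4 bits set in either = 0.5

# 40 training actives against the largest and smallest classes in the abstract.
print(round(training_fraction(40, 1645), 1))  # 2.4
print(round(training_fraction(40, 983), 1))   # 4.1
```

In a similarity search of this kind, a database compound is typically flagged as a hit when its coefficient against some known active exceeds a chosen cutoff. The abstract does not state which cutoff was used.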