College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China.
J Comput Chem. 2010 Mar;31(4):752-63. doi: 10.1002/jcc.21347.
Small-molecule aggregators non-specifically inhibit multiple unrelated proteins, which renders them therapeutically useless. They frequently appear as false hits in high-throughput screening campaigns and therefore need to be eliminated. Computational methods for identifying aggregators have been explored, but they have not been tested in screening large compound libraries. We used 1319 aggregators and 128,325 non-aggregators to develop a support vector machine (SVM) model for aggregator identification, which was tested by four methods. The first is five-fold cross-validation, which showed aggregator identification rates comparable to, and non-aggregator identification rates significantly better than, those of earlier studies. The second is an independent test on 17 aggregators discovered independently of the training aggregators, 71% of which were correctly identified. The third is retrospective screening of 13M PUBCHEM and 168K MDDR compounds, of which 97.9% and 98.7%, respectively, were predicted to be non-aggregators. The fourth is retrospective screening of 5527 MDDR compounds similar to the known aggregators, 1.14% of which were predicted to be aggregators. In five-fold cross-validation studies under the same settings, SVM showed slightly better overall performance than two other machine learning methods. Molecular features of aggregation, extracted by a feature selection method, are consistent with published profiles. SVM showed substantial capability in identifying aggregators from large libraries at low false-hit rates.
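The five-fold cross-validation and the two identification rates reported in the abstract can be illustrated with a minimal sketch. The abstract does not say how the folds were built or which software was used, so the stratified, round-robin split below is an assumption, and the function names `stratified_folds` and `identification_rates` are hypothetical. The sketch only shows how fold membership and the aggregator and non-aggregator rates (sensitivity and specificity) might be computed. It does not train an SVM.

```python
import random
from typing import List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds while keeping the class ratio per fold.

    Assumption: stratification is used because the classes are highly
    imbalanced (1319 aggregators vs 128,325 non-aggregators).
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal shuffled indices round-robin
    return folds


def identification_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (aggregator rate, non-aggregator rate).

    Label 1 = aggregator, 0 = non-aggregator. The two values are the
    sensitivity and specificity of the predictions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present to compute both rates")
    return tp / pos, tn / neg


if __name__ == "__main__":
    # Class sizes taken from the abstract; the labels themselves are placeholders.
    labels = [1] * 1319 + [0] * 128325
    for n, fold in enumerate(stratified_folds(labels)):
        n_agg = sum(labels[i] for i in fold)
        print(f"fold {n}: {len(fold)} compounds, {n_agg} aggregators")
```

In each round, one fold would serve as the test set and the remaining four as training data for the classifier. Averaging the two rates over the five rounds gives figures of the kind the abstract compares against earlier studies.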