Zakharov Alexey V, Peach Megan L, Sitzmann Markus, Nicklaus Marc C
CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States.
J Chem Inf Model. 2014 Mar 24;54(3):705-12. doi: 10.1021/ci400737s. Epub 2014 Feb 28.
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds set against a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to develop and test strategies for efficiently building robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and "biological" descriptors] were used to generate a variety of QSAR models in the program GUSAR. The resulting models were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models into the publicly available NCI/CADD Group Web services (http://cactus.nci.nih.gov/chemical/apps/cap).