Zang Qingda, Rotroff Daniel M, Judson Richard S
ORISE Postdoctoral Fellow and National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States.
J Chem Inf Model. 2013 Dec 23;53(12):3244-61. doi: 10.1021/ci400527b. Epub 2013 Dec 11.
There are thousands of environmental chemicals subject to regulatory decisions for endocrine disrupting potential. The ToxCast and Tox21 programs have tested ∼8200 chemicals in a broad screening panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity. The present work uses this large data set to develop in silico quantitative structure-activity relationship (QSAR) models using machine learning (ML) methods and a novel approach to manage the imbalanced data distribution. Training compounds from the ToxCast project were categorized as active or inactive (binding or nonbinding) classes based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays. A total of 1537 chemicals from ToxCast were used to derive and optimize the binary classification models while 5073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate the model performance. In order to handle the imbalanced distribution of active and inactive chemicals, we developed a cluster-selection strategy to minimize information loss and increase predictive performance and compared this strategy to three currently popular techniques: cost-sensitive learning, oversampling of the minority class, and undersampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM) with 51 molecular descriptors from QikProp and 4328 bits of structural fingerprints as explanatory variables. A random forest (RF) feature selection method was employed to extract the structural features most relevant to the ER activity. 
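The abstract does not describe the cluster-selection procedure in detail, so the following is only a generic sketch of cluster-based undersampling of the majority (inactive) class, not the authors' method. It assumes fingerprints are represented as sets of "on" bit indices, uses Tanimoto similarity with a simple greedy leader clustering, and then keeps inactives round-robin across clusters so that every structural group stays represented. The function names, the 0.5 threshold, and the round-robin rule are all illustrative assumptions.

```python
from typing import FrozenSet, List

Fingerprint = FrozenSet[int]  # indices of the fingerprint bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-index sets."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_clusters(fps: List[Fingerprint], threshold: float) -> List[List[int]]:
    """Greedy leader clustering (an assumed, simple clustering choice).

    Each fingerprint joins the first cluster whose leader (first member)
    it matches at or above `threshold`; otherwise it starts a new cluster.
    """
    clusters: List[List[int]] = []
    for i, fp in enumerate(fps):
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def undersample_majority(inactive_fps: List[Fingerprint],
                         n_keep: int,
                         threshold: float = 0.5) -> List[int]:
    """Return indices of up to `n_keep` inactives, drawn round-robin
    across clusters so no structural group is dropped entirely
    while the quota allows."""
    clusters = leader_clusters(inactive_fps, threshold)
    target = min(n_keep, len(inactive_fps))
    kept: List[int] = []
    depth = 0
    while len(kept) < target:
        for cluster in clusters:
            if depth < len(cluster) and len(kept) < target:
                kept.append(cluster[depth])
        depth += 1
    return sorted(kept)
```

In practice, `n_keep` would typically be set near the number of active compounds so that the two classes are roughly balanced before model training.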
The best model was obtained using SVM combined with a subset of descriptors selected from the full set by the RF algorithm. It classified active and inactive compounds with accuracies of 76.1% and 82.8%, respectively, giving a total accuracy of 81.6% on the internal test set and 70.8% on the external test set. These results demonstrate that combining high-quality experimental data with ML methods can yield robust models with excellent predictive accuracy, which are potentially useful for the virtual screening of chemicals in environmental risk assessment.
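Because the classes are imbalanced, the total accuracy is a prevalence-weighted average of the two per-class accuracies. The short sketch below shows how these quantities relate. The helper names and the confusion-matrix counts are hypothetical; only the three internal-test-set percentages come from the abstract. Under that weighted-average relation, the reported figures imply that roughly 18% of the internal test set was active, an estimate limited by the rounding of the published percentages.

```python
def class_accuracies(tp: int, fn: int, tn: int, fp: int):
    """Per-class and total accuracy from confusion-matrix counts.

    tp/fn: actives predicted active/inactive
    tn/fp: inactives predicted inactive/active
    """
    acc_active = tp / (tp + fn)        # sensitivity
    acc_inactive = tn / (tn + fp)      # specificity
    acc_total = (tp + tn) / (tp + fn + tn + fp)
    return acc_active, acc_inactive, acc_total


def implied_active_fraction(acc_active: float,
                            acc_inactive: float,
                            acc_total: float) -> float:
    """Solve acc_total = p * acc_active + (1 - p) * acc_inactive for p."""
    return (acc_inactive - acc_total) / (acc_inactive - acc_active)


# Internal test set figures reported in the abstract.
active_fraction = implied_active_fraction(0.761, 0.828, 0.816)

# Balanced accuracy weights both classes equally, regardless of prevalence.
balanced_accuracy = (0.761 + 0.828) / 2
```

Balanced accuracy is shown alongside total accuracy because, with few actives, total accuracy is dominated by performance on the inactive class.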