Liu Jie, Patlewicz Grace, Williams Antony J, Thomas Russell S, Shah Imran
Department of Information Science, University of Arkansas at Little Rock, Arkansas 72204, United States.
Oak Ridge Institute for Science Education, National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States.
Chem Res Toxicol. 2017 Nov 20;30(11):2046-2059. doi: 10.1021/acs.chemrestox.7b00084. Epub 2017 Oct 9.
Animal testing alone cannot practically evaluate the health hazard posed by tens of thousands of environmental chemicals. Computational approaches that make use of high-throughput experimental data may provide a more efficient means of predicting chemical toxicity. Here, we use a supervised machine learning strategy to systematically investigate the relative importance of study type, machine learning algorithm, and descriptor type in predicting in vivo repeat-dose toxicity at the organ level. A total of 985 compounds were represented using chemical structural descriptors, ToxPrint chemotype descriptors, and bioactivity descriptors from ToxCast in vitro high-throughput screening assays. Using ToxRefDB, a total of 35 target organ outcomes were identified that contained at least 100 chemicals (50 positive and 50 negative). Supervised machine learning was performed using naïve Bayes, k-nearest neighbor, random forest, classification and regression trees, and support vector classification approaches. Model performance was assessed based on F1 scores using 5-fold cross-validation with balanced bootstrap replicates. Fixed effects modeling showed that the variance in F1 scores was explained mostly by target organ outcome, followed by descriptor type, machine learning algorithm, and interactions between these three factors. Combinations of bioactivity descriptors with chemical structure or chemotype descriptors were the most predictive. Model performance improved as more chemicals were included (by up to 24%), and these gains were correlated with the number of chemicals (ρ = 0.92). Overall, the results demonstrate that a combination of bioactivity and chemical descriptors can accurately predict a range of target organ toxicity outcomes in repeat-dose studies, but specific experimental and methodologic improvements may increase predictivity.
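The evaluation scheme named in the abstract (F1 scores from 5-fold cross-validation with balanced bootstrap replicates) can be illustrated with a minimal sketch. This is not the authors' code. The abstract does not say how resampling and cross-validation were combined, so the sketch assumes the balanced bootstrap is drawn only from the training folds, which keeps held-out chemicals out of training. The function names, the toy two-dimensional descriptors, and all parameter defaults are hypothetical, and a simple k-nearest neighbor classifier stands in for the five algorithms listed.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_bootstrap(indices: Sequence[int], y: Sequence[int],
                       n_per_class: int, rng: random.Random) -> List[int]:
    """Draw n_per_class positives and n_per_class negatives with replacement."""
    pos = [i for i in indices if y[i] == 1]
    neg = [i for i in indices if y[i] == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to draw a balanced sample")
    return rng.choices(pos, k=n_per_class) + rng.choices(neg, k=n_per_class)


def knn_predict(X: Sequence[Sequence[float]], y: Sequence[int],
                train_idx: Sequence[int], x: Sequence[float], k: int = 3) -> int:
    """Majority vote among the k training points closest to x (squared Euclidean)."""
    nearest = sorted(
        train_idx,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
    )[:k]
    return 1 if 2 * sum(y[i] for i in nearest) > k else 0


def cross_validated_f1(X: Sequence[Sequence[float]], y: Sequence[int],
                       n_folds: int = 5, n_replicates: int = 10,
                       n_per_class: int = 50, k: int = 3, seed: int = 0) -> float:
    """Mean F1 over n_replicates repetitions of n_folds-fold cross-validation.

    Assumption: each training fold is resampled into a balanced bootstrap
    sample, and the untouched held-out fold is used for scoring.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_replicates):
        order = list(range(len(y)))
        rng.shuffle(order)
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]          # every n_folds-th chemical
            held_out = set(test_idx)
            train_pool = [i for i in order if i not in held_out]
            train_idx = balanced_bootstrap(train_pool, y, n_per_class, rng)
            y_pred = [knn_predict(X, y, train_idx, X[i], k) for i in test_idx]
            scores.append(f1_score([y[i] for i in test_idx], y_pred))
    return sum(scores) / len(scores)


def make_toy_data(n_per_class: int = 30,
                  seed: int = 1) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical stand-in for descriptors: two well-separated Gaussian clusters."""
    rng = random.Random(seed)
    X = [[rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)] for _ in range(n_per_class)]
    X += [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(n_per_class)]
    y = [1] * n_per_class + [0] * n_per_class
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"mean F1: {cross_validated_f1(X, y, n_replicates=3, n_per_class=20):.3f}")
```

Resampling only the training folds is a design choice made here to avoid the same chemical appearing in both training and test data. If the bootstrap were drawn before splitting, duplicated chemicals could fall on both sides of a fold and inflate the F1 estimate.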