Allen Chad H G, Koutsoukas Alexios, Cortés-Ciriano Isidro, Murrell Daniel S, Malliavin Thérèse E, Glen Robert C, Bender Andreas
Centre for Molecular Informatics, Department of Chemistry, Lensfield Road, Cambridge CB2 1EW, UK.
Unité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3528, Structural Biology and Chemistry Department, Paris, France.
Toxicol Res (Camb). 2016 Mar 3;5(3):883-894. doi: 10.1039/c5tx00406c. eCollection 2016 May 1.
Prediction of compound toxicity is essential because the vast chemical space requiring safety assessment cannot be covered using traditional, resource-intensive experimental techniques. However, such prediction is nontrivial due to the complex causal relationship between compound structure and harm. Protein target annotations and experimental outcomes encode relevant bioactivity information complementary to chemicals' structures. This work tests the hypothesis that utilizing three complementary types of data will afford predictive models that outperform traditional models built using fewer data types. A tripartite, heterogeneous descriptor set for 367 compounds comprised (a) chemical descriptors, (b) protein target descriptors generated using an algorithm trained on 190 000 ligand-protein interactions from ChEMBL, and (c) descriptors derived from cell cytotoxicity dose-response data from a panel of human cell lines. One hundred random forest classification models for predicting rat LD50 were built using every combination of descriptors. Successive integration of data types improved predictive performance; models built using the full dataset had an average external correct classification rate of 0.82, compared to 0.73-0.80 for models built using two data types and 0.67-0.78 for models built using one. Pairwise comparisons of models trained on the same data showed that including a third data domain on top of chemistry improved average correct classification rate by 1.4-2.4 percentage points, with p-values <0.01. Additionally, the approach enhanced the models' applicability domains and proved useful for generating novel mechanism hypotheses. The use of tripartite heterogeneous bioactivity datasets is a useful technique for improving toxicity prediction. Both protein target descriptors (which have the practical value of being derived in silico) and cytotoxicity descriptors derived from experiment are suitable contributors to such datasets.
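To make the experimental design concrete, the following minimal Python sketch enumerates every non-empty combination of the three descriptor types and computes a correct classification rate for a set of predictions. It is an illustration under stated assumptions, not the authors' code: the block names are hypothetical labels, the labels and predictions are invented toy values, and the correct classification rate is assumed here to be the mean of per-class recall (balanced accuracy), a definition the abstract does not spell out.

```python
from itertools import combinations

# Hypothetical labels for the three descriptor types named in the abstract.
BLOCKS = ("chemical", "protein_target", "cytotoxicity")


def descriptor_combinations(blocks=BLOCKS):
    """Return every non-empty combination of descriptor blocks."""
    return [
        combo
        for size in range(1, len(blocks) + 1)
        for combo in combinations(blocks, size)
    ]


def correct_classification_rate(y_true, y_pred):
    """Mean of per-class recall.

    Assumption: CCR is taken to be balanced accuracy; the abstract
    does not define the metric explicitly.
    """
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Three blocks give 3 single, 3 paired, and 1 full combination.
    for combo in descriptor_combinations():
        print(len(combo), "+".join(combo))

    # Toy labels and predictions, invented purely for illustration.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    print(round(correct_classification_rate(y_true, y_pred), 3))
```

In a fuller pipeline, each combination would select the matching descriptor columns before a random forest classifier is trained and scored on held-out compounds; that step is omitted here to keep the sketch free of third-party dependencies.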