Banerjee Priyanka, Siramshetty Vishal B, Drwal Malgorzata N, Preissner Robert
Structural Bioinformatics Group, Institute for Physiology, Charité - University Medicine Berlin, Berlin, Germany ; Graduate School of Computational Systems Biology, Humboldt University of Berlin, Berlin, Germany.
Structural Bioinformatics Group, Experimental and Clinical Research Center (ECRC), Charité - University Medicine Berlin, Berlin, Germany ; BB3R - Berlin Brandenburg 3R Graduate School, Free University of Berlin, Berlin, Germany.
J Cheminform. 2016 Sep 29;8:51. doi: 10.1186/s13321-016-0162-2. eCollection 2016.
With the number of newly synthesized chemicals increasing every year, it is important to employ reliable and fast in silico screening methods to predict their safety and activity profiles. In recent years, in silico prediction methods have received great attention as a way to reduce animal experiments for the evaluation of various toxicological endpoints, in keeping with the principle of replace, reduce and refine. Various computational approaches have been proposed for the prediction of compound toxicity, ranging from quantitative structure-activity relationship modeling to molecular similarity-based methods and machine learning. Within the "Toxicology in the 21st Century" screening initiative, a crowd-sourcing platform was established for the development and validation of computational models that predict the interference of chemical compounds with nuclear receptor and stress response pathways, based on a training set containing more than 10,000 compounds tested in high-throughput screening assays.
Here, we present the results of various molecular similarity-based and machine learning-based methods on an independent evaluation set of 647 compounds provided by the Tox21 Data Challenge 2014. The Random Forest approach based on MACCS molecular fingerprints and a subset of 13 molecular descriptors, selected through statistical and literature analysis, performed best in terms of the area under the receiver operating characteristic curve. We further compared the individual and combined performance of the different methods. In retrospect, we also discuss why an ensemble approach that combines a similarity search method with the Random Forest algorithm outperforms the individual methods, and we explain the intrinsic limitations of the latter.
Our results suggest that, although the prediction methods were optimized individually for each modelled target, an ensemble of similarity and machine-learning approaches provides promising performance, indicating its broad applicability in toxicity prediction.
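As a rough illustration of the ensemble idea described above, the following minimal sketch scores a query compound by its similarity to known actives, combines that score with a machine-learning probability, and evaluates the ranking by the area under the ROC curve. The abstract does not state the similarity metric or the combination rule, so the Tanimoto coefficient, the plain average, and all function names here are assumptions made for illustration; the `model_prob` argument merely stands in for the output of a classifier such as a Random Forest.

```python
from typing import FrozenSet, Sequence, Tuple

# A fingerprint is modelled as the set of bit positions that are switched on
# (for example, the on-bits of a MACCS key vector). Hypothetical representation.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_score(query: Fingerprint,
                     training: Sequence[Tuple[Fingerprint, bool]]) -> float:
    """Highest similarity of the query to any active training compound."""
    return max((tanimoto(query, fp) for fp, active in training if active),
               default=0.0)


def ensemble_score(sim: float, model_prob: float) -> float:
    """Assumed combination rule: unweighted mean of the two scores."""
    return 0.5 * (sim + model_prob)


def roc_auc(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """AUC as the chance that a random active outranks a random inactive.

    Ties between an active and an inactive count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In such a sketch, the similarity term can rescue compounds that closely resemble a known active but receive a low classifier probability, which is one plausible reading of why a combined score might rank better than either method alone.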