School of Life Science, Liaoning University, Shenyang, 110036, China.
Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province, Shenyang, 110036, China.
Sci Rep. 2017 May 18;7(1):2118. doi: 10.1038/s41598-017-02365-0.
Carcinogenicity is a highly toxic endpoint of certain chemicals and has become an important issue in the drug development process. In this study, three novel ensemble classification models, namely Ensemble SVM, Ensemble RF, and Ensemble XGBoost, were developed to predict the carcinogenicity of chemicals using seven types of molecular fingerprints and three machine learning methods, based on a dataset of 1003 diverse compounds with rat carcinogenicity data. Among these three models, Ensemble XGBoost performed best, giving an average accuracy of 70.1 ± 2.9%, sensitivity of 67.0 ± 5.0%, and specificity of 73.1 ± 4.4% in five-fold cross-validation, and an accuracy of 70.0%, sensitivity of 65.2%, and specificity of 76.5% in external validation. In comparison with some recent methods, the ensemble models outperform some machine learning-based approaches and yield equal accuracy and higher specificity but lower sensitivity than rule-based expert systems. It was also found that the ensemble models could be further improved if more data were available. As an application, the ensemble models were employed to discover potential carcinogens in the DrugBank database. The results indicate that the proposed models are helpful in predicting the carcinogenicity of chemicals. A web server called CarcinoPred-EL has been built for these models (http://ccsipb.lnu.edu.cn/toxicity/CarcinoPred-EL/).
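To make the reported metrics concrete, the following minimal sketch shows how accuracy, sensitivity, and specificity can be computed for an ensemble classifier. The abstract does not state how the base models are combined, so averaging the predicted probabilities of the base models (one per fingerprint type) and thresholding at 0.5 is an assumption made here for illustration. The function names and the toy probabilities are hypothetical and are not data from the study.

```python
from statistics import mean


def ensemble_predict(per_model_probs, threshold=0.5):
    """Assumed combination rule: average the carcinogen probabilities
    from the base models, then label 1 (carcinogen) if the average
    reaches the threshold, else 0 (non-carcinogen)."""
    return 1 if mean(per_model_probs) >= threshold else 0


def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity (true positive rate), and
    specificity (true negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy example (hypothetical values): six compounds, three base models each.
y_true = [1, 1, 1, 0, 0, 0]
base_model_probs = [
    [0.90, 0.70, 0.80],
    [0.60, 0.30, 0.20],
    [0.55, 0.60, 0.70],
    [0.20, 0.10, 0.40],
    [0.70, 0.60, 0.40],
    [0.30, 0.20, 0.10],
]

y_pred = [ensemble_predict(probs) for probs in base_model_probs]
metrics = evaluate(y_true, y_pred)
print(y_pred)
print(metrics)
```

In a five-fold cross-validation setting, `evaluate` would be applied to each held-out fold, and the mean and standard deviation of each metric would then be reported across the five folds.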