Optimizing machine-learning models for mutagenicity prediction through better feature selection.

Affiliations

SBX Corporation, Tokyo, Japan.

Global Drug Safety, Eisai Co., Ltd., Tokyo, Japan.

Publication information

Mutagenesis. 2022 Oct 26;37(3-4):191-202. doi: 10.1093/mutage/geac010.

Abstract

Assessing a compound's mutagenicity using machine learning is an important activity in the drug discovery and development process. Traditional methods of mutagenicity detection, such as the Ames test, are expensive, time-consuming, and labor-intensive. In this context, in silico methods that predict a compound's mutagenicity with high accuracy are important. Machine-learning (ML) models are increasingly being proposed to improve the accuracy of mutagenicity prediction. While these models are used in practice, there is still scope to improve their accuracy. We hypothesize that choosing the right features to train the model can lead to better accuracy. We systematically consider and evaluate a combination of novel structural and molecular features that have the greatest impact on model accuracy. We rigorously evaluate these features against multiple classification models (from classical ML models to deep neural network models). The performance of the models was assessed using 5- and 10-fold cross-validation, and we show that our approach, using the molecule structure, molecular properties, and structural alerts as feature sets, outperforms the state-of-the-art methods for mutagenicity prediction on the Hansen et al. benchmark dataset, with an area under the receiver operating characteristic curve of 0.93. More importantly, our framework shows how combining features can improve model accuracy.
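
The evaluation loop the abstract describes (concatenate several feature sets per compound, then score a classifier by cross-validated area under the ROC curve) can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption made for illustration: the function names are hypothetical, the nearest-centroid scorer is a simple stand-in rather than any model from the paper, and the feature blocks are placeholders for structure, property, and structural-alert features.

```python
import random
from itertools import chain
from statistics import mean


def combine(*blocks):
    """Concatenate per-compound feature blocks (e.g. structure bits,
    molecular properties, structural-alert flags) into one row each."""
    return [list(chain.from_iterable(rows)) for rows in zip(*blocks)]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Deal the indices of each class round-robin into k folds so every
    fold keeps roughly the original class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds


def centroid(rows):
    """Column-wise mean of a list of feature rows."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_val_auc(X, y, k=5, seed=0):
    """Mean AUC over k stratified folds, using a nearest-centroid scorer
    as a stand-in classifier (an assumption, not the paper's model)."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        c_pos = centroid([X[i] for i in train if y[i] == 1])
        c_neg = centroid([X[i] for i in train if y[i] == 0])
        # Higher score means closer to the positive (mutagenic) centroid.
        scores = [sq_dist(X[i], c_neg) - sq_dist(X[i], c_pos)
                  for i in test_idx]
        aucs.append(roc_auc([y[i] for i in test_idx], scores))
    return mean(aucs)
```

Under these assumptions, comparing `cross_val_auc(structure_block, y)` with `cross_val_auc(combine(structure_block, property_block, alert_block), y)` at `k=5` and `k=10` would mirror the kind of feature-set comparison the abstract reports, where the three block names are hypothetical inputs with one row per compound.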

