In silico prediction of toxic action mechanisms of phenols for imbalanced data with Random Forest learner.

Author information

College of Computer Science, Chongqing University, Chongqing 400030, China.

Publication information

J Mol Graph Model. 2012 May;35:21-7. doi: 10.1016/j.jmgm.2012.01.002. Epub 2012 Jan 17.

Abstract

With an increasing need for rapid and effective safety assessment of compounds in industrial and civil-use products, in silico toxicity exploration techniques provide an economical approach to environmental hazard assessment. Over the last decade, in silico research has produced many quantitative structure-activity relationship (QSAR) models for predicting toxicity mechanisms. Most of these methods rely on data analysis and machine learning techniques, which in turn depend heavily on the characteristics of the data sets. For Tetrahymena pyriformis toxicity data sets, a major technical challenge is data imbalance: the skewed class distribution greatly degrades prediction performance on rare classes. Most previous studies on predicting the mechanisms of toxic action of phenols did not consider this practical problem. In this work, we addressed the problem by taking into account the difference between the two types of misclassification. A Random Forest learner was employed within a cost-sensitive learning framework to construct prediction models based on selected molecular descriptors. In computational experiments, both the global and local models achieved appreciable overall prediction accuracy. In particular, performance on rare classes improved. Moreover, for practical use of these models, the balance between the two types of misclassification can be adjusted by using different cost matrices according to the application goals.
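
The role of the cost matrix can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one common way of applying a cost matrix, namely choosing the class with the lowest expected misclassification cost given class probabilities (such as the vote fractions of a random forest). The function name, the probabilities, and the cost values are all hypothetical.

```python
from typing import Sequence


def min_expected_cost_class(
    probs: Sequence[float],
    cost: Sequence[Sequence[float]],
) -> int:
    """Return the class index with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    (Hypothetical helper for illustration only.)
    """
    n = len(probs)
    # Expected cost of predicting class j, averaged over the possible true classes.
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)


# Hypothetical two-class case: class 0 is common, class 1 is rare.
probs = [0.7, 0.3]

# Equal costs reduce to the usual "most probable class" rule.
uniform_cost = [[0, 1], [1, 0]]

# Assumed costs: missing a rare-class compound is 5x worse than a false alarm.
rare_weighted_cost = [[0, 1], [5, 0]]

print(min_expected_cost_class(probs, uniform_cost))
print(min_expected_cost_class(probs, rare_weighted_cost))
```

With equal costs the common class wins; raising the cost of missing the rare class shifts the same probabilities toward a rare-class prediction, which is the kind of trade-off the abstract describes adjusting through the cost matrix.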
