Applying Mondrian Cross-Conformal Prediction To Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets.

Affiliation

Swetox, Karolinska Institutet, Unit of Toxicology Sciences, Södertälje 15136, Sweden.

Publication information

J Chem Inf Model. 2017 Jul 24;57(7):1591-1598. doi: 10.1021/acs.jcim.7b00159. Epub 2017 Jun 30.

Abstract

Conformal prediction has been proposed as a more rigorous way to define prediction confidence than other application domain concepts previously used for QSAR modeling. One main advantage of the method is that it provides a prediction region, potentially containing multiple predicted labels, in contrast to the single-valued (regression) or single-label (classification) output of standard QSAR modeling algorithms. Standard conformal prediction might not be suitable for imbalanced data sets. Therefore, Mondrian cross-conformal prediction (MCCP), which combines Mondrian inductive conformal prediction with cross-fold calibration sets, has been introduced. In this study, the MCCP method was applied to 18 publicly available data sets with imbalance levels ranging from 1:10 to 1:1000 (ratio of active to inactive compounds). Our results show that MCCP in general performed well on bioactivity data sets with various imbalance levels. More importantly, unlike standard machine learning methods, the method not only provides prediction confidence and prediction regions but also produces valid predictions for the minority class. In addition, a compound-similarity-based nonconformity measure was investigated. Our results demonstrate that although it gives valid predictions, its efficiency is much worse than that of model-dependent metrics.
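To make the idea of combining Mondrian (class-conditional) calibration with cross-fold calibration sets more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. Several parts are assumptions made only for illustration: the toy class-centroid model and its distance-based nonconformity score, the stratified fold assignment, the use of the median to aggregate p-values across folds, and all function names (`centroids`, `nonconformity`, `mccp_p_values`, `prediction_region`). The toy data are synthetic, with a 1:10 active/inactive ratio.

```python
import random
import statistics


def centroids(X, y):
    """Toy underlying model (an assumption): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def nonconformity(model, x, label):
    """Euclidean distance to the centroid of the candidate label.
    Larger values mean the example looks stranger for that label."""
    return sum((a - b) ** 2 for a, b in zip(x, model[label])) ** 0.5


def mccp_p_values(X, y, x_new, n_folds=5, seed=0):
    """Return one aggregated p-value per class for x_new."""
    labels = sorted(set(y))
    rng = random.Random(seed)
    # Stratified folds (an assumption): deal each class round-robin so every
    # fold holds calibration examples of the minority class as well.
    folds = [[] for _ in range(n_folds)]
    for c in labels:
        members = [i for i, yi in enumerate(y) if yi == c]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)

    per_fold = {c: [] for c in labels}
    for cal in folds:
        cal_set = set(cal)
        train = [i for i in range(len(X)) if i not in cal_set]
        model = centroids([X[i] for i in train], [y[i] for i in train])
        for c in labels:
            # Mondrian step: compare only against calibration examples
            # whose true class is the candidate label c.
            cal_scores = [nonconformity(model, X[i], c) for i in cal if y[i] == c]
            a_new = nonconformity(model, x_new, c)
            at_least = sum(1 for s in cal_scores if s >= a_new)
            per_fold[c].append((at_least + 1) / (len(cal_scores) + 1))
    # Aggregate the per-fold p-values; the median is an assumption here.
    return {c: statistics.median(p) for c, p in per_fold.items()}


def prediction_region(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {c for c, p in p_values.items() if p > significance}


# Synthetic 1:10 data set: 200 inactives (label 0) and 20 actives (label 1).
X = [[i * 0.01] for i in range(200)] + [[5.0 + j * 0.05] for j in range(20)]
y = [0] * 200 + [1] * 20

for query in ([5.5], [1.0]):
    p = mccp_p_values(X, y, query)
    print(query, p, prediction_region(p, significance=0.25))
```

Because each class is calibrated separately, the smallest p-value a class can receive in one fold is 1 / (number of calibration examples of that class + 1). With only four actives per calibration fold in this toy set, the active label cannot be rejected at significance levels below 0.2, which illustrates why very small minority classes limit the achievable efficiency.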
