A Multi-Objective Genetic Algorithm for Outlier Removal.

Author information

Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel.

School of Management and Economics, The Academic College of Tel-Aviv-Yafo, Yafo 61083, Israel.

Publication information

J Chem Inf Model. 2015 Dec 28;55(12):2507-18. doi: 10.1021/acs.jcim.5b00515. Epub 2015 Nov 23.

Abstract

Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) models correlate the activities of sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k-nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function (termed "preservation") was added to the algorithm, so that certain compounds are removed only with low probability. This option is highly useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
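The abstract does not spell out the objective functions, so the following is only a minimal illustrative sketch of the general idea behind kNN-based outlier scoring, not the authors' algorithm. It assumes Euclidean distance on a toy two-dimensional descriptor space; the names `knn_scores` and `objectives` and the choice of the two objectives (number of compounds removed, mean kNN distance among the kept compounds) are hypothetical.

```python
import math

def knn_scores(points, k):
    """Mean Euclidean distance from each point to its k nearest neighbors.

    A large score means the point sits far from the rest of the set.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def objectives(points, keep_mask, k):
    """Two quantities a multi-objective search could trade off (assumed here):
    how many compounds were removed, and the mean kNN distance among those kept.
    """
    kept = [p for p, keep in zip(points, keep_mask) if keep]
    removed = len(points) - len(kept)
    return removed, sum(knn_scores(kept, k)) / len(kept)

# Toy descriptor vectors: four clustered points and one distant point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_scores(data, k=2)
worst = max(range(len(data)), key=scores.__getitem__)  # index of the most isolated point

all_kept = objectives(data, [True] * 5, k=2)
one_dropped = objectives(data, [True, True, True, True, False], k=2)
```

In this toy set the distant point receives the highest score, and dropping it lowers the mean kNN distance at the cost of one removed compound. A genetic algorithm would search over many such keep/remove masks rather than evaluating two by hand.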
