A Multi-Objective Genetic Algorithm for Outlier Removal.

Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel.

School of Management and Economics, The Academic College of Tel-Aviv - Yafo, Yafo 61083, Israel.

J Chem Inf Model. 2015 Dec 28;55(12):2507-18. doi: 10.1021/acs.jcim.5b00515. Epub 2015 Nov 23.

Quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR) models are developed to correlate the activities of sets of compounds with their structure-derived descriptors. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. Here we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets that (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function (termed "preservation") was added to the algorithm, so that certain compounds are removed only with low probability. This option is highly useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.
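To make the idea of multi-objective, kNN-based outlier removal more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the objective functions, so everything here is an assumption for illustration: a candidate solution is a boolean mask `keep`, the first objective is a leave-one-out kNN prediction error over the retained compounds, the second is the number of removed compounds, and an optional third "preservation" objective sums hypothetical user-supplied `weights` of the removed compounds. Candidates are compared by Pareto dominance, as a genetic algorithm of this kind would typically do when selecting survivors.

```python
import math

def loo_knn_error(X, y, keep, k=3):
    """Mean absolute leave-one-out kNN error over the retained compounds."""
    idx = [i for i, kept in enumerate(keep) if kept]
    if len(idx) <= k:
        return float("inf")  # too few compounds left to form k neighbors
    total = 0.0
    for i in idx:
        # k nearest retained neighbors of compound i in descriptor space
        neighbors = sorted((math.dist(X[i], X[j]), j) for j in idx if j != i)[:k]
        pred = sum(y[j] for _, j in neighbors) / k
        total += abs(pred - y[i])
    return total / len(idx)

def objectives(X, y, keep, k=3, weights=None):
    """Objective vector to minimize (assumed objectives, for illustration only)."""
    scores = [loo_knn_error(X, y, keep, k), keep.count(False)]
    if weights is not None:
        # hypothetical "preservation" term: cost of removing favored compounds
        scores.append(sum(w for w, kept in zip(weights, keep) if not kept))
    return tuple(scores)

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))

def pareto_front(population, scores):
    """Candidates whose objective vectors are not dominated by any other."""
    return [ind for ind, s in zip(population, scores)
            if not any(dominates(t, s) for t in scores)]

# Toy data: one descriptor per compound; the last activity is an obvious outlier.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 50.0]

population = [
    [True] * 6,                             # keep everything
    [True] * 5 + [False],                   # remove the outlier
    [False] + [True] * 5,                   # remove a well-behaved compound
]
scores = [objectives(X, y, keep, k=2) for keep in population]
front = pareto_front(population, scores)
print(scores)
print(front)
```

In this toy case, removing a well-behaved compound is dominated, while "keep everything" and "remove the outlier" both survive as trade-offs between model error and data retention. A full genetic algorithm would add crossover, mutation, and repeated selection on top of this comparison.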


Similar articles

A Multi-Objective Genetic Algorithm for Outlier Removal.

J Chem Inf Model. 2015 Dec 28;55(12):2507-18. doi: 10.1021/acs.jcim.5b00515. Epub 2015 Nov 23.

k-Nearest neighbors optimization-based outlier removal.

J Comput Chem. 2015 Mar 30;36(8):493-506. doi: 10.1002/jcc.23803. Epub 2014 Dec 15.

Optimization of molecular representativeness.

J Chem Inf Model. 2014 Jun 23;54(6):1567-77. doi: 10.1021/ci400715n. Epub 2014 May 19.

Combinatorial QSAR of ambergris fragrance compounds.

J Chem Inf Comput Sci. 2004 Mar-Apr;44(2):582-95. doi: 10.1021/ci034203t.

Evaluation of QSAR Equations for Virtual Screening.

Int J Mol Sci. 2020 Oct 22;21(21):7828. doi: 10.3390/ijms21217828.

4D-QSAR study of HEPT derivatives by electron conformational-genetic algorithm method.

SAR QSAR Environ Res. 2012 Jul;23(5-6):409-33. doi: 10.1080/1062936X.2012.665082. Epub 2012 Mar 27.

Regression Modelability Index: A New Index for Prediction of the Modelability of Data Sets in the Development of QSAR Regression Models.

J Chem Inf Model. 2018 Oct 22;58(10):2069-2084. doi: 10.1021/acs.jcim.8b00313. Epub 2018 Sep 25.

Fuzzy ARTMAP prediction of biological activities for potential HIV-1 protease inhibitors using a small molecular data set.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Jan-Mar;8(1):80-93. doi: 10.1109/TCBB.2009.50.

General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity.

J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.

Does rational selection of training and test sets improve the outcome of QSAR modeling?

J Chem Inf Model. 2012 Oct 22;52(10):2570-8. doi: 10.1021/ci300338w. Epub 2012 Oct 3.

Cited by

RANdom SAmple Consensus (RANSAC) algorithm for material-informatics: application to photovoltaic solar cells.

J Cheminform. 2017 Jun 6;9(1):34. doi: 10.1186/s13321-017-0224-0.
