Faculty of Medicine and Health Technology, Tampere University, Tampere 33200, Finland.
Department of Mathematics and Applications, University of Napoli Federico II, Naples 80138, Italy.
Bioinformatics. 2020 Jan 1;36(1):145-153. doi: 10.1093/bioinformatics/btz521.
Quantitative structure-activity relationship (QSAR) modelling is currently used in multiple fields to relate structural properties of compounds to their biological activities. The technique is also used in drug design to predict parameters that determine drug behaviour. To this end, a multi-step analytical pipeline is employed to identify and fine-tune the optimal set of predictors from a large dataset of molecular descriptors (MDs). The search for the optimal model requires optimizing multiple objectives at the same time, as the aim is to obtain the minimal set of features that maximizes both the goodness of fit and the applicability domain (AD). Hence, a multi-objective optimization strategy, which improves multiple parameters in parallel, can be applied. Here we propose a new multi-niche multi-objective genetic algorithm that simultaneously enables stable feature selection and yields robust, validated regression models with maximized AD. We benchmarked our method on two simulated datasets. Moreover, we analyzed an aquatic acute toxicity dataset and compared the performance of single- and multi-objective fitness functions on different regression models. Our results show that, for continuous response values, our multi-objective algorithm is a valid alternative to classical QSAR modelling strategies, since it automatically finds the model with the best compromise between statistical robustness, predictive performance, the widest AD and the smallest number of MDs.
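The trade-off described above, between goodness of fit, AD and number of MDs, can be illustrated with a minimal Pareto-dominance sketch. This is not the MaNGA implementation: the `Candidate` type, the `ad_coverage` measure, the function names and the numeric values are all hypothetical and chosen only to show how candidate models with competing objectives might be compared.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A hypothetical candidate model (illustrative values only)."""
    name: str
    r2: float           # goodness of fit, higher is better
    ad_coverage: float  # assumed AD measure: fraction of samples inside the AD
    n_features: int     # number of selected descriptors, lower is better


def objectives(c: Candidate) -> Tuple[float, float, int]:
    # Express every objective as "higher is better" by negating the feature count.
    return (c.r2, c.ad_coverage, -c.n_features)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    oa, ob = objectives(a), objectives(b)
    return all(x >= y for x, y in zip(oa, ob)) and any(x > y for x, y in zip(oa, ob))


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands if o is not c)]


# Made-up candidates, not results from the paper.
candidates = [
    Candidate("A", 0.80, 0.90, 5),
    Candidate("B", 0.85, 0.85, 8),
    Candidate("C", 0.78, 0.88, 6),   # worse than A on all three objectives
    Candidate("D", 0.85, 0.85, 10),  # same fit and AD as B, but more features
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy example A and B remain on the front because neither is better than the other on every objective. A multi-objective genetic algorithm would typically use such a ranking to decide which candidate feature subsets survive to the next generation.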
The Python implementation of MaNGA is available at https://github.com/Greco-Lab/MaNGA.
Supplementary data are available at Bioinformatics online.