Balabin Ilya A, Judson Richard S
Leidos, Inc., 109 TW Alexander Drive, MD N127-01, Research Triangle Park, NC, 27711, USA.
US EPA, 109 TW Alexander Drive, ORD, NCCT, Research Triangle Park, NC, 27711, USA.
J Cheminform. 2018 Sep 18;10(1):47. doi: 10.1186/s13321-018-0300-0.
Quantitative structure-activity relationship (QSAR) models are important tools for discovering new drug candidates and identifying potentially harmful environmental chemicals. These models often face two fundamental challenges: a limited amount of available biological activity data, and noise or uncertainty in the activity data themselves. To address these challenges, we introduce and explore a QSAR model based on custom distance metrics in the structure-activity space.
The model is built on top of the k-nearest neighbor model, incorporating non-linearity not only in the chemical structure space, but also in the biological activity space. The model is tuned and evaluated using activity data for human estrogen receptor from the US EPA ToxCast and Tox21 databases.
The model closely trails the CERAPP consensus model (built on top of 48 individual human estrogen receptor activity models) in agonist activity predictions and consistently outperforms the CERAPP consensus model in antagonist activity predictions.
We suggest that incorporating non-linear distance metrics may significantly improve QSAR model performance when the available biological activity data are limited.
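The idea of a k-nearest neighbor prediction with non-linearity in both the structure space and the activity space can be illustrated with a minimal sketch. The abstract does not give the actual functional form, so everything below is an assumption made for illustration: binary fingerprints compared by Tanimoto similarity, a hypothetical exponent `y` applied to the similarities, a hypothetical exponent `x` applied to activities scaled to [0, 1], and the function names themselves. It is not the authors' implementation.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def predict_activity(query_fp, train_fps, train_acts, k=3, x=2.0, y=3.0):
    """Similarity-weighted power mean over the k most similar chemicals.

    Assumed (hypothetical) form, not taken from the paper:
        A = (sum_j S_j**y * A_j**x / sum_j S_j**y) ** (1/x)
    where S_j is the Tanimoto similarity to neighbor j and A_j is its
    activity in [0, 1]. With x = y = 1 this reduces to an ordinary
    similarity-weighted kNN average.
    """
    sims = np.array([tanimoto(query_fp, fp) for fp in train_fps])
    acts = np.asarray(train_acts, dtype=float)

    # Indices of the k nearest neighbors (highest similarity first).
    nearest = np.argsort(sims)[::-1][:k]

    weights = sims[nearest] ** y   # non-linearity in structure space
    powered = acts[nearest] ** x   # non-linearity in activity space

    total = weights.sum()
    if total == 0:
        # No structurally similar neighbor: predict inactive.
        return 0.0
    return float((np.sum(weights * powered) / total) ** (1.0 / x))


if __name__ == "__main__":
    # Toy data: three training chemicals with 4-bit fingerprints.
    fps = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
    acts = [1.0, 0.5, 0.0]
    print(predict_activity([1, 1, 0, 0], fps, acts, k=2))
```

In this sketch, larger values of `y` concentrate the prediction on the closest structural neighbors, while `x` controls how strongly highly active neighbors dominate the average; both would need to be tuned against held-out data.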