Stolbov L A, Filimonov D A, Poroikov V V
Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russian Federation.
SAR QSAR Environ Res. 2022 Oct;33(10):793-804. doi: 10.1080/1062936X.2022.2139751.
The accuracy and performance of (Q)SAR models depend significantly on the data used for training. Datasets assembled from publicly available databases contain structures from different chemical classes and have a highly imbalanced ratio of actives to inactives. Hundreds of structural descriptors are currently used in (Q)SAR studies, and this abundance raises the problem of the stability of the constructed (Q)SAR models. The methods frequently used to select a small fraction of the 'best' descriptors usually lack sufficient mathematical justification. To overcome these problems, we propose a new approach to building a self-consistent classifier for SAR analysis. Logistic (SCLC) and extreme (SCEC) extensions of self-consistent regression (SCR) were implemented to enhance the classification capabilities of SCR. The approach was applied to develop classification models for inhibitory activity endpoints in HIV-1-related data and for toxicity endpoints, and the models' performance was estimated by fivefold cross-validation. The proposed SCLC and SCEC models showed accuracy comparable to that of models developed using the original SCR and a support vector machine. The advantages of our approach in feature selection yield more generalizable (Q)SAR models. In particular, the crucial factors responsible for the observed value are determined unambiguously.
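As an illustration of the evaluation setting described above (fivefold cross-validation on data with few actives), the following is a minimal sketch, not the authors' implementation. It assumes stratified fold assignment and balanced accuracy as the metric, neither of which the abstract specifies. The helper names `stratified_folds` and `balanced_accuracy` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class ratios.

    Assumption: stratification is used here because the data are
    imbalanced; the abstract only states 'fivefold cross-validation'.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumed metric, 0/1 labels).

    Requires at least one active and one inactive in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy labels: 10 actives among 100 compounds.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
```

With these toy labels each fold holds 20 compounds, 2 of them active, so every held-out fold keeps the original 1:9 ratio. A trivial classifier that predicts "inactive" for every compound reaches 90% plain accuracy but only 0.5 balanced accuracy, which is why an imbalance-aware metric is a reasonable choice in this setting.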