Department of Computer Science, University of Tsukuba, Tsukuba, Japan.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad615.
Numerous high-accuracy drug-target affinity (DTA) prediction models, whose performance relies heavily on drug and target feature information, have been developed at the expense of simplicity and interpretability. Feature extraction and optimization constitute a critical step that strongly influences model performance, robustness, and interpretability. Many existing studies aim to characterize drugs and targets comprehensively by extracting features from multiple perspectives; however, this approach has two drawbacks: (i) it yields an abundance of redundant or noisy features; and (ii) the resulting feature sets are often high-dimensional.
In this study, to obtain a model with high accuracy and strong interpretability, we apply a range of traditional and cutting-edge feature selection and dimensionality reduction techniques to self-associated features and adjacent associated features. The optimized features are then fed into a learning-to-rank model to achieve efficient DTA prediction. Extensive experiments on two commonly used datasets indicate that, among the feature optimization methods compared, regression tree-based feature selection is the most beneficial for constructing models with good performance and strong robustness. Using Shapley Additive Explanations (SHAP) values and the incremental feature selection approach, we then find that the high-quality feature subset consists of the top 150 features (150D), and that the top 20 features (20D) have a breakthrough impact on DTA prediction. In conclusion, our study thoroughly validates the importance of feature optimization in DTA prediction and serves as inspiration for constructing high-performance, highly interpretable models.
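The incremental feature selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `rank_features` and `incremental_feature_selection`, the toy importance scores (standing in for mean absolute SHAP values), and the toy `evaluate` function (standing in for training and scoring a DTA model on a feature subset) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def rank_features(importance: Sequence[float]) -> List[int]:
    """Return feature indices sorted by descending importance.

    `importance` is assumed to hold one non-negative score per feature,
    e.g. a mean absolute SHAP value. Ties keep their original index order.
    """
    return sorted(range(len(importance)), key=lambda i: -importance[i])


def incremental_feature_selection(
    ranked: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
    step: int = 1,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Score the top-k feature subsets for k = step, 2*step, ...

    Returns (best_k, curve), where curve is a list of (k, score) pairs and
    best_k is the smallest k that reaches the highest score.
    """
    curve = []
    for k in range(step, len(ranked) + 1, step):
        curve.append((k, evaluate(ranked[:k])))
    # Prefer the higher score; on ties prefer the smaller subset.
    best_k, _ = max(curve, key=lambda pair: (pair[1], -pair[0]))
    return best_k, curve


# Toy data (hypothetical): six features, of which 0, 3 and 5 are informative.
importance = [0.9, 0.01, 0.02, 0.5, 0.03, 0.3]
informative = {0, 3, 5}


def evaluate(subset: Sequence[int]) -> float:
    """Toy score: reward informative features, lightly penalize subset size."""
    return len(informative.intersection(subset)) - 0.01 * len(subset)


ranked = rank_features(importance)
best_k, curve = incremental_feature_selection(ranked, evaluate)
print(ranked, best_k)
```

In a real pipeline, `evaluate` would retrain and cross-validate the predictor on each candidate subset, and a `step` larger than 1 would keep the number of retrainings manageable when there are thousands of ranked features.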