Atas Guvenilir Heval, Doğan Tunca
Biological Data Science Laboratory, Department of Computer Engineering, Hacettepe University, Ankara, Turkey.
Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey.
J Cheminform. 2023 Feb 6;15(1):16. doi: 10.1186/s13321-023-00689-w.
The identification of drug/compound-target interactions (DTIs) constitutes the basis of drug discovery, for which computational predictive approaches have been developed. As a relatively new data-driven paradigm, proteochemometric (PCM) modeling utilizes both protein and compound properties as a pair at the input level and processes them via statistical/machine learning. The representation of input samples (i.e., proteins and their ligands) in the form of quantitative feature vectors is crucial for the extraction of interaction-related properties during the artificial learning and subsequent prediction of DTIs. Lately, the representation learning approach, in which input samples are automatically featurized via training and applying a machine/deep learning model, has been utilized in biomedical sciences. In this study, we performed a comprehensive investigation of different computational approaches/techniques for protein featurization (including both conventional approaches and the novel learned embeddings), data preparation and exploration, machine learning-based modeling, and performance evaluation with the aim of achieving better data representations and more successful learning in DTI prediction. For this, we first constructed realistic and challenging benchmark datasets on small, medium, and large scales to be used as reliable gold standards for specific DTI modeling tasks. We developed and applied a network analysis-based splitting strategy to divide datasets into structurally different training and test folds. Using these datasets together with various featurization methods, we trained and tested DTI prediction models and evaluated their performance from different angles. 
Our main findings can be summarized in three points: (i) random splitting of datasets into train and test folds leads to near-complete data memorization and produces highly over-optimistic results, and should therefore be avoided; (ii) learned protein sequence embeddings work well in DTI prediction and offer high potential, even though interaction-related properties of proteins (e.g., structures) are not used during their self-supervised model training; and (iii) during the learning process, PCM models tend to rely heavily on compound features while partially ignoring protein features, primarily due to the inherent bias in DTI data, which indicates the need for new and unbiased datasets. We hope this study will aid researchers in designing robust and high-performing data-driven DTI prediction systems that have real-world translational value in drug discovery.
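The abstract does not spell out the network analysis-based splitting strategy, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that compounds and proteins are nodes, that interactions and precomputed similarity links are edges, and that whole connected components are assigned to either the training or the test fold so that no similar entity appears in both. The function names, the `similar_compounds`/`similar_proteins` inputs, and the smallest-components-first assignment rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def component_split(pairs, similar_compounds=(), similar_proteins=(),
                    test_fraction=0.2):
    """Split (compound, protein) pairs so that no graph edge crosses the folds.

    Assumption: similarity links are given as precomputed ID pairs, e.g. from
    thresholding a fingerprint or sequence similarity matrix.
    """
    nodes, edges = set(), []
    for c, p in pairs:  # interaction edges
        nodes.update([("c", c), ("p", p)])
        edges.append((("c", c), ("p", p)))
    for a, b in similar_compounds:  # compound-compound similarity edges
        nodes.update([("c", a), ("c", b)])
        edges.append((("c", a), ("c", b)))
    for a, b in similar_proteins:  # protein-protein similarity edges
        nodes.update([("p", a), ("p", b)])
        edges.append((("p", a), ("p", b)))

    components = connected_components(nodes, edges)
    node_to_comp = {n: i for i, comp in enumerate(components) for n in comp}

    # Both ends of a pair share a component, so the compound node is enough.
    pairs_by_comp = defaultdict(list)
    for c, p in pairs:
        pairs_by_comp[node_to_comp[("c", c)]].append((c, p))

    # Fill the test fold with the smallest components first (deterministic order).
    groups = sorted(pairs_by_comp.values(), key=lambda g: (len(g), sorted(g)))
    target = test_fraction * len(pairs)
    train, test = [], []
    for group in groups:
        if len(test) + len(group) <= target:
            test.extend(group)
        else:
            train.extend(group)
    return train, test


if __name__ == "__main__":
    toy_pairs = [("c1", "p1"), ("c2", "p1"), ("c3", "p2"),
                 ("c4", "p3"), ("c5", "p3")]
    train, test = component_split(toy_pairs, similar_compounds=[("c1", "c2")])
    print("train:", train)
    print("test:", test)
```

On real DTI data a graph like this often collapses into one very large component, so a practical version would need an additional step (for example, stricter similarity thresholds or discarding some bridging data points) before components can be distributed across folds; that step is left out here.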