Computer School, Hubei University of Arts and Science, Xiangyang, 441053, China.
Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403, China.
BMC Bioinformatics. 2020 Apr 20;21(1):150. doi: 10.1186/s12859-020-3488-8.
G protein-coupled receptors (GPCRs) mediate a variety of important physiological functions, are closely related to many diseases, and constitute the most important target family of modern drugs. Research on GPCR analysis and GPCR ligand screening is therefore a hotspot of new drug development. Accurately identifying GPCR-drug interactions is one of the key steps in designing GPCR-targeted drugs. However, experimentally determining the interactions of GPCR-drug pairs on a large scale is prohibitively expensive, so it is of great significance to predict these interactions directly from the molecular sequences. With the accumulation of known GPCR-drug interaction data, it has become feasible to develop sequence-based machine learning models that predict whether a query GPCR-drug pair interacts.
In this paper, a new sequence-based method is proposed to identify GPCR-drug interactions. For GPCRs, we use a novel bag-of-words (BoW) model to extract sequence features; it captures more pattern information, from low order to high order, while limiting the dimension of the feature space. For drug molecules, we use the discrete Fourier transform (DFT) to extract higher-order pattern information from the original molecular fingerprints. The feature vectors of the two kinds of molecules are concatenated and fed into a simple prediction engine, the distance-weighted K-nearest-neighbor (DWKNN) classifier. This basic method can easily be enhanced through ensemble learning. Tests on recently constructed GPCR-drug interaction datasets show that the proposed methods generalize better than existing sequence-based machine learning methods, including an unconventional method whose prediction performance was further improved by a post-processing procedure (PPP).
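The pipeline described above (sequence features, DFT of the fingerprint, concatenation, distance-weighted voting) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the plain k-mer count stands in for the paper's modified BoW model, the inverse-distance weights are one common DWKNN weighting chosen here as an assumption, and the function names `kmer_bow`, `dft_features`, `pair_features`, and `dwknn_predict` are hypothetical.

```python
import numpy as np
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_bow(sequence, k=2):
    """Plain k-mer bag-of-words (assumption: a stand-in for the paper's
    modified BoW model): normalized counts of every length-k word."""
    vocab = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    counts = np.zeros(len(vocab))
    for i in range(len(sequence) - k + 1):
        idx = vocab.get(sequence[i:i + k])
        if idx is not None:  # skip words containing non-standard residues
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def dft_features(fingerprint):
    """Amplitude spectrum of a binary molecular fingerprint (real-input DFT)."""
    return np.abs(np.fft.rfft(np.asarray(fingerprint, dtype=float)))


def pair_features(sequence, fingerprint, k=2):
    """Concatenate the GPCR and drug feature vectors for one pair."""
    return np.concatenate([kmer_bow(sequence, k), dft_features(fingerprint)])


def dwknn_predict(X_train, y_train, x_query, k=3, eps=1e-12):
    """Distance-weighted KNN: each of the k nearest neighbors votes with
    weight 1 / (distance + eps) (assumed weighting scheme)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    scores = {}
    for label, w in zip(y_train[nearest], weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```

In use, each known GPCR-drug pair would be turned into one row of `X_train` with `pair_features`, labeled 1 (interacting) or 0 (non-interacting) in `y_train`, and a query pair would be classified by passing its feature vector to `dwknn_predict`. An ensemble variant could, for example, vote over several values of `k`, though the abstract does not specify how the ensemble is built.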
The proposed methods are effective for GPCR-drug interaction prediction and may also be applicable to other target-drug interaction prediction tasks or to protein-protein interaction prediction. In addition, the newly proposed feature extraction method for GPCR sequences is a modified version of the traditional BoW model and may be useful for protein classification or attribute prediction problems. The source code of the proposed methods is freely available for academic research at https://github.com/wp3751/GPCR-Drug-Interaction.