Lee Munhwan, Kim Hyeyeon, Joe Hyunwhan, Kim Hong-Gee
Biomedical Knowledge Engineering Laboratory, Seoul National University, 1 Gwanak-ro, Seoul, Republic of Korea.
J Cheminform. 2019 Jul 9;11(1):46. doi: 10.1186/s13321-019-0368-1.
Analysis of compound-protein interactions (CPIs) has become a crucial prerequisite for drug discovery and drug repositioning. In vitro experiments are commonly used to identify CPIs, but it is not feasible to explore the full molecular and proteomic space through experimental approaches alone. Advances in machine learning for predicting CPIs have made significant contributions to drug discovery. Deep neural networks (DNNs), which have recently been applied to CPI prediction, have performed better than shallow classifiers. However, such techniques commonly require a considerable volume of dense data for each training target. Although the amount of publicly available CPI data has grown rapidly, public data is still sparse and contains many measurement errors. In this paper, we propose a novel method, Multi-channel PINN, to fully utilize sparse data in terms of representation learning. With representation learning, Multi-channel PINN can use DNNs in three ways: as a classifier, as a feature extractor, and as an end-to-end learner. Multi-channel PINN can be fed both low-level and high-level representations and incorporates each of them by combining all three approaches within a single model. To fully utilize sparse public data, we additionally explore the potential of transferring representations from training tasks to test tasks. As a proof of concept, Multi-channel PINN was evaluated on fifteen combinations of feature pairs to investigate how they affect performance in terms of highest performance, initial performance, and convergence speed. The experimental results indicate that the multi-channel models using protein features performed better than single-channel models or multi-channel models using compound features. Therefore, Multi-channel PINN can be advantageous when used with appropriate representations.
Additionally, we pretrained models on a training task and then fine-tuned them on a test task to determine whether Multi-channel PINN can capture general representations for compounds and proteins. We found significant differences in performance between pretrained and non-pretrained models.
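The multi-channel idea and the transfer of representations described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The class and function names (`Channel`, `MultiChannelPairwiseNet`, `transfer_channels`), the layer sizes, the feature dimensions, and the use of one dense layer per channel are all assumptions made for illustration. The sketch shows only a forward pass and a weight copy, and omits training entirely.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(in_dim, out_dim):
    """Randomly initialised weights and zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)


class Channel:
    """One input channel: a single ReLU layer encoding one feature type (assumed)."""

    def __init__(self, in_dim, hidden_dim):
        self.W, self.b = dense(in_dim, hidden_dim)

    def __call__(self, x):
        return np.maximum(x @ self.W + self.b, 0.0)


class MultiChannelPairwiseNet:
    """Hypothetical model: one channel per compound or protein representation,
    concatenated and passed to a sigmoid output unit."""

    def __init__(self, compound_dims, protein_dims, hidden_dim=16):
        self.compound_channels = [Channel(d, hidden_dim) for d in compound_dims]
        self.protein_channels = [Channel(d, hidden_dim) for d in protein_dims]
        n_channels = len(compound_dims) + len(protein_dims)
        self.W_out, self.b_out = dense(n_channels * hidden_dim, 1)

    def forward(self, compound_inputs, protein_inputs):
        # Encode every representation in its own channel, then merge.
        encoded = [ch(x) for ch, x in zip(self.compound_channels, compound_inputs)]
        encoded += [ch(x) for ch, x in zip(self.protein_channels, protein_inputs)]
        merged = np.concatenate(encoded, axis=1)
        logits = merged @ self.W_out + self.b_out
        return 1.0 / (1.0 + np.exp(-logits))  # interaction probability


def transfer_channels(source, target):
    """Copy channel weights only; the output layer of `target` stays untouched."""
    pairs = zip(source.compound_channels + source.protein_channels,
                target.compound_channels + target.protein_channels)
    for src, dst in pairs:
        dst.W, dst.b = src.W.copy(), src.b.copy()


# One compound representation and two protein representations (dimensions assumed).
model = MultiChannelPairwiseNet(compound_dims=[128], protein_dims=[64, 100])
batch = 4
compounds = [rng.random((batch, 128))]
proteins = [rng.random((batch, 64)), rng.random((batch, 100))]
probs = model.forward(compounds, proteins)
print(probs.shape)  # (4, 1)

# Start a model for a new task from the channels of the first model.
finetune_model = MultiChannelPairwiseNet(compound_dims=[128], protein_dims=[64, 100])
transfer_channels(model, finetune_model)
```

In this sketch, a single-channel model is the special case with exactly one compound and one protein representation. Copying only the channel weights mirrors the idea of reusing learned representations while leaving the task-specific output layer to be trained on the new task.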