Lee Kyunghoon, Jang Jinho, Seo Seonghwan, Lim Jaechang, Kim Woo Youn
Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
Chem Sci. 2021 Dec 14;13(2):554-565. doi: 10.1039/d1sc05248a. eCollection 2022 Jan 5.
Drug-likeness prediction is important for the virtual screening of drug candidates. It is challenging because drug-likeness is presumably associated with the whole set of properties necessary to pass through clinical trials, and thus no definite data for regression are available. Recently, binary classification models based on graph neural networks have been proposed, but their performance depends strongly on the choice of the negative set used for training. Here we propose a novel unsupervised learning model that requires only known drugs for training. We adopted a language model based on a recurrent neural network for unsupervised learning. Unlike such classification models, it showed relatively consistent performance across different datasets. In addition, the unsupervised learning model provides drug-likeness scores whose distributions are well separated across datasets, with mean values increasing as the datasets comprise molecules from later stages of the drug development process, whereas the classification model predicted a polarized distribution with two extreme values for all datasets, presumably owing to overconfident predictions on unseen data. Thus, this new concept offers a pragmatic tool for drug-likeness scoring and can further be applied to other biochemical applications.
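The idea of scoring molecules with a language model trained only on known drugs can be illustrated with a minimal sketch. This is not the model described above: a character-level bigram model with add-one smoothing stands in for the recurrent neural network so that the example runs with the standard library alone. The assumptions are that molecules are written as SMILES strings and that the score is the mean per-character log-likelihood. The four-molecule training list, the function names `train_bigram` and `score`, and the start/end tokens are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical start/end tokens; multi-character so they cannot clash with SMILES symbols.
BOS, EOS = "<s>", "</s>"

# Toy "known drugs" training set (aspirin, paracetamol, ibuprofen, caffeine).
# A real model would be trained on a full set of approved drugs.
KNOWN_DRUGS = [
    "CC(=O)Oc1ccccc1C(=O)O",
    "CC(=O)Nc1ccc(O)cc1",
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
]


def train_bigram(smiles_list):
    """Count character bigrams over positive examples only (no negative set)."""
    counts = defaultdict(Counter)
    vocab = {EOS}
    for smiles in smiles_list:
        seq = [BOS] + list(smiles) + [EOS]
        vocab.update(smiles)
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts, vocab


def score(smiles, counts, vocab):
    """Mean per-character log-probability; higher means closer to the training drugs.

    Add-one smoothing is used, with one extra slot that pools all unseen characters.
    Length normalization is an assumption of this sketch.
    """
    seq = [BOS] + list(smiles) + [EOS]
    n_slots = len(vocab) + 1
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        ctx = counts.get(prev, Counter())  # .get avoids mutating the model
        total += math.log((ctx[nxt] + 1) / (sum(ctx.values()) + n_slots))
    return total / (len(seq) - 1)


counts, vocab = train_bigram(KNOWN_DRUGS)

for name, smiles in [("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
                     ("dodecane", "CCCCCCCCCCCC")]:
    print(f"{name}: {score(smiles, counts, vocab):.3f}")
```

Because the model sees only positive examples, the score is a continuous likelihood rather than a class probability, which is the property that lets score distributions shift gradually between datasets instead of collapsing to two extremes. With a training set this small the numbers themselves carry no chemical meaning.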