Zhang Xiao-Chen, Wu Cheng-Kun, Yi Jia-Cai, Zeng Xiang-Xiang, Yang Can-Qun, Lu Ai-Ping, Hou Ting-Jun, Cao Dong-Sheng
Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China.
School of Information Technology, Shangqiu Normal University, Shangqiu 476000, Henan, P. R. China.
Research (Wash D C). 2022 Dec 15;2022:0004. doi: 10.34133/research.0004. eCollection 2022.
Accurate prediction of the pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches rely heavily on handcrafted descriptors and/or fingerprints, which require extensive human expert knowledge. With the rapid progress of artificial intelligence, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, when applied to molecular property prediction, existing deep learning methods usually suffer from a scarcity of labeled data and an inability to share information between different tasks, resulting in poor generalization capability. Here, we propose a novel multitask learning BERT (Bidirectional Encoder Representations from Transformers) framework, named MTL-BERT, which leverages large-scale pretraining, multitask learning, and SMILES (simplified molecular input line entry specification) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings, and then fine-tunes the pretrained model on multiple downstream tasks simultaneously, leveraging their shared information. Meanwhile, SMILES enumeration is used as a data augmentation strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help the model learn the key relevant patterns from complex SMILES strings. Experimental results show that the pretrained MTL-BERT model, with little additional fine-tuning, achieves much better performance than state-of-the-art methods on most of the 60 practical molecular datasets. Additionally, MTL-BERT leverages attention mechanisms to focus on the SMILES character features essential to the target properties, providing model interpretability.
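To make the SMILES enumeration strategy concrete, below is a minimal sketch that generates multiple equivalent SMILES strings for one molecule using RDKit's randomized SMILES writer; the helper name enumerate_smiles and the variant count are illustrative assumptions, not the paper's code.

    from rdkit import Chem

    def enumerate_smiles(smiles: str, n_variants: int = 10) -> list:
        # Parse once; return an empty list for unparseable input.
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return []
        # doRandom=True randomizes the atom traversal order, so each call
        # can yield a different but chemically equivalent SMILES string.
        variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                    for _ in range(n_variants)}
        return sorted(variants)

    # Example: several equivalent encodings of aspirin.
    for s in enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O", n_variants=5):
        print(s)

During training, each randomized string serves as an extra sample carrying the same label; at test time, one common way to apply enumeration is to average the predictions made over several enumerated strings of the same molecule.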
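The multitask fine-tuning stage can be pictured as one shared pretrained encoder feeding a separate lightweight prediction head per downstream dataset. The sketch below, assuming PyTorch, shows this layout; the class and attribute names are illustrative, and the encoder argument stands in for the pretrained SMILES BERT rather than the paper's actual implementation.

    import torch
    import torch.nn as nn

    class SharedEncoderMTL(nn.Module):
        # One pretrained encoder shared by all tasks, plus one small
        # prediction head per task, so gradients from every dataset
        # update the shared representation.
        def __init__(self, encoder: nn.Module, hidden_dim: int, n_tasks: int):
            super().__init__()
            self.encoder = encoder
            self.task_heads = nn.ModuleList(
                nn.Linear(hidden_dim, 1) for _ in range(n_tasks)
            )

        def forward(self, tokens: torch.Tensor, task_id: int) -> torch.Tensor:
            # Assumes the encoder returns (batch, seq_len, hidden_dim);
            # the first position is used as a pooled molecule embedding.
            h = self.encoder(tokens)
            pooled = h[:, 0]
            return self.task_heads[task_id](pooled)

During fine-tuning, batches from different datasets are routed to their own heads while updating the shared encoder, which is how related tasks can exchange information even when each has few labeled examples.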