
A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence.

Authors

Zheng Xiaofan, Tomiura Yoichi

Affiliation

Graduate School of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, Fukuoka, Japan.

Publication

J Cheminform. 2024 Jun 19;16(1):71. doi: 10.1186/s13321-024-00848-7.

Abstract

Among the various molecular properties and their combinations, obtaining a desired molecular property through theory or experiment is a costly process. Using machine learning to analyze molecular structure features and predict molecular properties is a potentially efficient alternative for accelerating such prediction. In this study, we analyze molecular properties through molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network to extract molecular structural features and predict molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In our experiments, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets as well as the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction. Scientific contribution: We propose 2-encoder pretraining, motivated by two observations: symbols in a SMILES sequence depend less on their context than words in a natural-language sentence do, and a single compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is designed for natural language.
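To make the input representation concrete, the sketch below shows the kind of regex-based SMILES tokenizer commonly used when feeding SMILES into a BERT-style encoder. This is an illustrative assumption, not the authors' implementation: the paper does not specify its tokenization scheme, and the regex here is a simplified version of patterns used in the SMILES-modeling literature.

```python
import re

# Hypothetical tokenizer (not from the paper): split a SMILES string into the
# symbols a BERT-style encoder would consume. Alternatives are tried in order,
# so bracket atoms and two-letter elements win over single characters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"        # bracket atoms, e.g. [C@@H], [NH3+]
    r"|Br|Cl|Si|Se"       # common two-letter elements
    r"|[BCNOSPFIbcnosp]"  # single-letter atoms (lowercase = aromatic)
    r"|[=#\-+\\/%()\.]"   # bonds, branches, charges, ring-closure prefix
    r"|\d)"               # ring-closure digits
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

# e.g. tokenize_smiles("CC(=O)O") -> ['C', 'C', '(', '=', 'O', ')', 'O']
aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
```

Because one compound maps to many SMILES strings (a point the scientific contribution above turns on), a pretraining pipeline built on such a tokenizer can enumerate several tokenizations of the same molecule from its different SMILES forms.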


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6031/11186148/3656195b3c75/13321_2024_848_Fig1_HTML.jpg
