Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Beijing 100084, P. R. China.
Chem Commun (Camb). 2021 Mar 14;57(21):2633-2636. doi: 10.1039/d0cc07384a. Epub 2021 Feb 15.
Establishing quantitative structure-property relationships for the rational design of small molecule drugs at the early discovery stage is highly desirable. Using natural language processing (NLP), we propose a machine learning model that processes the line notation (SMILES) of small organic molecules to predict their melting points. The model's prediction accuracy benefits from training on different canonicalized SMILES forms of the same molecules: when a combination of two different canonicalized SMILES forms is used for training, the accuracy improves. Unlike previous fragment-based or descriptor-based models, the prediction accuracy of this NLP-based model does not decrease with increasing size, complexity, and structural flexibility of the molecules. By representing the chemical structure as a natural language, this NLP-based model offers a potential tool for quantitative structure-property prediction in drug discovery and development.
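To make the idea of treating line notation as a natural language concrete, the following minimal sketch splits SMILES strings into tokens and maps them to integer ids, the usual first step before feeding text to a sequence model. It is an illustration under stated assumptions, not the authors' pipeline: the regular expression, the function names `tokenize`, `build_vocab`, and `encode`, and the two hand-written SMILES strings for aspirin are all hypothetical choices made here, and the two strings are simply alternative valid spellings of the same molecule rather than outputs of any particular canonicalization algorithm.

```python
import re

# Hypothetical SMILES tokenizer (an assumption for illustration, not the
# paper's method). Order matters: bracket atoms and two-letter atoms must be
# tried before single-letter atoms.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [Na+] or [C@@H]
    r"|Br|Cl"                # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"           # one-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # single-digit ring-closure labels
    r"|[()=#\-+\\/:.~@*$]"   # branches, bonds, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    distinct = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i + 1 for i, tok in enumerate(distinct)}


def encode(tokens, vocab):
    """Convert a token list into a list of integer ids."""
    return [vocab[tok] for tok in tokens]


# Two hand-written SMILES strings for the same molecule (aspirin).
form_a = tokenize("CC(=O)Oc1ccccc1C(=O)O")
form_b = tokenize("O=C(C)Oc1ccccc1C(=O)O")

vocab = build_vocab([form_a, form_b])
ids_a = encode(form_a, vocab)
ids_b = encode(form_b, vocab)
```

Both strings draw on the same small vocabulary but produce different id sequences, which is why presenting more than one SMILES form of a molecule gives a sequence model additional views of the same structure.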