Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
PingAn Technology, Beijing 100027, China.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae534.
Natural language is poised to become a key medium for human-machine interaction in the era of large language models. In the field of biochemistry, tasks such as property prediction and molecule mining are critically important yet technically challenging. Bridging molecular expressions in natural language and chemical language can make these tasks significantly more interpretable and easier to perform. Moreover, it can integrate chemical knowledge from various sources, leading to a deeper understanding of molecules.
Recognizing these advantages, we introduce the concept of conversational molecular design, a novel task that uses natural language to describe and edit target molecules. To better accomplish this task, we develop ChatMol, a knowledgeable and versatile generative pretrained model. The model is enhanced by incorporating experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. We evaluate several typical solutions, including large language models (e.g. ChatGPT); the results demonstrate both the difficulty of conversational molecular design and the effectiveness of our knowledge enhancement approach. Case observations and analysis offer insights and directions for further exploration of natural-language interaction in molecular discovery.
Code and data are available at https://github.com/Ellenzzn/ChatMol/tree/main.