Zheng Zhiling, Zhang Oufan, Borgs Christian, Chayes Jennifer T, Yaghi Omar M
Department of Chemistry, University of California, Berkeley, California 94720, United States.
Kavli Energy Nanoscience Institute, University of California, Berkeley, California 94720, United States.
J Am Chem Soc. 2023 Aug 16;145(32):18048-18062. doi: 10.1021/jacs.3c05819. Epub 2023 Aug 7.
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic framework (MOF) synthesis conditions from diverse formats and styles of the scientific literature. This effectively mitigates ChatGPT's tendency to hallucinate information, an issue that previously made the use of large language models (LLMs) in scientific fields challenging. Our approach involves the development of a workflow implementing three different processes for text mining, programmed by ChatGPT itself. All of them enable parsing, searching, filtering, classification, summarization, and data unification with different trade-offs among labor, speed, and accuracy. We deploy this system to extract 26,257 distinct synthesis parameters pertaining to approximately 800 MOFs sourced from peer-reviewed research articles. This process incorporates our ChemPrompt Engineering strategy to instruct ChatGPT in text mining, resulting in precision, recall, and F1 scores of 90-99%. Furthermore, with the data set built by text mining, we constructed a machine-learning model with over 87% accuracy in predicting MOF experimental crystallization outcomes and preliminarily identifying important factors in MOF crystallization. We also developed a reliable data-grounded MOF chatbot to answer questions about chemical reactions and synthesis procedures. Because this ChatGPT-based process reliably mines and tabulates diverse MOF synthesis information in a unified format using only narrative language, with no coding expertise required, we anticipate that our ChatGPT Chemistry Assistant will be very useful across various other chemistry subdisciplines.
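For readers unfamiliar with the precision, recall, and F1 scores quoted above, the following is a minimal sketch of how such metrics are commonly computed when extracted items are compared against a hand-labeled reference set. It is not the authors' evaluation code: the function name `precision_recall_f1`, the tuple layout `(compound, parameter, value)`, and the toy values are all assumptions made for illustration.

```python
def precision_recall_f1(extracted, reference):
    """Score a set of extracted items against a hand-labeled reference set.

    Hypothetical helper for illustration; not taken from the paper.
    """
    extracted, reference = set(extracted), set(reference)
    tp = len(extracted & reference)   # items extracted and also in the reference
    fp = len(extracted - reference)   # items extracted but not in the reference
    fn = len(reference - extracted)   # reference items the extraction missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up entries (placeholder compound and values, not data from the paper).
extracted = {
    ("MOF-X", "temperature", "120 C"),
    ("MOF-X", "solvent", "DMF"),
    ("MOF-X", "time", "24 h"),        # not in the reference: a false positive
}
reference = {
    ("MOF-X", "temperature", "120 C"),
    ("MOF-X", "solvent", "DMF"),
    ("MOF-X", "modulator", "acetic acid"),  # missed by extraction: a false negative
}

p, r, f1 = precision_recall_f1(extracted, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Exact tuple matching is the simplest possible comparison rule. A real evaluation of text-mined synthesis parameters would likely need to normalize units and chemical names before comparing, which this sketch does not attempt.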