Xie Tong, Wan Yuwei, Zhou Yufei, Huang Wei, Liu Yixuan, Linghu Qingyuan, Wang Shaozhou, Kit Chunyu, Grazian Clara, Zhang Wenjie, Hoex Bram
School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW, Australia.
GreenDynamics Pty. Ltd, Kensington, NSW, Australia.
Patterns (N Y). 2024 Mar 22;5(5):100955. doi: 10.1016/j.patter.2024.100955. eCollection 2024 May 10.
Materials scientists usually collect experimental data to summarize experience and predict improved materials. However, a crucial issue is how to use unstructured data efficiently to update existing structured data, particularly in applied disciplines. This study introduces a new natural language processing (NLP) task, structured information inference (SII), to address this problem. We propose an end-to-end approach that summarizes and organizes multi-layered device-level information from the literature into structured data. After comparing different methods, we fine-tuned LLaMA, which achieved an F1 score of 87.14%, and used it to update an existing perovskite solar cell dataset with articles published since its release, allowing direct use of the dataset in subsequent data analysis. Using the structured information, we developed regression tasks to predict the electrical performance of solar cells. Without any feature selection, our results show performance comparable to traditional machine-learning methods and highlight the potential of large language models for scientific knowledge acquisition and material development.