Park Yang Jeong, Jerng Sung Eun, Yoon Sungroh, Li Ju
Massachusetts Institute of Technology, Department of Nuclear Science and Engineering, Cambridge, 02139, USA.
Massachusetts Institute of Technology, Department of Materials Science and Engineering, Cambridge, 02139, USA.
Sci Data. 2024 Sep 28;11(1):1060. doi: 10.1038/s41597-024-03886-w.
The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for various applications. However, AI models often prioritize material examples that appear frequently in the scientific literature, which limits the selection of suitable candidates based on inherent physical and chemical attributes. To address this imbalance, we generated a dataset consisting of 1,453,493 natural language-material narratives from OQMD, Materials Project, JARVIS, and AFLOW2 databases based on ab initio calculation results that are more evenly distributed across the periodic table. The generated text narratives were then scored by both human experts and GPT-4 on three rubrics: technical accuracy, language and structure, and relevance and depth of content. The two sets of scores were similar, with human-scored depth of content lagging the most. The integration of multimodal data sources and large language models holds immense potential for AI frameworks to aid the exploration and discovery of solid-state materials for specific applications of interest.