Wang Xiaochao, Zhang Wanli, Zhang Wenxu
School of Integrated Circuits Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China.
J Chem Inf Model. 2024 Aug 12;64(15):5931-5943. doi: 10.1021/acs.jcim.4c00282. Epub 2024 Jul 23.
The vast body of published literature on dielectric ceramics is a natural database for big-data analysis, discovery of structure-property relationships, and property prediction. We constructed a data-mining pipeline based on natural language processing (NLP) to extract property information from about 12,900 published dielectric ceramics articles and normalized more than 20 properties. The micro-F1 scores for sentence classification, named entity recognition, relation extraction (related), and relation extraction (same) are 91.6%, 82.4%, 91.4%, and 88.3%, respectively. We show how some essential properties are distributed by publication year to reveal trends. To test the reliability of the data extraction, we trained an XGBoost model to predict the dielectric constant and used the SHAP module to interpret the contribution of each feature and identify some of the factors that determine the dielectric properties. The result shows that including × in the model can increase the dielectric constant prediction accuracy. Our work can give experimentalists some hints for improving the performance of cutting-edge materials.
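For readers unfamiliar with the metric reported above, micro-F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The following is a minimal sketch of that calculation for an extraction task, not the authors' evaluation code; the function name `micro_f1`, the `(start, end, label)` tuple layout, and the label names are assumptions made for illustration.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-sentence sets of extracted items.

    gold, pred: lists of sets, one set per sentence. Each item is any
    hashable value; here a hypothetical (start, end, label) tuple.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # items found in both gold and prediction
        fp += len(p - g)   # predicted items absent from gold
        fn += len(g - p)   # gold items the prediction missed
    if tp == 0:
        return 0.0         # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for two sentences (labels are illustrative only).
gold = [{(0, 2, "MATERIAL"), (5, 6, "PROPERTY")}, {(1, 2, "VALUE")}]
pred = [{(0, 2, "MATERIAL")}, {(1, 2, "VALUE"), (3, 4, "PROPERTY")}]
score = micro_f1(gold, pred)
```

In this toy case the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3. Because counts are pooled before averaging, frequent classes weigh more heavily than rare ones.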