Glasby Lawson T, Gubsch Kristian, Bence Rosalee, Oktavian Rama, Isoko Kesler, Moosavi Seyed Mohamad, Cordiner Joan L, Cole Jason C, Moghadam Peyman Z
Department of Chemical and Biological Engineering, The University of Sheffield, Sheffield S1 3JD, U.K.
Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada.
Chem Mater. 2023 May 18;35(11):4510-4524. doi: 10.1021/acs.chemmater.3c00788. eCollection 2023 Jun 13.
The vastness of materials space, particularly for metal-organic frameworks (MOFs), makes it difficult to identify promising materials for specific applications efficiently. Although high-throughput computational approaches, including machine learning, have been useful for the rapid screening and rational design of MOFs, they tend to neglect descriptors related to synthesis. One way to improve the efficiency of MOF discovery is to data-mine published MOF papers and extract the materials informatics knowledge contained in journal articles. Here, by adapting the chemistry-aware natural language processing tool ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text-mined over 52,680 associated properties, including the synthesis method, solvent, organic linker, metal precursor, and topology. Additionally, we developed an alternative data extraction technique that obtains and transforms the chemical names assigned to each CSD entry to determine linker types for each structure in the CSD MOF subset. These data enabled us to match MOFs to a list of known linkers provided by Tokyo Chemical Industry UK Ltd. (TCI) and to analyze the cost of these important chemicals. This centralized, structured database reveals the MOF synthetic data embedded within thousands of MOF publications and contains further topology, metal type, accessible surface area, largest cavity diameter, pore limiting diameter, open metal sites, and density calculations for all 3D MOFs in the CSD MOF subset. The DigiMOF database and associated software are publicly available so that other researchers can rapidly search for MOFs with specific properties, analyze alternative MOF production pathways, and create additional parsers to search for further desirable properties.
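The linker-matching step described above (pairing each MOF's extracted linker with a supplier price list) can be illustrated with a minimal sketch. This is not the DigiMOF code or schema: the `MofRecord` fields, the `normalize` rule, the MOF names, and the prices below are all hypothetical placeholders chosen for illustration, and real chemical-name matching would need more careful handling of synonyms.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class MofRecord:
    """Hypothetical record layout; not the actual DigiMOF schema."""
    name: str
    linker: Optional[str]
    solvent: Optional[str]


def normalize(chemical_name: str) -> str:
    """Lowercase the name and keep only letters and digits,
    so that spacing, hyphenation, and case variants compare equal."""
    return "".join(ch for ch in chemical_name.lower() if ch.isalnum())


def match_linker_prices(
    records: Iterable[MofRecord], price_list: Dict[str, float]
) -> Dict[str, float]:
    """Return {MOF name: linker price} for records whose linker is listed."""
    prices = {normalize(name): price for name, price in price_list.items()}
    matched = {}
    for record in records:
        if record.linker is None:
            continue  # no linker was extracted for this material
        key = normalize(record.linker)
        if key in prices:
            matched[record.name] = prices[key]
    return matched


# Placeholder data: names and prices are invented, not taken from TCI or DigiMOF.
records = [
    MofRecord("MOF-A", "Terephthalic acid", "DMF"),
    MofRecord("MOF-B", "trimesic-acid", "water"),
    MofRecord("MOF-C", None, "ethanol"),
    MofRecord("MOF-D", "unlisted linker", "DMF"),
]
price_list = {"terephthalic acid": 1.0, "Trimesic Acid": 2.0}

matched = match_linker_prices(records, price_list)
```

Normalizing both sides before comparison is a deliberately simple choice: it tolerates formatting differences between text-mined names and catalog names, but it will not reconcile true synonyms (for example, a trivial name versus a systematic name).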