Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea.
Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, Valais, Ecole Polytechnique Fédérale de Lausanne (EPFL), Rue de l'Industrie 17, CH-1951 Sion, Switzerland.
J Chem Inf Model. 2018 Feb 26;58(2):244-251. doi: 10.1021/acs.jcim.7b00608. Epub 2018 Jan 29.
We have developed a simple text mining algorithm that identifies the surface areas and pore volumes of metal-organic frameworks (MOFs) using manuscript HTML files as inputs. To locate these two quantities, the algorithm searches for the units commonly associated with them (e.g., m²/g, cm³/g). On a sample set of over 200 MOFs, the algorithm correctly identified 90% of the surface area values and 88.8% of the pore volume values. Further application to a test set of randomly chosen MOF HTML files yielded accuracies of 73.2% and 85.1% for the two respective quantities. Most of the errors stem from unorthodox sentence structures that made it difficult to identify the correct data, and from bolded shorthand labels for MOFs (e.g., 1a) that made it difficult to identify a material's real name. Tools of this type will become useful for discovering structure-property relationships among MOFs and for collecting large sets of reference data.
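The unit-based search described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`html_to_text`, `extract_values`), the regular expressions, and the assumption that the units appear as m²/g and cm³/g (written with `<sup>` tags or Unicode superscripts) are all hypothetical choices made for illustration.

```python
import re
from html import unescape

# Hypothetical patterns: a number (optionally with thousands separators)
# followed by a surface-area unit (m2/g) or a pore-volume unit (cm3/g).
TAG_RE = re.compile(r"<[^>]+>")
NUMBER = r"(\d+(?:,\d{3})*(?:\.\d+)?)"
PER_GRAM = r"\s*(?:/\s*g|g(?:-1|⁻¹))"
SURFACE_AREA_RE = re.compile(NUMBER + r"\s*m(?:2|²)" + PER_GRAM)
PORE_VOLUME_RE = re.compile(NUMBER + r"\s*cm(?:3|³)" + PER_GRAM)


def html_to_text(html):
    """Strip tags so that 'm<sup>2</sup>/g' becomes 'm2/g', then decode entities."""
    return unescape(TAG_RE.sub("", html))


def to_float(value):
    """Convert a matched number such as '1,250' or '0.54' to a float."""
    return float(value.replace(",", ""))


def extract_values(html):
    """Return every number that is directly followed by one of the two units."""
    text = html_to_text(html)
    return {
        "surface_area_m2_per_g": [to_float(v) for v in SURFACE_AREA_RE.findall(text)],
        "pore_volume_cm3_per_g": [to_float(v) for v in PORE_VOLUME_RE.findall(text)],
    }


sample = (
    "<p>The BET surface area of <b>1a</b> is 1250 m<sup>2</sup>/g "
    "with a pore volume of 0.54 cm<sup>3</sup>/g.</p>"
)
print(extract_values(sample))
```

A sketch like this only finds candidate values. It does not decide which MOF a value belongs to, which is where the abstract reports most errors: a bolded label such as 1a survives tag stripping as plain text and still has to be resolved to the material's real name.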