ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science.

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.

Publication information

J Chem Inf Model. 2021 Sep 27;61(9):4280-4289. doi: 10.1021/acs.jcim.1c00446. Epub 2021 Sep 16.

Abstract

The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While past work in the physical sciences has focused on the precise extraction of individual properties, attention has recently shifted to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles spanning 26 different journals, yielding an overall precision of 92.2%. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
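
The idea of organizing hierarchical data as nested information can be illustrated with a minimal sketch. This is not the ChemDataExtractor 2.0 API: it uses only Python's standard `dataclasses` module, and every name in it (`Quantity`, `LatticeParameters`, `CrystalCell`, `populate`, `precision`, and the row keys) is a hypothetical stand-in chosen for illustration. It assumes a table row has already been parsed into a flat dictionary, and shows how such a row might be folded into a parent model that contains a submodel. The `precision` helper assumes the standard definition, true positives divided by all extracted records.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Quantity:
    """A numeric value with an optional unit (hypothetical helper type)."""
    value: float
    units: Optional[str] = None


@dataclass
class LatticeParameters:
    """Submodel: cell edge lengths, each of which may be missing."""
    a: Optional[Quantity] = None
    b: Optional[Quantity] = None
    c: Optional[Quantity] = None


@dataclass
class CrystalCell:
    """Parent model that nests the LatticeParameters submodel."""
    compound: str
    space_group: Optional[str] = None
    lattice: LatticeParameters = field(default_factory=LatticeParameters)


def populate(row: dict) -> CrystalCell:
    """Fold one flat, already-parsed table row into a nested record.

    The row keys ("compound", "space_group", "a", "b", "c") are assumptions
    made for this sketch, not a format defined by the paper.
    """
    def length(key: str) -> Optional[Quantity]:
        # Only create a Quantity when the table actually reported the value.
        return Quantity(float(row[key]), "Å") if key in row else None

    return CrystalCell(
        compound=row["compound"],
        space_group=row.get("space_group"),
        lattice=LatticeParameters(a=length("a"), b=length("b"), c=length("c")),
    )


def precision(true_positives: int, false_positives: int) -> float:
    """Standard precision: correct extractions over all extractions."""
    return true_positives / (true_positives + false_positives)


# Illustrative input only; not data taken from the paper's evaluation set.
record = populate({"compound": "NaCl", "space_group": "Fm-3m", "a": "5.64"})
nested = asdict(record)  # plain nested dict, ready for serialization
```

Keeping each submodel as its own type means a missing value stays `None` rather than being silently dropped, and `asdict` yields a nested dictionary that mirrors the hierarchy of the models.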
