

PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format.

Affiliations

Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

Publication information

J Chem Inf Model. 2022 Apr 11;62(7):1633-1643. doi: 10.1021/acs.jcim.1c01198. Epub 2022 Mar 29.

Abstract

The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the "chemistry-aware" natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), references, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
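As a rough illustration of the kind of JSON metadata record the abstract describes, the sketch below pulls a DOI, a year, and any email addresses out of plain text that has already been extracted from a PDF. It is not PDFDataExtractor's API or its template logic: the function name `extract_metadata`, the regular expressions, and the output keys are assumptions made for this example. The sample string is the citation line of this record.

```python
import json
import re

# Hypothetical patterns for illustration only; they are not taken from
# PDFDataExtractor and will miss many real-world layouts.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_metadata(text: str) -> dict:
    """Return a small metadata record from already-extracted plain text."""
    doi = DOI_RE.search(text)
    year = YEAR_RE.search(text)
    return {
        # Strip sentence punctuation that often trails a DOI in running text.
        "doi": doi.group(0).rstrip(".,;") if doi else None,
        # Takes the first four-digit year found; a real tool would need context.
        "year": int(year.group(0)) if year else None,
        "emails": EMAIL_RE.findall(text),
    }


sample = (
    "J Chem Inf Model. 2022 Apr 11;62(7):1633-1643. "
    "doi: 10.1021/acs.jcim.1c01198. Epub 2022 Mar 29."
)
record = extract_metadata(sample)
print(json.dumps(record, indent=2))
```

Pattern matching of this kind works only on fields with a rigid surface form. Fields such as title, authors, and affiliation depend on page layout, which is the gap the template-based approach in the abstract is meant to address.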


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0ad0/9049592/d4f00b507e96/ci1c01198_0002.jpg
