PDF Entity Annotation Tool (PEAT).

Author information

Stahl Christopher G, Markey Kristan J, Jewell Brian C, Shams Dahnish, Taylor Michele M, Wilkins A Amina, Watford Sean, Shapiro Andy, Angrish Michelle

Affiliations

Oak Ridge National Laboratory, USA.

Office of Research and Development. United States Environmental Protection Agency.

Publication information

J Open Source Softw. 2025 Apr 8;10(108):5336. doi: 10.21105/joss.05336.

DOI:10.21105/joss.05336

PMID:40547228

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12180754/

Abstract

While text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, the tools researchers use to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of cars and trucks would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into the existing manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, all of which can be used as inputs to train ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult from formats such as PDF, which are optimized for layout and human readability, not machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance that was inherent in the underlying PDF. Textual data extraction from PDFs can also be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result of these challenges, using existing tools for development of NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, without plain-text extraction. The result is an application that allows for easier and more accurate annotations. Different knowledge domains require distinct schemas and annotation tags to support machine learning. By including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts.
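To make the idea of a user-defined schema and a machine-readable annotation concrete, the following is a minimal Python sketch. It is not PEAT's actual data model or export format, which this abstract does not describe. The tag names, the `Annotation` fields, the page and bounding-box provenance, and the example values are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical user-defined schema: tag name -> short description.
# A different knowledge domain would supply a different set of tags.
SCHEMA = {
    "Chemical": "Name of a chemical substance",
    "Dose": "Administered dose, including units",
}


@dataclass
class Annotation:
    """One labeled span, kept together with its location in the PDF."""
    tag: str      # must be one of the schema's tag names
    text: str     # the annotated text as it appears on the page
    page: int     # 1-based page number (assumed provenance field)
    bbox: tuple   # (x0, y0, x1, y1) in PDF points (assumed provenance field)


def make_annotation(tag, text, page, bbox):
    """Create an annotation, rejecting tags that the schema does not define."""
    if tag not in SCHEMA:
        raise ValueError(f"unknown tag: {tag}")
    return Annotation(tag, text, page, tuple(bbox))


def to_json(annotations):
    """Serialize annotations to JSON so other tools can consume them."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


# Made-up example values, not taken from any real document.
annotations = [
    make_annotation("Chemical", "example chemical", 3, (72.0, 540.5, 130.2, 552.0)),
]
exported = to_json(annotations)
print(exported)
```

Storing the page number and bounding box alongside each label is one way to retain the provenance that is lost when a PDF is first flattened to plain text. Validating tags against the schema keeps annotations consistent within a domain.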
