Ramakrishnan Cartic, Patnia Abhishek, Hovy Eduard, Burns Gully APC
Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292-6695, USA.
Source Code Biol Med. 2012 May 28;7(1):7. doi: 10.1186/1751-0473-7-7.
The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the 'Layout-Aware PDF Text Extraction' (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.
Our paper describes the construction and performance of an open-source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in three stages: (1) detecting blocks of contiguous text using spatial layout processing, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching the classified text blocks together in the correct order, so that text is extracted from blocks grouped by section. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89, and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in stage 1. Additionally, we have compared the accuracy of the text extracted by LA-PDFText against the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, which is commonly used to extract text from PDF. Finally, we discuss a preliminary error analysis for our system and identify further areas of improvement.
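The three stages can be illustrated with a minimal Python sketch. This is not the LA-PDFText implementation: the `Line` and `Block` types, the function names, the `column_split` and `max_gap` thresholds, and the heading rule are all hypothetical and chosen only to show the shape of a detect, classify, and stitch pipeline. It assumes that text lines with page coordinates have already been read from the PDF.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    text: str
    x: float       # left edge of the line
    y: float       # top edge, increasing down the page
    height: float  # line height, used as a proxy for font size
    page: int = 0


@dataclass
class Block:
    lines: List[Line]
    section: str = "front_matter"
    is_heading: bool = False

    @property
    def text(self) -> str:
        return " ".join(line.text for line in self.lines)


# Hypothetical rule set: one-line blocks with these texts are section headings.
SECTION_NAMES = ("abstract", "introduction", "methods", "results", "discussion")


def column_of(line: Line, column_split: float) -> int:
    # Assumption: a two-column page split at a fixed x coordinate.
    return 0 if line.x < column_split else 1


def detect_blocks(lines: List[Line], column_split: float = 300.0,
                  max_gap: float = 1.5) -> List[Block]:
    """Stage 1: merge lines that share a page and column, sit close together
    vertically, and have similar height."""
    ordered = sorted(lines, key=lambda l: (l.page, column_of(l, column_split), l.y))
    blocks: List[Block] = []
    prev = None
    for line in ordered:
        same_region = (
            prev is not None
            and prev.page == line.page
            and column_of(prev, column_split) == column_of(line, column_split)
        )
        close = same_region and (line.y - (prev.y + prev.height)) <= max_gap * prev.height
        same_size = same_region and abs(line.height - prev.height) < 0.5
        if close and same_size:
            blocks[-1].lines.append(line)
        else:
            blocks.append(Block([line]))
        prev = line
    return blocks


def classify_blocks(blocks: List[Block]) -> List[Block]:
    """Stage 2: rule-based labelling. A heading block opens a section and the
    blocks that follow inherit that section."""
    current = "front_matter"
    for block in blocks:
        first = block.lines[0].text.strip().lower()
        if len(block.lines) == 1 and first in SECTION_NAMES:
            block.is_heading = True
            current = first
        block.section = current
    return blocks


def stitch(blocks: List[Block]) -> Dict[str, str]:
    """Stage 3: join non-heading blocks per section, keeping reading order."""
    sections: Dict[str, List[str]] = {}
    for block in blocks:
        if not block.is_heading:
            sections.setdefault(block.section, []).append(block.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


# Invented sample data for a two-column page.
sample = [
    Line("Abstract", 50, 100, 12),
    Line("PDF is common.", 50, 120, 10),
    Line("Extraction is hard.", 50, 131, 10),
    Line("Methods", 350, 100, 12),
    Line("We detect blocks.", 350, 120, 10),
    Line("Results", 50, 200, 12),
    Line("It works.", 50, 220, 10),
]
blocks = detect_blocks(sample)
sections = stitch(classify_blocks(blocks))
print(sections)
```

Sorting by page, column, and vertical position before grouping is what lets the final stitching step emit text in reading order for a multi-column layout; a real system would need more robust column detection than the fixed split assumed here.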
LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.