Vist Gunn E, Husøy Trine, Diemar Michael Guy, Dirven Hubert, Roggen Erwin L, Kalyva Maria E
Norwegian Institute of Public Health, Division for Health Services, Oslo, Norway.
Norwegian Institute of Public Health, Division of Climate and Environmental Health, Oslo, Norway.
Comput Methods Programs Biomed. 2025 Oct;270:108962. doi: 10.1016/j.cmpb.2025.108962. Epub 2025 Jul 12.
Systematic reviews are widely used to identify the evidence and give an overview of the available knowledge on questions in public health and medicine. They summarize all the available data and can support knowledge-based decisions about policy, practice, and academic research. Conducting a systematic review is often time-consuming and costly.
We have developed a command-line tool in R that automates data extraction from full-text scientific papers. ExtractPDF is a data extraction tool that provides a reliable computational workflow for extracting words or combinations of words from numerous portable document format (PDF) files.
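To make the idea of extracting words or word combinations from many documents concrete, here is a minimal sketch. It is not the ExtractPDF code, which is written in R and not reproduced in this abstract. The sketch uses Python, assumes the text of each PDF has already been extracted to a string, and every name in it (`extract_terms`, `texts`, `term_groups`) is hypothetical.

```python
import re


def extract_terms(texts, term_groups):
    """Count occurrences of each term in each document.

    texts:       dict mapping a file name to its already-extracted text
                 (assumption: PDF-to-text conversion happened elsewhere).
    term_groups: dict mapping a type of information to a list of words
                 or word combinations to search for.
    Returns one row per (file, category, term) that was found.
    """
    rows = []
    for filename, text in texts.items():
        # Collapse whitespace so multi-word terms match across line breaks.
        normalized = re.sub(r"\s+", " ", text)
        for category, terms in term_groups.items():
            for term in terms:
                # Whole-word, case-insensitive match of the literal term.
                pattern = r"\b" + re.escape(term) + r"\b"
                count = len(re.findall(pattern, normalized, flags=re.IGNORECASE))
                if count:
                    rows.append(
                        {"file": filename, "category": category,
                         "term": term, "count": count}
                    )
    return rows


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    demo_texts = {"paper1.pdf": "Risk\nassessment using in vitro assays."}
    demo_terms = {"topic": ["risk assessment"], "method": ["in vitro"]}
    for row in extract_terms(demo_texts, demo_terms):
        print(row)
```

Grouping the search terms by type of information keeps the result easy to reshape into one table per type of information and per file.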
The software was applied to extract information from 299 papers that had been screened and included in a published systematic scoping review in the field of risk assessment in public health. The output is a set of tables of extracted information, one per type of information of interest and per PDF file. During the data extraction stage, the tables served as a second reviewer alongside a human reviewer, to assist and/or validate the extraction of data items.
The ExtractPDF tool has a novel pipeline architecture that automates the extraction of information from unstructured formats such as PDF files. It helped expedite the data extraction stage and reduce both human resources and errors. The tool's performance and reliability were found to be very good, with average values of 0.89 for precision, 0.92 for recall, 0.86 for accuracy, and 0.91 for F1-score.
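For readers unfamiliar with the reported metrics, the sketch below shows their standard definitions in terms of true/false positives and negatives. The abstract does not state how the counts were obtained or averaged, so the function name `classification_metrics` and the example counts are hypothetical and do not reproduce the study's evaluation. Note that an average of per-item F1-scores need not equal the F1-score computed from average precision and average recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, accuracy, and F1 from confusion counts.

    tp, fp, fn, tn: non-negative counts of true positives, false positives,
    false negatives, and true negatives. A ratio with a zero denominator
    is reported as 0.0 (a convention chosen for this sketch).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```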