Unsupervised Extraction of Body-Text from Clinical PDF Documents.

Affiliations

Division of Medical Information Sciences, Geneva University Hospitals, Switzerland.

Department of Radiology and Medical Informatics, University of Geneva, Switzerland.

Publication information

Stud Health Technol Inform. 2024 Aug 22;316:214-215. doi: 10.3233/SHTI240382.

DOI:10.3233/SHTI240382

PMID:39176711

Abstract

Automatic extraction of body-text from clinical PDF documents is necessary to enhance downstream NLP tasks, but it remains a challenge. This study presents an unsupervised algorithm designed to extract body-text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks using their content and coordinates. Evaluation results demonstrate precision scores ranging from 0.82 to 0.98, recall scores from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across sources from various medical specialties. Future work includes dynamic parameter adjustment for improved accuracy and the use of larger datasets.
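The abstract does not give implementation details, so the following is only a minimal sketch of one way clustering over aggregated pages might separate body-text from recurring page furniture. Everything in it is an assumption made for illustration, not the authors' method: the `Block` record, the normalized `(x, y)` coordinates, the small pure-Python `dbscan` helper, the `eps`, `min_samples` and `same_text_ratio` values, and the rule that a dense cluster of blocks sharing the same text is a header or footer.

```python
from collections import Counter, namedtuple

# Hypothetical text-block record: page number, normalized top-left
# coordinates in [0, 1], and the block's text content.
Block = namedtuple("Block", "page x y text")


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN over 2D points; returns one label per point (-1 = noise)."""
    n = len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2 <= eps ** 2]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)  # includes the point itself
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def extract_body(blocks, eps=0.02, min_samples=3, same_text_ratio=0.8):
    """Drop blocks that recur at the same position with the same text."""
    # Pages are aggregated: blocks from all pages are clustered together
    # by position only.
    labels = dbscan([(b.x, b.y) for b in blocks], eps, min_samples)
    # Assumption: a dense positional cluster dominated by one repeated text
    # is page furniture (header or footer), not body-text.
    boilerplate = set()
    for c in set(labels) - {-1}:
        texts = [b.text for b, label in zip(blocks, labels) if label == c]
        most_common_count = Counter(texts).most_common(1)[0][1]
        if most_common_count / len(texts) >= same_text_ratio:
            boilerplate.add(c)
    kept = [b for b, label in zip(blocks, labels) if label not in boilerplate]
    # Organize the remaining blocks in a simple top-to-bottom reading order.
    return sorted(kept, key=lambda b: (b.page, b.y, b.x))


# Made-up sample: three pages sharing a header and footer.
sample = []
for page in (1, 2, 3):
    sample.append(Block(page, 0.1, 0.05, "Hospital letterhead"))
    sample.append(Block(page, 0.1, 0.95, "Page footer"))
sample += [
    Block(1, 0.1, 0.30, "Reason for admission"),
    Block(1, 0.1, 0.50, "Clinical history"),
    Block(2, 0.1, 0.35, "Examination findings"),
    Block(2, 0.1, 0.60, "Laboratory results"),
    Block(3, 0.1, 0.40, "Diagnosis"),
    Block(3, 0.1, 0.70, "Treatment plan"),
]

body = extract_body(sample)
for block in body:
    print(block.page, block.text)
```

In practice the blocks would come from a PDF parser, and a library implementation of DBSCAN would likely replace the quadratic helper shown here. Fixed `eps` and `min_samples` values are a simplification; the abstract itself lists dynamic parameter adjustment as future work.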
