Cho Paul S, Taira Ricky K, Kangarloo Hooshang
Department of Radiation Oncology, University of Washington, Seattle, WA, USA.
AMIA Annu Symp Proc. 2003;2003:155-9.
Automated segmentation of medical reports can significantly enhance the productivity of healthcare departments. While many algorithms have been developed for document summarization, passage retrieval, and story segmentation of news feeds, much less effort has been devoted to the parsing of medical documents. We present an algorithm specifically developed for medical applications. The algorithm consists of two components. First, a rule-based algorithm detects the sections that contain labels. It utilizes a knowledge base of commonly employed heading labels and linguistic cues seen within training examples. The second component handles the detection of unlabeled sections. It uses a combination of lexical pattern recognition and a classifier based on an expectation model for a particular class of medical reports. The proposed method was evaluated on three test corpora containing a total of 129,303 report sections. Across the individual corpora, detection rates ranged from 97.4% to 99.4% for labeled sections and from 96.5% to 99.0% for unlabeled sections. The rule-based approach is particularly effective for medical reports due to the inherently structured nature of these documents.
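As a rough illustration of the first component only (detecting sections that carry a heading label), the following minimal sketch matches line-initial labels against a small lookup list. It is not the authors' implementation: the label list `HEADING_LABELS`, the "label followed by a colon" cue, and the function name `segment_labeled_sections` are all assumptions made for this example, and the classifier for unlabeled sections is not attempted.

```python
import re

# Hypothetical knowledge base of heading labels. The abstract does not list
# the labels actually used, so these are illustrative only.
HEADING_LABELS = ["CLINICAL HISTORY", "HISTORY", "TECHNIQUE", "FINDINGS", "IMPRESSION"]

# Try longer labels first so "CLINICAL HISTORY" is preferred over "HISTORY".
_LABEL_PATTERN = re.compile(
    r"^\s*("
    + "|".join(re.escape(label) for label in sorted(HEADING_LABELS, key=len, reverse=True))
    + r")\s*:\s*",
    re.IGNORECASE,
)


def segment_labeled_sections(report):
    """Split a report into (label, text) pairs using line-initial heading labels.

    Text that appears before the first recognized label is returned with a
    label of None; a real system would hand such text to a second stage.
    """
    sections = []
    current_label = None
    current_lines = []
    for line in report.splitlines():
        match = _LABEL_PATTERN.match(line)
        if match:
            # A new heading closes the section collected so far.
            if current_label is not None or current_lines:
                sections.append((current_label, " ".join(current_lines)))
            current_label = match.group(1).upper()
            rest = line[match.end():].strip()
            current_lines = [rest] if rest else []
        elif line.strip():
            current_lines.append(line.strip())
    if current_label is not None or current_lines:
        sections.append((current_label, " ".join(current_lines)))
    return sections
```

Given a short report with the lines `HISTORY: Cough.`, `FINDINGS: Lungs are clear.`, `No effusion.`, and `Impression: Normal chest.`, the function returns three labeled sections, with the continuation line folded into the FINDINGS section. A fixed label list with a colon cue is the simplest possible rule; the linguistic cues learned from training examples that the abstract mentions would replace or extend it.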