Wilson Richard A, Chapman Wendy W, Defries Shawn J, Becich Michael J, Chapman Brian E
Department of Biomedical Informatics, University of Pittsburgh, 200 Meyran Avenue, Pittsburgh, PA, USA.
J Pathol Inform. 2010 Oct 11;1:24. doi: 10.4103/2153-3539.71065.
Clinical records are often unstructured, free-text documents that create information extraction challenges and costs. Healthcare delivery and research organizations, such as the National Mesothelioma Virtual Bank, require the aggregation of both structured and unstructured data types. Natural language processing offers techniques for automatically extracting information from unstructured, free-text documents.
Five hundred and eight history and physical reports from mesothelioma patients were split into a development set (208) and a test set (300). A reference standard was developed, and experts annotated each report with regard to the patient's personal history of ancillary cancer and family history of any cancer. The Hx application was developed to process reports, extract relevant features, perform reference resolution and classify each report with regard to cancer history. Two information extraction methods, Dynamic-Window and ConText, were evaluated. Hx's classification responses using each of the two methods were measured against the reference standard. The average Cohen's weighted kappa served as the human benchmark in evaluating the system.
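As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's weighted kappa for two annotators. The abstract does not state the weighting scheme or the label set, so the linear weights, the function name `weighted_kappa` and the example labels are assumptions made for illustration only.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories):
    """Linear-weighted Cohen's kappa for two raters.

    `categories` lists the labels in their assumed order; the distance
    between two labels in that order sets the disagreement weight.
    Linear weighting is an assumption, not taken from the source.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of reports")

    k = len(categories)
    index = {label: i for i, label in enumerate(categories)}
    n = len(rater_a)

    def weight(a, b):
        # 0 for agreement, 1 for the most distant pair of categories.
        return abs(index[a] - index[b]) / (k - 1) if k > 1 else 0.0

    observed = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    # Weighted disagreement actually observed between the two raters.
    observed_dis = sum(weight(a, b) * c for (a, b), c in observed.items()) / n
    # Weighted disagreement expected by chance from the marginals.
    expected_dis = sum(
        weight(a, b) * marginal_a[a] * marginal_b[b]
        for a in categories
        for b in categories
    ) / (n * n)

    if expected_dis == 0:
        return 1.0  # both raters used a single identical label throughout
    return 1.0 - observed_dis / expected_dis


# Hypothetical annotations for four reports.
annotator_1 = ["no", "no", "yes", "yes"]
annotator_2 = ["no", "yes", "yes", "yes"]
kappa = weighted_kappa(annotator_1, annotator_2, ["no", "yes"])
```

With only two categories the weighted statistic reduces to ordinary Cohen's kappa; the weights only matter once an intermediate label (for example, an inconclusive class) is present.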
Hx had a high overall accuracy of 96.2% with each method. F-measures using the Dynamic-Window and ConText methods were 91.8% and 91.6%, respectively, which were comparable to the human benchmark of 92.8%. For the personal history classification, Dynamic-Window scored highest with 89.2%; for the family history classification, ConText scored highest with 97.6%. Both scores were comparable to the human benchmarks of 88.3% and 97.2%, respectively.
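For readers unfamiliar with the metrics reported above, the sketch below shows how accuracy and the balanced F-measure could be computed from system labels and a reference standard. The function names and the toy labels are hypothetical; the abstract does not say how F-measures were aggregated across classes, so this example scores a single positive class only.

```python
def accuracy(reference, predicted):
    """Fraction of reports whose predicted label matches the reference."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def f_measure(reference, predicted, positive):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference standard and system output for five reports.
reference = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]

acc = accuracy(reference, predicted)
f1 = f_measure(reference, predicted, positive="yes")
```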
We evaluated an automated application's performance in classifying a mesothelioma patient's personal and family history of cancer from clinical reports. To do so, the Hx application must process reports, identify cancer concepts, distinguish the known mesothelioma from ancillary cancers, recognize negation, perform reference resolution and determine the experiencer. Results indicated that both information extraction methods tested were dependent on the domain-specific lexicon and negation extraction. We showed that the more general method, ConText, performed as well as our task-specific method. Although Dynamic-Window could be modified to retrieve other concepts, ConText is more robust and performs better on inconclusive concepts. Hx could greatly improve and expedite the process of extracting data from free-text clinical records for a variety of research or healthcare delivery organizations.
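The steps listed above (finding cancer concepts, excluding the known mesothelioma, recognizing negation and determining the experiencer) can be pictured with a toy trigger-term example. This is not the Hx implementation, nor the ConText or Dynamic-Window algorithms; the lexicons, the fixed five-token window and the function name `annotate` are all invented for illustration.

```python
import re

# Hypothetical, deliberately tiny lexicons. Mesothelioma is left out on
# purpose so that the already-known diagnosis is not counted as an
# ancillary cancer.
CANCER_TERMS = {"cancer", "carcinoma", "lymphoma", "melanoma"}
NEGATION_TRIGGERS = {"no", "denies", "without"}
FAMILY_TRIGGERS = {"mother", "father", "sister", "brother", "family"}
WINDOW = 5  # assumed number of preceding tokens inspected for triggers


def annotate(sentence):
    """Return one finding per cancer term with negation and experiencer."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    findings = []
    for i, token in enumerate(tokens):
        if token not in CANCER_TERMS:
            continue
        # Only the tokens just before the concept are searched for triggers.
        context = tokens[max(0, i - WINDOW):i]
        negated = any(t in NEGATION_TRIGGERS for t in context)
        is_family = any(t in FAMILY_TRIGGERS for t in context)
        findings.append(
            {
                "concept": token,
                "negated": negated,
                "experiencer": "family" if is_family else "patient",
            }
        )
    return findings


family_example = annotate("Her mother had breast cancer.")
negated_example = annotate("The patient denies any history of melanoma.")
known_example = annotate("Known mesothelioma.")
```

A sketch this simple shows why both evaluated methods depend on the lexicon and on negation handling: any term or trigger missing from the lists is silently ignored.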