Liu Sijia, Wang Yanshan, Wen Andrew, Wang Liwei, Hong Na, Shen Feichen, Bedrick Steven, Hersh William, Liu Hongfang
Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.
Department of Computer Science and Electrical Engineering, Oregon Health & Science University, Portland, OR, United States.
JMIR Med Inform. 2020 Oct 6;8(10):e17376. doi: 10.2196/17376.
Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records.
In this paper, we present the implementation of Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE), a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text.
CREATE is a proof-of-concept system that combines structured queries with information retrieval techniques applied to natural language processing results to improve cohort retrieval performance. It uses the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component extracts common data model concepts from textual queries. We designed a hierarchical index that supports common data model concept search using information retrieval techniques and frameworks.
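As a rough illustration of how a structured filter and a concept-based text search could be combined over a patient/document hierarchy, consider the minimal sketch below. It is not the CREATE implementation: the `Patient` and `Document` classes, the `rank_patients` function, the hard-filter-then-rank strategy, the overlap-count score, and the integer concept identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Document:
    """A clinical note with concept ids assumed to come from an NLP step."""
    doc_id: str
    concept_ids: Set[int]


@dataclass
class Patient:
    """A patient with structured concepts and nested documents (hypothetical layout)."""
    patient_id: str
    structured_concepts: Set[int]
    documents: List[Document] = field(default_factory=list)


def rank_patients(
    patients: List[Patient],
    structured_required: Set[int],
    text_concepts: Set[int],
) -> List[Tuple[str, int]]:
    """Keep patients whose structured data contains every required concept,
    then rank them by how many query concepts appear in their documents."""
    ranked = []
    for patient in patients:
        # Structured criteria act as a hard filter in this sketch.
        if not structured_required <= patient.structured_concepts:
            continue
        # Text score: total concept overlap summed over the patient's documents.
        score = sum(len(text_concepts & doc.concept_ids) for doc in patient.documents)
        if score > 0:
            ranked.append((patient.patient_id, score))
    # Highest score first; ties broken by patient id for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

Calling `rank_patients(patients, {101}, {201, 202})` would return only patients whose structured record contains the placeholder concept 101, ordered by how often the placeholder concepts 201 and 202 occur in their notes. A production system would more likely delegate scoring to a search engine's ranking function rather than a raw overlap count.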
In a case study of 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient level and the document level, CREATE achieved a mean precision at 5 of 0.90, outperforming systems that use only structured data (0.54) or only unstructured text (0.74).
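For readers unfamiliar with the metric, precision at 5 is the fraction of the top 5 ranked results that are judged relevant, and the mean is taken across queries. A minimal sketch of that computation follows; the function names and the example identifiers are illustrative assumptions, not part of the study's code.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked items that are relevant.

    The denominator is always k, so a system that returns fewer than
    k results is penalized for the missing ranks.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def mean_precision_at_k(
    runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5
) -> float:
    """Average precision at k over (ranking, relevant set) pairs, one per query."""
    scores = [precision_at_k(ranked, relevant, k) for ranked, relevant in runs]
    if not scores:
        raise ValueError("at least one query is required")
    return sum(scores) / len(scores)
```

The same functions apply at either evaluation level: pass patient identifiers for patient-level scoring or document identifiers for document-level scoring.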
The implementation and its evaluation on Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that use only structured data or only unstructured text on complex textual cohort queries.