Facilitating clinical research through automation: Combining optical character recognition with natural language processing.

Affiliation

Department of Diabetes & Cancer Discovery Science, City of Hope, Duarte, CA, USA.

Publication information

Clin Trials. 2022 Oct;19(5):504-511. doi: 10.1177/17407745221093621. Epub 2022 May 24.

Abstract

BACKGROUND/AIMS

Performance status is crucial for most clinical research, whether as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet it is often recorded as free text within a patient's electronic medical record rather than coded directly, which makes it extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of these scanned documents is a further obstacle to search and retrieval, because natural language processing cannot be applied to text locked inside an image. We therefore used optical character recognition software to convert the images to a searchable format, so that natural language processing could identify pertinent performance status data elements within scanned electronic medical records.

METHODS

Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma, for whom performance status was a required data element in the analysis of prognostic factors related to recurrence and survival. Performance status had previously been abstracted manually by a clinical subject matter expert, and that abstraction served as the gold standard. Leveraging our data warehouse, we extracted the relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. We then used the Linguamatics i2e natural language processing software to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and for reduction in data extraction effort.
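The query step can be pictured with a minimal sketch. This is not the Linguamatics i2e query used in the study, and it assumes the optical character recognition step has already produced plain text. The abstract does not say which scales or phrasings were searched, so the ECOG and Karnofsky patterns, the function name `find_performance_status`, and the sample text are all illustrative assumptions.

```python
import re

# Hypothetical patterns: ECOG (0-5) and Karnofsky (0-100 in steps of 10) are
# common performance status scales, but the abstract does not name the ones
# actually queried.
ECOG_RE = re.compile(
    r"\bECOG(?:\s+performance\s+status|\s+PS)?\s*(?:score|of|is|[:=])?\s*([0-5])\b",
    re.IGNORECASE,
)
KPS_RE = re.compile(
    r"\b(?:Karnofsky|KPS)(?:\s+performance\s+status|\s+score)?"
    r"\s*(?:of|is|[:=])?\s*(100|[1-9]?0)\s*%?",
    re.IGNORECASE,
)


def find_performance_status(ocr_text):
    """Return (scale, value, snippet) tuples for each mention in OCR output."""
    hits = []
    for scale, pattern in (("ECOG", ECOG_RE), ("KPS", KPS_RE)):
        for match in pattern.finditer(ocr_text):
            # Keep some surrounding context so a reviewer can verify the hit.
            start = max(0, match.start() - 30)
            end = min(len(ocr_text), match.end() + 30)
            hits.append((scale, int(match.group(1)), ocr_text[start:end]))
    return hits


sample = "Seen in clinic. ECOG performance status: 1. KPS 80%."
for scale, value, snippet in find_performance_status(sample):
    print(scale, value, repr(snippet))
```

Returning the surrounding snippet with each hit reflects the idea of a focused review: the abstractor checks a short passage rather than reading the whole chart.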

RESULTS

Applying our optical character recognition/natural language processing pipeline yielded high accuracy and reduced the time needed to extract performance status data. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. In a second cohort, the median time to review documents for patients with performance status data present was reduced by a third. The largest time saving came from reviewing documents that did not in fact contain performance status information: a median of 18 minutes with the pipeline versus 108 minutes for manual review, an 83% reduction in data abstraction effort.
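The reported measures can be reproduced in form, though not in value, with the short sketch below. The abstract gives no underlying counts, so the true-positive, false-positive, and false-negative numbers are hypothetical, and the F score is assumed to be the balanced F1. Only the 18-minute and 108-minute medians come from the abstract.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and balanced F1 from match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical counts, for illustration only (not from the study).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

# Median review times stated in the abstract: 108 minutes manual, 18 with the pipeline.
print(round(percent_reduction(108, 18)))  # 83
```

The second calculation confirms the stated figure: (108 - 18) / 108 is about 0.833, which rounds to the reported 83% reduction.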

CONCLUSION

By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced the time needed for information retrieval in support of clinical research. Our study demonstrated that optical character recognition software is an effective way to transform scanned electronic medical record images into a form that natural language processing can work on, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review and eliminating unnecessary manual review of the entire chart, thus freeing time for abstracting other data elements that require more human interpretation.
