Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification.

Affiliations

Department of Pediatrics, University of Michigan, Ann Arbor, MI, 48109, USA.

School of Information, University of Michigan, Ann Arbor, MI, 48109, USA.

Publication information

BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):75. doi: 10.1186/s12911-019-0784-1.

PMID:30944012

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6448181/

Abstract

BACKGROUND

Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes.

METHODS

We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed.
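The idea of querying an inverted index for the frequency of lexical variants can be illustrated with a toy example. This is only a minimal sketch: the notes, the tokenizer, and the function names `build_inverted_index` and `variant_frequencies` are hypothetical and are not taken from the study, whose index covered roughly 100 million notes.

```python
import re
from collections import defaultdict

# Hypothetical toy notes for illustration; not data from the study.
NOTES = {
    1: "Patient has type 2 diabetes.",
    2: "History of type II diabetes mellitus.",
    3: "Dx: type two diabetes, diet controlled.",
    4: "Type 1 diabetes since childhood.",
}


def build_inverted_index(notes):
    """Map each lowercased alphanumeric token to the set of note IDs containing it."""
    index = defaultdict(set)
    for note_id, text in notes.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(note_id)
    return index


def variant_frequencies(index, variants):
    """Return, for each variant token, the number of notes that contain it."""
    return {v: len(index.get(v, set())) for v in variants}


index = build_inverted_index(NOTES)
# Arabic numeral, Roman numeral, and spelled-out forms of the same number.
freqs = variant_frequencies(index, ["2", "ii", "two"])
print(freqs)
```

A single-token lookup like this ignores context (for example, "ii" could also be a cranial nerve or a list marker), so a realistic analysis would likely need phrase-level queries.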

RESULTS

We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks: in one case, more than 80% of potential patients would have been missed.
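To make the cohort-identification effect concrete, the share of patients missed by a narrow query can be computed from per-variant hit sets. The sketch below uses invented patient IDs and a hypothetical helper `missed_fraction`; the numbers are illustrative only and do not reproduce any result from the paper.

```python
def missed_fraction(hits_by_variant, searched):
    """Fraction of all matching patients missed when only `searched` variants are queried."""
    all_patients = set().union(*hits_by_variant.values())
    found = set().union(*(hits_by_variant[v] for v in searched))
    if not all_patients:
        return 0.0
    return len(all_patients - found) / len(all_patients)


# Hypothetical patient IDs matched by each lexical variant of one concept.
hits = {
    "type 2": {1, 2},
    "type II": {2, 3, 4},
    "type two": {5},
}

narrow = missed_fraction(hits, ["type 2"])
broad = missed_fraction(hits, ["type 2", "type II", "type two"])
print(narrow, broad)
```

Here, searching only for "type 2" finds two of the five hypothetical patients, while searching all three variants finds every one.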

CONCLUSIONS

Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.
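One way to take such variation into account is to expand a query into its common numeric forms before searching. The following is a rough sketch under stated assumptions: the helpers `to_roman` and `expand_variants` are hypothetical, cover only the integers 1 to 10, and handle just three of the variant types mentioned above (Arabic numerals, Roman numerals, and English number words).

```python
# Value/symbol pairs sufficient for Roman numerals from 1 to 10.
ROMAN = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]


def to_roman(n):
    """Convert an integer in the range 1..10 to a Roman numeral string."""
    if not 1 <= n <= 10:
        raise ValueError("this sketch only supports 1..10")
    parts = []
    for value, symbol in ROMAN:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)


def expand_variants(prefix, n):
    """Return Arabic, Roman, and spelled-out forms of `prefix` followed by `n`."""
    return [f"{prefix} {n}", f"{prefix} {to_roman(n)}", f"{prefix} {WORDS[n]}"]


variants = expand_variants("stage", 4)
print(variants)
```

Each expanded phrase could then be issued as a separate query and the results merged, at the cost of more false positives from ambiguous tokens.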

