Department of Pediatrics, University of Michigan, Ann Arbor, MI, 48109, USA.
School of Information, University of Michigan, Ann Arbor, MI, 48109, USA.
BMC Med Inform Decis Mak. 2019 Apr 4;19(Suppl 3):75. doi: 10.1186/s12911-019-0784-1.
Numbers and numerical concepts appear frequently in free-text clinical notes from electronic health records. Knowing the common lexical variations of these concepts, and identifying them accurately, is important for many information extraction tasks. This paper analyzes the variation in how numbers and numerical concepts are represented in clinical notes.
We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed.
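The kind of frequency lookup described above can be illustrated with a minimal sketch. It assumes a toy in-memory inverted index over five invented notes; the function names `build_index` and `phrase_doc_ids`, the tokenization rule, and the sample notes are all hypothetical and are not taken from the study, which used a much larger index of real notes.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(notes):
    """Map each lowercase token to the set of note ids that contain it."""
    index = defaultdict(set)
    for note_id, text in enumerate(notes):
        for token in TOKEN.findall(text.lower()):
            index[token].add(note_id)
    return index


def phrase_doc_ids(index, notes, phrase):
    """Return ids of notes containing the phrase as consecutive tokens."""
    tokens = TOKEN.findall(phrase.lower())
    if not tokens:
        return set()
    # Narrow to notes containing every token, then confirm adjacency.
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    pattern = re.compile(r"\b" + r"\W+".join(map(re.escape, tokens)) + r"\b")
    return {i for i in candidates if pattern.search(notes[i].lower())}


# Invented example notes (not real clinical data).
notes = [
    "Patient has type 2 diabetes mellitus.",
    "History of Type II diabetes, well controlled.",
    "Dx: type two diabetes.",
    "Type-2 DM, on metformin.",
    "Type 1 diabetes since childhood.",
]
index = build_index(notes)

# Document frequency of three lexical variants of the same concept.
variants = ["type 2", "type ii", "type two"]
counts = {v: len(phrase_doc_ids(index, notes, v)) for v in variants}
print(counts)
```

In this toy corpus the Arabic-numeral form alone finds only two of the four notes that mention the concept; the Roman-numeral and spelled-out forms account for the rest.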
We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that ignoring these variations could have substantial real-world implications for cohort identification tasks, with one case missing more than 80% of potential patients.
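To make the cohort-identification effect concrete, the following sketch computes the share of patients a single-variant query would miss. The patient ids and variant strings are invented for illustration and do not reproduce the study's figures; `missed_fraction` is a hypothetical helper.

```python
def missed_fraction(single_variant_hits, all_variant_hits):
    """Fraction of the full cohort not found by the single-variant query."""
    if not all_variant_hits:
        raise ValueError("the full cohort must contain at least one patient")
    missed = all_variant_hits - single_variant_hits
    return len(missed) / len(all_variant_hits)


# Invented patient ids matched by each lexical variant (assumption).
hits_by_variant = {
    "stage 4": {1, 2},
    "stage iv": {2, 3, 4, 5},
    "stage four": {6},
}

# The full cohort is the union of patients found by any variant.
all_hits = set().union(*hits_by_variant.values())
fraction = missed_fraction(hits_by_variant["stage 4"], all_hits)
print(f"{fraction:.0%} of patients missed by searching 'stage 4' alone")
```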
The way numbers are written in clinical notes is variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.