Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
BMC Bioinformatics. 2013 Jan 16;14:10. doi: 10.1186/1471-2105-14-10.
The increasing availability of Electronic Health Record (EHR) data, and of free-text patient notes in particular, presents opportunities for phenotype extraction. Text-mining methods can support disease modeling by mapping named-entity mentions to terminologies and by clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora from the biomedical literature. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter, so within a longitudinal patient record one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom holds that larger corpora yield better text-mining results; how does the observed EHR redundancy affect text mining? Does it introduce a bias that distorts learned models, or does it confer benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining?
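To make question (i) concrete, redundancy within a longitudinal record can be quantified, for example, as the fraction of a note's word n-grams that already appear in the patient's earlier notes. The following is a minimal sketch of this idea, not the paper's exact metric; the whitespace tokenizer and the choice of n = 3 are illustrative assumptions.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def redundancy_ratio(notes: List[str], n: int = 3) -> float:
    """Average fraction of each note's n-grams already seen in the
    patient's earlier notes (notes assumed in chronological order)."""
    seen: Set[Tuple[str, ...]] = set()
    ratios = []
    for note in notes:
        grams = ngrams(note.lower().split(), n)  # naive tokenization (assumption)
        if seen and grams:
            ratios.append(len(grams & seen) / len(grams))
        seen |= grams
    return sum(ratios) / len(ratios) if ratios else 0.0

# Toy longitudinal record: the second note copies and extends the first.
record = [
    "patient reports chest pain radiating to left arm",
    "patient reports chest pain radiating to left arm now with shortness of breath",
]
print(f"redundancy: {redundancy_ratio(record):.0%}")  # -> redundancy: 55%
```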
We analyze a large-scale EHR corpus and quantify redundancy in terms of both word and semantic-concept repetition. We observe redundancy levels of about 30% and non-standard distributions of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies for avoiding redundancy-induced bias: (i) a baseline strategy that keeps only the last note for each patient in the corpus; and (ii) removal of redundant notes with an efficient fingerprinting-based algorithm. For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results.
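The fingerprinting strategy in (ii) can be pictured as hashing each note's character k-gram shingles into a compact fingerprint set and discarding any note whose fingerprint overlaps heavily with that of a note already kept for the same patient. This is a hedged sketch of that general idea, not the paper's exact algorithm; the shingle length k = 8, the MD5 hash, the 0.8 Jaccard threshold, and the keep-first policy are all illustrative assumptions.

```python
import hashlib
from typing import List, Set

def fingerprint(text: str, k: int = 8) -> Set[int]:
    """Hash overlapping character k-gram shingles into a fingerprint set."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {
        int(hashlib.md5(text[i:i + k].encode()).hexdigest()[:8], 16)
        for i in range(max(len(text) - k + 1, 1))
    }

def jaccard(a: Set[int], b: Set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def drop_redundant(notes: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a note only if no previously kept note is a near-duplicate."""
    kept, prints = [], []
    for note in notes:
        fp = fingerprint(note)
        if all(jaccard(fp, p) < threshold for p in prints):
            kept.append(note)
            prints.append(fp)
    return kept

record = [
    "Assessment: stable angina. Plan: continue aspirin.",
    "Assessment: stable angina. Plan: continue aspirin.",  # pasted verbatim
    "Assessment: worsening angina. Plan: start beta-blocker.",
]
print(drop_redundant(record))  # the verbatim copy is filtered out
```

Unlike the keep-last-note baseline, an overlap-based filter of this kind retains non-redundant earlier notes, which is what lets text mining use more of the corpus while avoiding redundancy bias.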
Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning is well established for low-level text characteristics (e.g., encoding and spelling), high-level, difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage the available data in an EHR corpus while avoiding the bias introduced by redundancy.