Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records.

Author information

Duz Marco, Marshall John F, Parkin Tim

Affiliations

School of Veterinary Medicine and Science, University of Nottingham, Loughborough, United Kingdom.

School of Veterinary Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom.

Publication information

JMIR Med Inform. 2017 Jun 29;5(2):e17. doi: 10.2196/medinform.7123.

Abstract

BACKGROUND

Electronic medical records (EMRs) offer opportunities for clinical epidemiological research. With large EMR databases, automated analysis processes are necessary, but they require thorough validation before they can be used routinely.

OBJECTIVE

The aim of this study was to validate a computer-assisted technique using commercially available content analysis software (SimStat-WordStat v.6 (SS/WS), Provalis Research) for mining free-text EMRs.

METHODS

The dataset used for the validation process included lifelong EMRs from 335 patients (17,563 rows of data), selected at random from a larger dataset (141,543 patients, ~2.6 million rows of data) obtained from 10 equine veterinary practices in the United Kingdom. The ability of the computer-assisted technique to detect rows of data (cases) of colic, renal failure, right dorsal colitis, and non-steroidal anti-inflammatory drug (NSAID) use in the population was compared with manual classification. The first step of the computer-assisted analysis was to define inclusion dictionaries containing terms that identify a condition of interest. Words in the inclusion dictionaries were selected from the list of all words in the dataset, as generated in SS/WS. The second step was to define an exclusion dictionary containing combinations of words that remove cases erroneously classified by the inclusion dictionary alone. The third step was to define a reinclusion dictionary to reinclude cases that had been erroneously removed by the exclusion dictionary. Finally, cases identified by the exclusion dictionary were removed from cases identified by the inclusion dictionary, and cases from the reinclusion dictionary were subsequently reincluded using R v3.0.2 (R Foundation for Statistical Computing, Vienna, Austria). Manual analysis was performed as a separate process by a single experienced clinician, who read through the dataset once and classified each row of data based on interpretation of the free-text notes. Validation was performed by comparing the computer-assisted method with manual analysis, which was used as the gold standard. Sensitivity, specificity, negative predictive values (NPVs), positive predictive values (PPVs), and F values of the computer-assisted process were calculated against the manual classification.
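The three-dictionary logic described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (they used SS/WS dictionaries and R); the regular expressions, the example rows, and the function names `matches_any` and `find_cases` are all hypothetical. The sketch also assumes that reinclusion applies only to rows that the inclusion dictionary had already flagged, which the abstract does not state explicitly.

```python
import re

def matches_any(text, patterns):
    """Return True if any regex in `patterns` occurs in `text` (case-insensitive)."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

def find_cases(rows, inclusion, exclusion, reinclusion):
    """Return sorted row ids classified as cases.

    A row is a case if it matches the inclusion dictionary and is either
    not matched by the exclusion dictionary or rescued by the reinclusion
    dictionary (assumption: reinclusion only rescues included rows).
    """
    cases = []
    for row_id, text in rows.items():
        included = matches_any(text, inclusion)
        excluded = matches_any(text, exclusion)
        reincluded = matches_any(text, reinclusion)
        if included and (not excluded or reincluded):
            cases.append(row_id)
    return sorted(cases)

# Hypothetical free-text rows and dictionaries, for illustration only.
rows = {
    1: "Mild colic this morning, given flunixin",
    2: "Routine vaccination, no signs of colic",
    3: "No signs of colic today but colic surgery last year",
    4: "Lameness exam left fore",
}
inclusion = [r"\bcolic"]                # terms identifying the condition
exclusion = [r"\bno signs of colic"]    # word combinations that negate it
reinclusion = [r"\bcolic surgery"]      # combinations that rescue excluded rows

cases = find_cases(rows, inclusion, exclusion, reinclusion)
print(cases)  # [1, 3]
```

Row 2 is flagged by the inclusion term but removed by the exclusion phrase; row 3 is removed by the same phrase and then rescued by the reinclusion phrase.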

RESULTS

The lowest sensitivity, specificity, PPV, NPV, and F value were 99.82% (1128/1130), 99.88% (16,410/16,429), 94.6% (223/239), 100.00% (16,410/16,412), and 99.0% (100×2×0.983×0.998/[0.983+0.998]), respectively. The computer-assisted process required only a few seconds to run, although an estimated 30 h were required for dictionary creation. Manual classification required approximately 80 man-hours.
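As a worked illustration of how these measures are derived, the following Python sketch computes them from a 2×2 comparison against the gold standard and reproduces the F-value arithmetic quoted above. It assumes the F value is the harmonic mean of PPV and sensitivity, which matches the form of the expression in the abstract; the abstract does not say which of 0.983 and 0.998 is which, but the formula is symmetric, so the result is the same either way. The function names and the example counts are hypothetical.

```python
def f_value(ppv, sensitivity):
    """Harmonic mean of PPV and sensitivity (assumed definition of the F value)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)

def validation_metrics(tp, fp, fn, tn):
    """Compute validation measures from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)   # true cases that were detected
    specificity = tn / (tn + fp)   # true non-cases correctly rejected
    ppv = tp / (tp + fp)           # detected cases that are true cases
    npv = tn / (tn + fn)           # rejected rows that are true non-cases
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_value": f_value(ppv, sensitivity),
    }

# Hypothetical counts, for illustration only (not taken from the study).
example = validation_metrics(tp=90, fp=10, fn=5, tn=895)

# The arithmetic quoted in the abstract for the lowest F value.
lowest_f = 100 * f_value(0.983, 0.998)
print(round(lowest_f, 1))  # 99.0
```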

CONCLUSIONS

The critical step in this work is the creation of accurate and inclusive dictionaries to ensure that no potential cases are missed. It is significantly easier to remove false-positive terms from an SS/WS-selected subset of a large database than to search the original database for potential false negatives. The benefits of using this method are proportional to the size of the dataset to be analyzed.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51fb/5509949/31c60e5dc94c/medinform_v5i2e17_fig1.jpg
