Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd, Laurel, MD 20723-6099, USA.
BMC Med Inform Decis Mak. 2010 Oct 14;10:59. doi: 10.1186/1472-6947-10-59.
New algorithms for disease outbreak detection are being developed to take advantage of full electronic medical records (EMRs), which contain a wealth of patient information. However, due to privacy concerns, even anonymized EMRs cannot be shared among researchers, which makes it very difficult to compare the effectiveness of these algorithms. To bridge the gap between novel bio-surveillance algorithms that operate on full EMRs and the lack of non-identifiable EMR data, a method for generating complete synthetic EMRs was developed.
This paper describes a novel methodology for generating complete synthetic EMRs both for an outbreak illness of interest (tularemia) and for background records. The method developed has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population.
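The three steps can be sketched in code. The following is a minimal illustration only, not the authors' implementation: the record layout (a `syndrome` label plus a list of `(day_offset, kind, detail)` events), the function names, the uniform random choice of a care pattern, and the two-week onset window are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SyntheticPatient:
    patient_id: str
    age: int
    sex: str
    syndrome: str
    events: list = field(default_factory=list)  # (day, kind, detail) tuples


def generate_patients(n, syndrome, age_range, rng):
    """Step 1: create synthetic identities and basic information."""
    return [
        SyntheticPatient(
            patient_id=f"SYN-{i:05d}",
            age=rng.randint(*age_range),  # inclusive on both ends
            sex=rng.choice(["F", "M"]),
            syndrome=syndrome,
        )
        for i in range(n)
    ]


def find_care_patterns(real_records, syndrome):
    """Step 2: collect care patterns seen in real records for a similar problem.

    Each real record is assumed to be a dict with a 'syndrome' label and a
    list of (day_offset, kind, detail) events relative to the first visit.
    """
    return [r["events"] for r in real_records if r["syndrome"] == syndrome]


def adapt_pattern(patient, pattern, onset_day):
    """Step 3: re-date a care pattern onto one synthetic patient."""
    patient.events = [
        (onset_day + offset, kind, detail) for offset, kind, detail in pattern
    ]
    return patient


def build_synthetic_emrs(real_records, n, syndrome, age_range=(4, 11), seed=0):
    """Run the three steps end to end with a seeded random generator."""
    rng = random.Random(seed)
    patterns = find_care_patterns(real_records, syndrome)
    if not patterns:
        raise ValueError(f"no care patterns found for syndrome {syndrome!r}")
    patients = generate_patients(n, syndrome, age_range, rng)
    for patient in patients:
        # Assumption: pick a pattern uniformly and an onset day within two weeks.
        adapt_pattern(patient, rng.choice(patterns), onset_day=rng.randint(0, 13))
    return patients
```

In this sketch, only the care patterns (event types and relative timing) are carried over from the real data; identities and dates are generated independently, which reflects the intent of producing records that are not identifiable.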
We generated EMRs, including visit records, clinical activity, laboratory orders/results and radiology orders/results, for 203 synthetic tularemia outbreak patients. Validation of the records by a medical expert revealed problems in 19% of the records; these were subsequently corrected. We also generated background EMRs for over 3000 patients in the 4-11 year age group. Validation of those records by a medical expert revealed problems in fewer than 3% of the background patient EMRs, and these errors were also subsequently corrected.
A data-driven method was developed for generating fully synthetic EMRs. The method is general and can be applied to any data set that has similar data elements (such as laboratory and radiology orders and results, clinical activity, and prescription orders). The pilot synthetic outbreak records were for tularemia, but our approach may be adapted to other infectious diseases. The pilot synthetic background records were for the 4-11 year age group. The adaptations that must be made to the algorithms to produce synthetic background EMRs for other age groups are indicated.