Daymont Carrie, Ross Michelle E, Localio A Russell, Fiks Alexander G, Wasserman Richard C, Grundmeier Robert W
Departments of Pediatrics and Public Health Sciences, Penn State College of Medicine, Hershey, PA, USA.
Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
J Am Med Inform Assoc. 2017 Nov 1;24(6):1080-1087. doi: 10.1093/jamia/ocx037.
Large electronic health record (EHR) datasets are increasingly used to facilitate research on growth, but measurement and recording errors can lead to biased results. We developed and tested an automated method for identifying implausible values in pediatric EHR growth data.
Using deidentified data from 46 primary care sites, we developed an algorithm to identify weight and height values that should be excluded from analysis, including implausible values and values that were recorded repeatedly without remeasurement. The foundation of the algorithm is a comparison of each measurement, expressed as a standard deviation score, with a weighted moving average of a child's other measurements. We evaluated the performance of the algorithm by (1) comparing its results with the judgment of physician reviewers for a stratified random selection of 400 measurements and (2) evaluating its accuracy in a dataset with simulated errors.
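The core comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the published algorithm: the weighting function (an offset plus the age gap in days, raised to a negative power), the cutoff of 3 SD units, the one-at-a-time exclusion loop, and the rule that at least three values must remain are all assumptions made for this example, as are the function names.

```python
from typing import List, Sequence


def weighted_average_of_others(i: int, kept: Sequence[int],
                               ages_days: Sequence[float],
                               sd_scores: Sequence[float],
                               offset: float = 5.0,
                               exponent: float = -1.5) -> float:
    """Average the child's other SD scores, weighting nearby ages more.

    The weight (offset + age gap) ** exponent is an assumed form; the
    offset keeps the weight finite for same-day measurements.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for j in kept:
        if j == i:
            continue  # a measurement is never compared with itself
        weight = (offset + abs(ages_days[i] - ages_days[j])) ** exponent
        weighted_sum += weight * sd_scores[j]
        weight_total += weight
    return weighted_sum / weight_total


def flag_implausible(ages_days: Sequence[float],
                     sd_scores: Sequence[float],
                     cutoff: float = 3.0) -> List[bool]:
    """Flag values whose SD score is far from the weighted average.

    Only the single most deviant value is excluded per pass, so one bad
    value does not drag its plausible neighbours over the cutoff.
    """
    excluded = [False] * len(sd_scores)
    while True:
        kept = [i for i, is_out in enumerate(excluded) if not is_out]
        if len(kept) < 3:
            break  # assumption: too few values to tell which one is wrong
        worst_index, worst_deviation = None, cutoff
        for i in kept:
            average = weighted_average_of_others(i, kept, ages_days, sd_scores)
            deviation = abs(sd_scores[i] - average)
            if deviation > worst_deviation:
                worst_index, worst_deviation = i, deviation
        if worst_index is None:
            break  # nothing left above the cutoff
        excluded[worst_index] = True
    return excluded


# Four yearly weight SD scores for one hypothetical child; the third is
# the kind of jump a unit or decimal error might produce.
flags = flag_implausible([365, 730, 1095, 1460], [0.2, 0.4, 6.5, 0.3])
print(flags)  # -> [False, False, True, False]
```

Excluding one value per pass matters in this toy case: a single comparison against all other values would also flag the last measurement, because the erroneous neighbour pulls its weighted average upward.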
Of 2,000,595 growth measurements from 280,610 patients 1 to 21 years old, 3.8% of weight and 4.5% of height values were identified as implausible or excluded for other reasons. The proportion excluded varied widely by primary care site. The automated method had a sensitivity of 97% (95% confidence interval [CI], 94-99%) and a specificity of 90% (95% CI, 85-94%) for identifying implausible values compared to physician judgment, and identified 95% (weight) and 98% (height) of simulated errors.
This automated, flexible, and validated method for preparing large datasets will facilitate the use of pediatric EHR growth datasets for research.