Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, Via Loredan, 18, 35131 Padova, Italy.
Division of Pediatric Infectious Diseases, Department of Women's and Children's Health, University of Padova, 35131 Padova, Italy.
Int J Environ Res Public Health. 2022 May 13;19(10):5959. doi: 10.3390/ijerph19105959.
Estimating the burden of infectious diseases is crucial for both epidemiological surveillance and prompt public health response. A variety of data, including textual sources, can be exploited for this purpose. Dealing with unstructured data requires methods for automatic, data-driven variable construction, and machine learning techniques (MLT) show promising results. In this framework, varicella-zoster virus (VZV) infection was chosen as the target for automatic case identification with MLT. Pedianet, an Italian pediatric primary care database, was used to train a series of models that identify, from free-text fields, whether a child was diagnosed with VZV infection between 2004 and 2014 in the Veneto region. Given the nature of the task, a recurrent neural network (RNN) with bidirectional gated recurrent units (GRUs) was chosen; the same models were then used to predict the children's status for the following years. A gold standard produced by manual extraction for the same interval was available for comparison. RNN-GRU improved its performance over time, reaching a maximum area under the ROC curve (AUC-ROC) of 95.30% at the end of the period. The absolute bias in estimates of VZV infection was below 1.5% in the last five years analyzed. The findings of this study could support the large-scale use of electronic health records (EHRs) for clinical outcome predictive modeling and help establish high-performance systems in other medical domains.
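The two evaluation quantities named in the abstract, AUC-ROC and the absolute bias of the estimated VZV infection frequency against the gold standard, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the 0.5 classification threshold, and the toy labels and scores are assumptions made for illustration, and the abstract does not define absolute bias, so it is taken here to be the absolute difference between the predicted and the gold-standard proportion of cases.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def absolute_bias(labels: Sequence[int], scores: Sequence[float],
                  threshold: float = 0.5) -> float:
    """Absolute difference between the predicted and the gold-standard
    proportion of cases. The threshold is an assumed value."""
    predicted_rate = sum(s >= threshold for s in scores) / len(scores)
    gold_rate = sum(labels) / len(labels)
    return abs(predicted_rate - gold_rate)


# Made-up toy data: 1 = VZV case in the gold standard, 0 = non-case.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
# Made-up model scores, standing in for classifier output probabilities.
toy_scores = [0.9, 0.2, 0.7, 0.6, 0.1, 0.4, 0.3, 0.05]

print(f"AUC-ROC: {auc_roc(toy_labels, toy_scores):.4f}")
print(f"Absolute bias: {absolute_bias(toy_labels, toy_scores):.4f}")
```

The pairwise AUC computation is quadratic in the number of records and is meant only to make the definition explicit; on data of realistic size, a rank-based implementation from an established library would be the usual choice.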