School of Public Health and Community Medicine, UNSW Sydney, Australia; School of Electrical and Data Engineering, Faculty of Electrical and Information Technology, University of Technology Sydney, Australia.
School of Public Health and Community Medicine, UNSW Sydney, Australia; WHO Collaborating Centre for eHealth, UNSW Sydney, Australia.
J Biomed Inform. 2019 Jul;95:103220. doi: 10.1016/j.jbi.2019.103220. Epub 2019 May 31.
Identifying unique patients across multiple care facilities or services is a major challenge in providing continuous care and undertaking health research. Identifying and linking patients without compromising privacy and security is an emerging issue in the big data era. The large quantity and complexity of patient data underscore the need for linkage methods that are both scalable and accurate. In this study, we aim to develop and evaluate an ensemble classification method that uses three of the most commonly used supervised learning methods, namely support vector machines, logistic regression and standard feed-forward neural networks, to link records that belong to the same patient across multiple service locations. Our ensemble method combines bagging and stacking. The critical hyperparameters of each base learner were selected through grid search. Two synthetic datasets were used in this study, namely FEBRL and ePBRN. The ePBRN linkage dataset was based on linkage errors observed in the Australian primary care setting. Overall linkage performance was determined by assessing both blocking performance and classification performance. Our ensemble method outperformed the base learners in all evaluation metrics on one dataset. More specifically, precision (for the base learners, the average of their individual precision scores) increased from 90.70% to 94.85% in FEBRL, and from 62.17% to 99.28% in ePBRN. Similarly, the F-score increased from 94.92% to 98.18% in FEBRL, and from 72.99% to 91.72% in ePBRN. Our experiments suggest that employing ensemble strategies can significantly improve the linkage performance of individual algorithms.
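As a rough illustration of the pipeline described above, the following sketch tunes each of the three base learners with grid search, wraps each tuned learner in a bagging ensemble, and stacks the bagged learners under a meta-classifier. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses randomly generated stand-in data rather than FEBRL or ePBRN, and the hyperparameter grids, ensemble sizes, meta-learner and the bagging-then-stacking order are all hypothetical choices that the abstract does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in data (assumption): each row plays the role of one candidate record
# pair described by field-similarity scores; label 1 means "same patient".
# This is NOT the FEBRL or ePBRN data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The three base learners named in the abstract, each with a deliberately
# tiny, hypothetical hyperparameter grid.
base_learners = {
    "svm": (SVC(kernel="linear"), {"C": [0.1, 1.0]}),
    "lr": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "nn": (MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                         random_state=0), {"alpha": [1e-4, 1e-2]}),
}

bagged = []
for name, (model, grid) in base_learners.items():
    # Grid search: pick the best hyperparameters by cross-validated F-score.
    search = GridSearchCV(model, grid, scoring="f1", cv=3)
    search.fit(X_train, y_train)
    # Bagging: retrain copies of the tuned learner on bootstrap samples.
    bagged.append((name, BaggingClassifier(search.best_estimator_,
                                           n_estimators=5, random_state=0)))

# Stacking: a meta-classifier learns how to combine the bagged learners'
# cross-validated predictions (logistic regression is an assumed choice).
ensemble = StackingClassifier(estimators=bagged,
                              final_estimator=LogisticRegression(), cv=3)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
precision = precision_score(y_test, y_pred)
f_score = f1_score(y_test, y_pred)
print(f"precision={precision:.4f}  F-score={f_score:.4f}")
```

The scores printed here describe only the stand-in data and are not comparable to the figures reported above. In a real linkage setting the feature matrix would come from comparing the record pairs that survive blocking.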