Brown Adrian Paul, Randall Sean M
Centre for Data Linkage, Curtin University, Bentley, Australia.
JMIR Med Inform. 2020 Sep 23;8(9):e18920. doi: 10.2196/18920.
Linking administrative data across agencies makes it possible to investigate many health and social issues, with the potential to deliver significant public benefit. Despite the advantages of cloud computing, its use for linkage purposes is rare, as data custodians assess the storage of identifiable information on cloud infrastructure as high risk.
This study aims to present a model for record linkage that utilizes cloud computing capabilities while assuring custodians that identifiable data sets remain secure and local.
A new hybrid cloud model was developed, including privacy-preserving record linkage techniques and container-based batch processing. An evaluation of this model was conducted with a prototype implementation using large synthetic data sets representative of administrative health data.
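As a rough illustration of what batch processing of a linkage job can look like, the sketch below splits the full comparison space between two data sets into independent index ranges, each of which could in principle be handed to a separate container job. The function name `make_batches`, the `chunk` parameter, and the tiling strategy are hypothetical and are not taken from the study's prototype.

```python
def make_batches(n_a, n_b, chunk):
    """Yield (a_start, a_end, b_start, b_end) index ranges.

    Hypothetical helper: tiles the n_a x n_b comparison space into
    rectangular blocks of at most chunk x chunk record pairs, so each
    block can be compared independently of the others.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    for a_start in range(0, n_a, chunk):
        a_end = min(a_start + chunk, n_a)
        for b_start in range(0, n_b, chunk):
            b_end = min(b_start + chunk, n_b)
            yield (a_start, a_end, b_start, b_end)


if __name__ == "__main__":
    # Each tuple describes one unit of work for a batch job.
    for batch in make_batches(n_a=5, n_b=3, chunk=2):
        print(batch)
```

Because the blocks do not overlap and together cover every record pair, the per-block results can simply be concatenated once all jobs finish.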
The cloud model keeps identifiers on premises and uses privacy-preserved identifiers to run all linkage computations on cloud infrastructure. Our prototype used a managed container cluster in Amazon Web Services to distribute the computation using existing linkage software. Although the cost of computation was relatively low, the use of existing software resulted in a processing overhead of 35.7% (149 of 417 min of execution time).
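The abstract does not say which privacy-preserving encoding was used. One widely used approach is to encode each identifier on premises as a keyed Bloom filter of character bigrams, then compare only the encodings in the cloud using the Dice coefficient. The following is a minimal sketch under that assumption; the function names, the filter size, the number of hash functions, and the secret key are all illustrative choices, not details from the study.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of padded character bigrams of a normalized string."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=1024, num_hashes=4):
    """Encode a string as the set of bit positions set in a Bloom filter.

    Assumed scheme: each bigram is hashed num_hashes times with
    HMAC-SHA256 under a secret key held only by the data custodians.
    """
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two encodings (1.0 means identical bit sets)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"example-secret-key"  # illustrative only; never hard-code real keys
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    jones = bloom_encode("Jones", key)
    print(f"Smith vs Smyth: {dice(smith, smyth):.2f}")
    print(f"Smith vs Jones: {dice(smith, jones):.2f}")
```

In a hybrid arrangement of this kind, only the bit positions would leave the custodian's environment, while the raw names and the key stay local. Similar names still score higher than dissimilar ones, which is what allows approximate matching on the encoded values.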
The results of our experimental evaluation show the operational feasibility of such a model and point to exciting opportunities for advancing the analysis of linkage outputs.