Computer Science and Engineering Department, University of Connecticut, Storrs, CT, 06269, USA.
Department of Statistics, University of Connecticut, Storrs, CT, 06269, USA.
Sci Rep. 2023 Feb 25;13(1):3292. doi: 10.1038/s41598-023-29132-8.
Recent advances in technology have led to an explosion of data in virtually all domains of our lives. Modern biomedical devices can acquire a large number of physical readings from patients, and these readings are often stored as time series data. Such data can form the basis for important research to advance healthcare and well-being. Because of considerations such as data size and patient privacy, the original, full data may not be available to secondary parties or researchers; instead, suppose that only a subset of the data is made available. A fast and reliable record linkage algorithm makes it possible to accurately match patient records in the original and subset databases while maintaining privacy. The problem of record linkage when the attributes include time series has received little attention in the literature. This paper makes two main contributions. First, we propose a novel, efficient, and scalable record linkage algorithm for time series data that is 400× faster than previous work. Second, we introduce a privacy-preserving framework that enables health institutions to safely release their raw time series records to researchers with the bare minimum of identifying information.