Computer & Mathematical Sciences, University of Toronto, Toronto, ON, Canada.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.
J Med Internet Res. 2020 Nov 3;22(11):e18735. doi: 10.2196/18735.
Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks.
This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed.
We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques.
In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches, which guaranteed the exact answer.
Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.
在过去的十年中,出现了几个大型的联合临床数据网络,使研究人员能够访问数十个医疗机构的数百万患者的数据。通常,查询会广播到网络中的每个站点,然后返回匹配患者数量的汇总计数。但是,由于患者可以在网络中的多个站点接受治疗,简单地相加数字经常会重复计算患者。已经提出了各种方法,例如使用可信第三方或安全多方计算,来在站点之间链接患者记录。但是,它们要么在准确性和隐私性方面存在很大的权衡,要么无法扩展到大型网络。
本研究旨在实现对符合联合查询的患者数量进行准确估计,同时对所揭示的受保护医疗信息数量提供强有力的保证。
我们引入了一种新的概率方法来运行联合网络查询。它结合了一种称为 HyperLogLog 的算法和以哈希、掩码和同态加密形式的混淆。它是可调整的,即它允许网络在准确性与隐私性之间进行平衡,并且即使对于大型网络,它的计算效率也很高。我们构建了一个用户友好的免费开源基准测试平台,以模拟大型医院网络中的联合查询。使用该平台,我们将我们的算法与几种现有技术的准确性、k-匿名隐私风险(k=10)和计算运行时进行了比较。
在模拟的查询中,在一个 100 家医院的网络中匹配 1 到 1000 万患者,我们的方法在保持 k-匿名性的同时,比添加汇总计数要准确得多。平均而言,总共需要向网络中心发送 12 千字节的数据,并且只需要增加 5 毫秒的联邦查询总运行时间。这比其他保证准确答案的方法要好几个数量级。
使用我们的方法,可以运行对临床数据存储库进行高度准确的联合查询,既能保护患者隐私,又能扩展到大型网络。