Sadilek Adam, Liu Luyang, Nguyen Dung, Kamruzzaman Methun, Serghiou Stylianos, Rader Benjamin, Ingerman Alex, Mellem Stefan, Kairouz Peter, Nsoesie Elaine O, MacFarlane Jamie, Vullikanti Anil, Marathe Madhav, Eastham Paul, Brownstein John S, Agüera y Arcas Blaise, Howell Michael D, Hernandez John
Google, Mountain View, CA, USA.
Biocomplexity Institute, University of Virginia, Charlottesville, VA, USA.
NPJ Digit Med. 2021 Sep 7;4(1):132. doi: 10.1038/s41746-021-00489-2.
Privacy protection is paramount in conducting health research. However, studies often rely on data stored in a centralized repository, where analysis is done with full access to the sensitive underlying content. Recent advances in federated learning enable building complex machine-learned models that are trained in a distributed fashion. These techniques facilitate the calculation of research study endpoints such that private data never leaves a given device or healthcare system. We show, on a diverse set of single-site and multi-site health studies, that federated models can achieve similar accuracy, precision, and generalizability, and lead to the same interpretation as standard centralized statistical models, while achieving considerably stronger privacy protections and without significantly raising computational costs. This work is the first to apply modern and general federated learning methods that explicitly incorporate differential privacy to clinical and epidemiological research, across a spectrum of units of federation, model architectures, complexity of learning tasks, and diseases. As a result, it enables health research participants to remain in control of their data and still contribute to advancing science, aspects that used to be at odds with each other.
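As an illustration of the general idea only, and not the implementation used in the study, the sketch below shows one common way federated averaging can be combined with differential-privacy-style clipping and Gaussian noise. Each site fits a toy linear model on its own data and shares only a clipped weight update; the server adds noise to the sum of those updates before averaging. All names (`make_site`, `local_update`, `clip_update`, `federated_round`), the toy data, and the parameter values are assumptions introduced for this example, and no privacy accounting (for example, computing an epsilon) is performed.

```python
import math
import random


def make_site(xs):
    """Hypothetical toy data for one site: points on the line y = 2x + 1."""
    return [(x, 2.0 * x + 1.0) for x in xs]


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one site's private data.

    Only the change in weights is returned; the raw data stays at the site.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(((w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w - weights[0], b - weights[1])


def clip_update(delta, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(d * d for d in delta))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return tuple(d * scale for d in delta)


def federated_round(weights, sites, max_norm=1.0, noise_multiplier=0.0, rng=random):
    """One round: collect clipped site updates, add Gaussian noise, average."""
    clipped = [clip_update(local_update(weights, data), max_norm) for data in sites]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    new_weights = []
    for i, current in enumerate(weights):
        noisy_sum = sum(delta[i] for delta in clipped) + rng.gauss(0.0, sigma)
        new_weights.append(current + noisy_sum / len(clipped))
    return tuple(new_weights)


# Three hypothetical sites, each holding its own small dataset.
sites = [make_site(xs) for xs in ([0.0, 1.0, 2.0], [0.5, 1.5], [1.0, 3.0])]
weights = (0.0, 0.0)
for _ in range(300):
    weights = federated_round(weights, sites, noise_multiplier=0.0)
```

With `noise_multiplier=0.0` the loop reduces to plain federated averaging with clipping. Raising it trades accuracy for stronger privacy, and in practice the noise scale would be chosen together with a privacy accountant rather than by hand.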