Georgia State University, Atlanta, GA, USA.
Neuroinformatics. 2022 Jan;20(1):91-108. doi: 10.1007/s12021-021-09525-8. Epub 2021 May 4.
The field of neuroimaging can benefit greatly from machine learning models that detect and predict diseases and discover novel biomarkers, but much of the data collected at various organizations and research centers cannot be shared due to privacy or regulatory concerns (especially for clinical data or rare disorders). In addition, aggregating data across multiple large studies results in a large amount of duplicated technical debt, and the resources required can be challenging or impossible for an individual site to build. Training on data distributed across organizations can result in models that generalize much better than models trained on data from any one organization alone. While there are approaches for decentralized sharing, these often do not provide the highest possible guarantees of sample privacy that only cryptography can provide. In addition, such approaches are often focused on probabilistic solutions. In this paper, we propose an approach that leverages the potential of datasets spread among a number of data-collecting organizations by performing joint analyses in a secure and deterministic manner in which only encrypted data is shared and manipulated. The approach is based on secure multiparty computation, which refers to cryptographic protocols that enable distributed computation of a function over distributed inputs without revealing additional information about the inputs. It enables multiple organizations to train machine learning models on their joint data and apply the trained models to encrypted data without revealing their sensitive data to the other parties. In our proposed approach, organizations (or sites) securely collaborate to build a machine learning model as if it had been trained on the aggregated data of all the organizations combined. Importantly, the approach does not require a trusted party (i.e., an aggregator), each contributing site plays an equal role in the process, and no site can learn the individual data of any other site. We demonstrate the effectiveness of the proposed approach in a range of empirical evaluations using different machine learning algorithms, including logistic regression and convolutional neural network models, on human structural and functional magnetic resonance imaging datasets.
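The abstract does not say which protocol the authors use. As a rough illustration of the general idea behind secure multiparty computation, the sketch below uses additive secret sharing, a common building block, to let several sites compute the sum of their private integers with no trusted aggregator. The function names, the modulus `PRIME`, and the example values are all hypothetical and are not taken from the paper. The sketch assumes honest parties and non-negative integer inputs smaller than the modulus.

```python
import secrets

# Illustrative modulus for share arithmetic (an assumption, not from the paper).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values):
    """Simulate every site sharing its value and jointly recovering the total."""
    n = len(private_values)
    # distributed[i][j] is the share that site i sends to site j.
    distributed = [share(v, n) for v in private_values]
    # Each site j adds up the shares it received; each share alone looks random.
    partial = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Combining the partial sums reveals only the total, not any single input.
    return sum(partial) % PRIME


site_values = [12, 7, 30]  # hypothetical per-site private values
total = secure_sum(site_values)
```

Every site here performs the same steps, which mirrors the abstract's point that each contributing site plays an equal role. Training a full model securely requires many more operations (secure multiplication, fixed-point encoding of real numbers) that this sketch leaves out.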