Tandon School of Engineering, New York University, Brooklyn, NY, 11201, USA.
Center for Cyber Security, New York University Abu Dhabi, Abu Dhabi, 129188, UAE.
Sci Rep. 2023 Jan 30;13(1):1661. doi: 10.1038/s41598-023-28481-8.
Cancer genomics tailors diagnosis and treatment to an individual's genetic information and is the crux of precision medicine. However, analyzing and maintaining a high volume of genetic mutation data to build a machine learning (ML) model that predicts cancer type is computationally expensive. The task is often outsourced to powerful cloud servers, which raises critical privacy concerns for patients' data. Homomorphic encryption (HE) enables computation on encrypted data and thus provides cryptographic guarantees of privacy, but the restrictive overhead of encrypted computation deters its use. In this work, we explore the challenges of privacy-preserving cancer type prediction using a dataset of more than 2 million genetic mutations from 2713 patients across several cancer types. We first build a highly accurate ML model and then implement its privacy-preserving version in HE. Our solution for cancer type inference encodes somatic mutations into the feature space based on their impact on the cancer genomes and then uses statistical tests for feature selection. We propose a fast matrix multiplication algorithm for the HE-based model. Our final model achieves a micro-average area under the curve of 0.98, improves accuracy from 70.08% to 83.61%, and is 550 times faster than privacy-preserving models based on standard matrix multiplication. Our tool can be found at https://github.com/momalab/octal-candet.
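The encoding and feature-selection steps described in the abstract can be illustrated with a minimal plaintext sketch. It is not the authors' implementation: the toy records, the impact labels (`HIGH`, `MODERATE`, `LOW`), the class names, and the helpers `encode`, `chi2_score`, and `select_top_k` are all hypothetical, and the Pearson chi-squared statistic is assumed here only as one example of a statistical test, since the abstract does not name the test used.

```python
from collections import defaultdict

# Hypothetical toy records: (patient_id, gene, impact_level, cancer_type).
# All values are illustrative and are not taken from the paper's dataset.
MUTATIONS = [
    ("p1", "TP53", "HIGH", "type_A"), ("p1", "TTN", "LOW", "type_A"),
    ("p2", "TP53", "HIGH", "type_A"), ("p2", "TTN", "LOW", "type_A"),
    ("p3", "KRAS", "MODERATE", "type_B"), ("p3", "TTN", "LOW", "type_B"),
    ("p4", "KRAS", "MODERATE", "type_B"), ("p4", "TTN", "LOW", "type_B"),
]


def encode(mutations):
    """Map each patient to the set of (gene, impact) features they carry."""
    features = defaultdict(set)
    labels = {}
    for patient, gene, impact, cancer_type in mutations:
        features[patient].add((gene, impact))
        labels[patient] = cancer_type
    return features, labels


def chi2_score(feature, features, labels):
    """Pearson chi-squared statistic: feature presence/absence vs. class."""
    classes = sorted(set(labels.values()))
    n = len(labels)
    # Contingency table of observed counts.
    observed = {(present, c): 0 for present in (True, False) for c in classes}
    for patient, c in labels.items():
        observed[(feature in features[patient], c)] += 1
    score = 0.0
    for present in (True, False):
        row_total = sum(observed[(present, c)] for c in classes)
        for c in classes:
            col_total = sum(observed[(p, c)] for p in (True, False))
            expected = row_total * col_total / n
            if expected > 0:  # skip empty cells to avoid division by zero
                score += (observed[(present, c)] - expected) ** 2 / expected
    return score


def select_top_k(features, labels, k):
    """Keep the k features most strongly associated with the class label."""
    candidates = sorted({f for fs in features.values() for f in fs})
    ranked = sorted(candidates,
                    key=lambda f: chi2_score(f, features, labels),
                    reverse=True)
    return ranked[:k]


features, labels = encode(MUTATIONS)
selected = select_top_k(features, labels, k=2)
```

In this toy data the `TTN` feature appears in every patient, so it carries no class information and is dropped, while the two class-specific features are kept. Reducing the feature count in this way also shrinks the matrices that an encrypted model would have to multiply.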