Personal Health Information Inference Using Machine Learning on RNA Expression Data from Patients With Cancer: Algorithm Validation Study.

Affiliations

Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea.

Department of Medical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.

Publication information

J Med Internet Res. 2020 Aug 10;22(8):e18387. doi: 10.2196/18387.

Abstract

BACKGROUND

As the need to share genomic data grows, so do privacy concerns, such as the ethics of data sharing and the disclosure of personal information.

OBJECTIVE

The main purpose of this study was to verify whether genomic data is sufficient to predict a patient's personal information.

METHODS

RNA expression data and matched patient personal information were collected from 9538 patients in The Cancer Genome Atlas program. Five personal information variables (age, gender, race, cancer type, and cancer stage) were recorded for each patient. Four different machine learning algorithms (support vector machine, decision tree, random forest, and artificial neural network) were used to determine whether a patient's personal information could be accurately predicted from RNA expression data. Model performance was measured by accuracy and the area under the receiver operating characteristic curve. We selected five cancer types with large sample sizes (breast carcinoma, kidney renal clear cell carcinoma, head and neck squamous cell carcinoma, low-grade glioma, and lung adenocarcinoma) to verify whether predictive accuracy would differ between them. We also validated the efficacy of our four machine learning models in analyzing normal samples from 593 cancer patients.
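The two performance measures named above, accuracy and the area under the receiver operating characteristic curve (AUC), can be illustrated with a minimal sketch. This is not the study's code: the function names `accuracy` and `roc_auc`, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the AUC is computed in its pairwise-ranking form for a binary target only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC: probability that a randomly chosen positive sample is
    scored above a randomly chosen negative one (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): binary labels and model scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are often reported together.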

RESULTS

In most samples, personal information with high genetic relevance, such as gender and cancer type, could be predicted from RNA expression data alone. The prediction accuracies for gender and cancer type, the best-performing models, were 0.93-0.99 and 0.78-0.94, respectively. Other aspects of personal information, such as age, race, and cancer stage, were difficult to predict from RNA expression data, with accuracies ranging from 0.0026 to 0.29, 0.76 to 0.96, and 0.45 to 0.79, respectively. Among the tested machine learning methods, the highest predictive accuracy was obtained using the support vector machine algorithm (mean accuracy 0.77), while the lowest accuracy was obtained using the random forest method (mean accuracy 0.65). Gender and race were predicted more accurately than other variables in the samples. On average, for the five cancer types, the accuracy of cancer stage prediction ranged from 0.67 to 0.71, while the age prediction accuracy ranged from 0.18 to 0.23.

CONCLUSIONS

We attempted to predict patient information using RNA expression data. We found that some identifiers could be predicted, but most others could not. This study showed that personal information available from RNA expression data is limited and this information cannot be used to identify specific patients.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aae/7445622/1b7007b29c8e/jmir_v22i8e18387_fig1.jpg
