Liao Wei, Voldman Joel
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America.
PLOS Digit Health. 2024 Oct 21;3(10):e0000640. doi: 10.1371/journal.pdig.0000640. eCollection 2024 Oct.
Recent work in machine learning for healthcare has raised concerns about patient privacy and algorithmic fairness. Previous work has shown that self-reported race can be predicted from medical data that does not explicitly contain racial information. However, the extent to which medical data reveal such information is unknown, and we lack ways to develop models whose outcomes are minimally affected by it. Here we systematically investigated the ability of time-series electronic health record data to predict patient static information. We found that both the raw time-series data and the representations learned by machine learning models can be used to predict a variety of static information, with area under the receiver operating characteristic curve (AUROC) as high as 0.851 for biological sex, 0.869 for binarized age, and 0.810 for self-reported race. This high predictive performance extends to various comorbidity factors and persists even when the models were trained for different tasks, on different cohorts, with different model architectures, and on different databases. Given the privacy and fairness concerns these findings raise, we developed a variational autoencoder-based approach that learns a structured latent space to disentangle patient-sensitive attributes from time-series data. Our work thoroughly investigates the ability of machine learning models to encode patient static information from time-series electronic health records and introduces a general approach to protect patient-sensitive information for downstream tasks.
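The results above are reported as area under the receiver operating characteristic curve. As a minimal sketch of what that metric measures, and not the authors' evaluation code, the function below computes it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auroc` and the example labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores from a classifier that predicts a binary static attribute.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
print(auroc(labels, scores))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the 0.851, 0.869, and 0.810 figures should be read. The pairwise loop is quadratic in the number of examples; a rank-based formulation would be preferable for large cohorts.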