Liu Dianbo, Fox Kathe, Weber Griffin, Miller Tim
Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States; Department of Pediatrics, Harvard Medical School, Boston, MA, United States; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States; Computer Science & Artificial Intelligence Laboratory, MIT, Cambridge, MA, United States.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States; Aetna, CVS Health, Boston, MA, United States.
J Biomed Inform. 2022 Oct;134:104151. doi: 10.1016/j.jbi.2022.104151. Epub 2022 Jul 22.
A patient's health information is generally fragmented across silos because it follows how care is delivered: multiple providers in multiple settings. Though it is technically feasible to reunite data for analysis in a manner that underpins a rapid learning healthcare system, privacy concerns and regulatory barriers limit data centralization for this purpose.
Machine learning can be conducted in a federated manner on patient datasets that share the same set of variables but are stored separately. However, federated learning cannot handle the case in which different data types for a given patient are separated vertically across organizations and patient ID matching across institutions is difficult. We call methods that enable machine learning model training on data separated along two or more dimensions "confederated machine learning," and we aim to develop such a method in this study.
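As a point of reference for the federated setting described above, the following is a minimal sketch of sample-size-weighted parameter averaging across silos that share the same variables. It is an illustration only and is not taken from the study; the function name `federated_average` and its arguments are hypothetical, and a real system would repeat this step over many training rounds.

```python
from typing import List, Sequence


def federated_average(silo_weights: Sequence[Sequence[float]],
                      silo_sizes: Sequence[int]) -> List[float]:
    """Average per-silo parameter vectors, weighted by silo sample size.

    Hypothetical helper for illustration; not the method used in the study.
    """
    if not silo_weights or len(silo_weights) != len(silo_sizes):
        raise ValueError("need one sample size per silo, and at least one silo")
    total = sum(silo_sizes)
    if total <= 0:
        raise ValueError("total sample size must be positive")
    dim = len(silo_weights[0])
    if any(len(w) != dim for w in silo_weights):
        raise ValueError("all silos must share the same parameter dimension")
    # Each coordinate is a weighted mean over silos; raw records never leave a silo.
    return [
        sum(w[j] * n for w, n in zip(silo_weights, silo_sizes)) / total
        for j in range(dim)
    ]
```

Only parameter vectors and sample counts are exchanged here, which is why this scheme presupposes that every silo holds the same variables; it offers nothing for the vertically separated, ID-unmatched case.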
We propose and evaluate confederated learning for training machine learning models to stratify the risk of several diseases across silos when data are horizontally separated by individual, vertically separated by data type, and separated by identity without patient ID matching. Intuitively, the confederated learning method can be understood as a distributed learning method that combines elements of representation learning, generative modeling, imputation, and data augmentation.
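The three kinds of separation named above can be made concrete with a toy partitioning routine. This is a sketch of the data layout only, not of the confederated learning method itself; the field names (`patient_id`, `state`, `diagnosis`, `medication`) and the function `split_confederated` are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]
SiloKey = Tuple[object, str]


def split_confederated(records: Sequence[Record],
                       data_types: Sequence[str],
                       group_key: str = "state") -> Dict[SiloKey, List[Record]]:
    """Partition records into silos keyed by (group, data type).

    Horizontal separation: individuals are divided by `group_key`.
    Vertical separation: each silo keeps a single data type.
    Identity separation: no identifier is copied into any silo, so rows
    from different silos cannot be linked back to the same patient.
    """
    silos: Dict[SiloKey, List[Record]] = {}
    for rec in records:
        for dtype in data_types:
            key = (rec[group_key], dtype)
            # Keep only the one data type; the patient identifier is dropped.
            silos.setdefault(key, []).append({dtype: rec[dtype]})
    return silos
```

With two groups and two data types this yields four silos, none of which can be joined to another by patient, which is the situation the abstract says federated learning alone cannot handle.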
Using nationwide health insurance claims, our confederated learning method achieves an AUCROC (area under the receiver operating characteristic curve) of 0.787 for diabetes prediction, 0.718 for psychological disorder prediction, and 0.698 for ischemic heart disease prediction.
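For readers unfamiliar with the metric, AUCROC equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc_roc` is a hypothetical choice, and the quadratic pairwise loop is kept for clarity rather than efficiency.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUCROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases.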
Our proposed confederated learning method successfully trained machine learning models on health insurance data separated by two or more dimensions.