Gootjes-Dreesbach Luise, Sood Meemansa, Sahay Akrishta, Hofmann-Apitius Martin, Fröhlich Holger
UCB Pharma (UCB Celltech Ltd.), Slough, United Kingdom.
Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany.
Front Big Data. 2020 May 28;3:16. doi: 10.3389/fdata.2020.00016. eCollection 2020.
In the area of Big Data, one of the major obstacles to progress in biomedical research is the existence of data "silos": legal and ethical constraints often do not allow sensitive patient data from clinical studies to be shared across institutions. While federated machine learning now allows models to be built from scattered data of the same format, there is still a need to investigate, mine, and understand data from separate and very differently designed clinical studies that can only be accessed within each of the data-hosting organizations. Simulating sufficiently realistic virtual patients based on the data within each individual organization could be a way to fill this gap. In this work, we propose a new machine learning approach, the Variational Autoencoder Modular Bayesian Network (VAMBN), to learn a generative model of longitudinal clinical study data. VAMBN addresses typical key aspects of such data, namely limited sample size coupled with comparably many variables of different numerical scales and statistical properties, and many missing values. We show that with VAMBN, we can simulate virtual patients in a sufficiently realistic manner while providing theoretical guarantees on data privacy. In addition, VAMBN allows for simulating counterfactual scenarios. Hence, VAMBN could facilitate data sharing as well as the design of clinical trials.