Department of Biomedical Informatics, Harvard University, Cambridge, MA 02138, USA
Pac Symp Biocomput. 2020;25:647-658.
Clinical trials generate large amounts of data that remain underutilized because of obstacles to data sharing, including risks to patient privacy, data misrepresentation, and invalid secondary analyses. To address these obstacles, we developed a novel data sharing method that ensures patient privacy while also protecting the interests of clinical trial investigators. Our flexible and robust approach has two components: (1) an advanced cloud-based querying language that allows users to test hypotheses without direct access to the real clinical trial data and (2) corresponding synthetic data for the query of interest that allow for exploratory research and model development. Both components can be modified by the clinical trial investigator depending on factors such as the type of trial or the number of patients enrolled. To test the effectiveness of our system, we first implement a simple and robust permutation-based synthetic data generator. We then use the synthetic data generator coupled with our querying language to identify significant relationships among variables in a realistic clinical trial dataset.
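The abstract does not specify how the permutation-based synthetic data generator works, so the following is only a minimal sketch of one common reading of the idea: shuffle each column of a table independently, so that every variable keeps its marginal distribution while links between variables within a patient record are broken. The function name `permute_columns`, the column names, and the toy records are all hypothetical and are not taken from the paper.

```python
import random


def permute_columns(rows, seed=None):
    """Return a synthetic copy of `rows` with every column shuffled independently.

    Assumption (not from the paper): `rows` is a list of dicts that all share
    the same keys. Each column keeps exactly the same multiset of values, but
    values from different columns are no longer tied to the same patient.
    """
    rng = random.Random(seed)
    if not rows:
        return []
    columns = list(rows[0].keys())
    shuffled = {}
    for col in columns:
        values = [row[col] for row in rows]
        rng.shuffle(values)  # in-place shuffle of this column only
        shuffled[col] = values
    # Reassemble records row by row from the independently shuffled columns.
    return [{col: shuffled[col][i] for col in columns} for i in range(len(rows))]


# Toy, made-up records used purely for illustration.
trial = [
    {"age": 54, "arm": "treatment", "sbp": 131},
    {"age": 61, "arm": "control", "sbp": 142},
    {"age": 47, "arm": "treatment", "sbp": 125},
    {"age": 70, "arm": "control", "sbp": 150},
]

synthetic = permute_columns(trial, seed=0)
for record in synthetic:
    print(record)
```

Under this reading, the synthetic table is suitable for exploring variable types and ranges and for prototyping analysis code, while any relationship between variables would have to be tested against the real data through the querying language, as the abstract describes.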