Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin, Berlin, Germany.
Institute of Software Technology (IST), Koblenz University, Koblenz, Germany.
Orphanet J Rare Dis. 2024 Jul 15;19(1):265. doi: 10.1186/s13023-024-03254-2.
Researchers worldwide are working on projects that aim to improve the availability of data for rare disease research. Data sharing remains critical, but developing suitable methods for it is challenging because rare disease data are particularly sensitive and unique. This creates a dilemma: both the methods and the data needed to develop them are lacking. This work helps bridge that gap by providing synthetic datasets that can serve as a foundation for such developments.
Using a hierarchical data generation approach parameterised with publicly available statistics, we generated datasets reflecting a random sample of rare disease patients from the United States (US) population. General demographics were obtained from the US Census Bureau, while information on disease prevalence, initial diagnosis, survival rates, and race and sex ratios was obtained from the US Centers for Disease Control and Prevention and the scientific literature. The software, which we have named SynthMD, was implemented in Python as open source, using libraries such as Faker to generate individual data points.
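The idea of hierarchical generation can be illustrated with a minimal sketch, in which each attribute is drawn conditional on the attributes drawn before it. This is not the SynthMD implementation: all names (`SEX_RATIO`, `RACE_GIVEN_SEX`, `MAX_AGE_AT_DIAGNOSIS`, `generate_dataset`) are hypothetical, every number is a made-up placeholder rather than a published statistic, and the standard-library `random` module stands in for Faker so that the example has no dependencies.

```python
import random

# Made-up placeholder parameters for a fictional disease; NOT real statistics.
SEX_RATIO = {"female": 0.4, "male": 0.6}
RACE_GIVEN_SEX = {
    "female": {"group_a": 0.7, "group_b": 0.3},
    "male": {"group_a": 0.5, "group_b": 0.5},
}
MAX_AGE_AT_DIAGNOSIS = {"group_a": 10, "group_b": 5}


def draw(rng, distribution):
    """Draw one key from a {value: probability} mapping."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]


def generate_patient(rng):
    """Sample one record top-down: sex, then race given sex, then age given race."""
    sex = draw(rng, SEX_RATIO)
    race = draw(rng, RACE_GIVEN_SEX[sex])
    age = rng.randint(0, MAX_AGE_AT_DIAGNOSIS[race])  # inclusive bounds
    return {"sex": sex, "race": race, "age_at_diagnosis": age}


def generate_dataset(n, seed=0):
    """Generate n records; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [generate_patient(rng) for _ in range(n)]


if __name__ == "__main__":
    for record in generate_dataset(3, seed=42):
        print(record)
```

In such a design, swapping in different source statistics only requires changing the parameter tables, not the sampling code.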
We generated three datasets, each focusing on a rare disease that has a broad impact on US citizens and that differs from the others in the genders and racial groups it affects: Sickle Cell Disease, Cystic Fibrosis, and Duchenne Muscular Dystrophy. We present the statistics used to generate the datasets and study the statistical properties of the output data. The datasets and the code used to generate them are available as Open Data and Open Source Software.
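One simple way to study the statistical properties of generated data is to compare the observed proportions of an attribute with the proportions used as input. The sketch below is an assumption about how such a check could look, not the evaluation performed in the paper; the function names and the toy records are hypothetical.

```python
from collections import Counter


def observed_proportions(records, field):
    """Return the share of each value of `field` among the records."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def max_abs_deviation(observed, expected):
    """Largest absolute gap between observed and expected proportions."""
    keys = set(observed) | set(expected)
    return max(
        (abs(observed.get(k, 0.0) - expected.get(k, 0.0)) for k in keys),
        default=0.0,
    )


# Toy records and a made-up target ratio, for illustration only.
records = [{"sex": "male"}] * 6 + [{"sex": "female"}] * 4
expected = {"female": 0.4, "male": 0.6}
deviation = max_abs_deviation(observed_proportions(records, "sex"), expected)
```

A small deviation on a large sample suggests that the generator reproduces the input ratio; what counts as small enough depends on the sample size and is left to the user.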
The results of our work can serve as a starting point for researchers and developers working on methods and platforms that aim to improve the availability of rare disease data. Potential applications include using the datasets for testing purposes during the implementation of information systems or tailored privacy-enhancing technologies.