O'Connell Nathaniel Sean, Speiser Jaime Lynn
Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, 27157, USA.
BMC Med Res Methodol. 2025 Apr 10;25(1):92. doi: 10.1186/s12874-025-02548-8.
Clustered data arise when observations are correlated within a group or sampling unit, and they are common in epidemiology, social sciences, education, linguistics, econometrics, and medicine. Given growing interest in clustered data, we developed a repository of open-source, publicly available clustered datasets that can be used for methodologic comparison. Traditionally, methods are evaluated and compared using data simulation studies, which can suffer from overly simplistic designs and potential for bias. Excellent data repositories exist for standard (non-clustered) datasets, such as OpenML and the Penn Machine Learning Benchmark repository, but few resources provide clustered data.
In this pilot study, we developed an R package called OpenClustered, which includes 19 clustered datasets with binary outcomes, drawn from various domains and varying in size and composition. We present tutorials for using OpenClustered, including examples of filtering and summarizing the datasets. We demonstrate its use with a small benchmarking study comparing frequentist and Bayesian implementations of generalized linear mixed models. All code and data are available on the OpenClustered GitHub page.
The OpenClustered R package is the start of a useful data resource for conducting benchmarking studies with open-source clustered data. It facilitates empirical methodologic guidance that is less prone to bias than guidance from data simulation studies, thereby improving rigor across diverse research fields. In the future, we plan to add more datasets, particularly those with continuous outcomes, and to add functionality for users to submit their own clustered datasets for inclusion in the repository.
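To make the idea of summarizing a clustered dataset with a binary outcome concrete, the following is a minimal Python sketch. It does not use or reproduce the OpenClustered R interface, which is not described here; the function name `summarize_clustered`, the `(cluster_id, outcome)` input layout, and the toy data are all assumptions made for illustration. It reports the kind of descriptors by which such datasets might be compared (number of clusters, cluster sizes, outcome prevalence) together with the one-way ANOVA estimate of the intraclass correlation, one common way of quantifying how strongly observations are correlated within clusters.

```python
from collections import defaultdict
from statistics import median


def summarize_clustered(rows):
    """Summarize a clustered dataset with a binary outcome.

    `rows` is an iterable of (cluster_id, outcome) pairs, outcome in {0, 1}.
    This layout is a hypothetical choice for the sketch, not a package format.
    """
    clusters = defaultdict(list)
    for cluster_id, y in rows:
        clusters[cluster_id].append(y)

    sizes = [len(ys) for ys in clusters.values()]
    n_obs, k = sum(sizes), len(sizes)
    if k < 2 or n_obs <= k:
        raise ValueError("need at least 2 clusters and more rows than clusters")

    prevalence = sum(sum(ys) for ys in clusters.values()) / n_obs

    # One-way ANOVA estimator of the intraclass correlation (ICC).
    # Between-cluster mean square: spread of cluster means around the overall mean.
    msb = sum(
        len(ys) * (sum(ys) / len(ys) - prevalence) ** 2 for ys in clusters.values()
    ) / (k - 1)
    # Within-cluster mean square: spread of outcomes around their cluster mean.
    msw = sum(
        (y - sum(ys) / len(ys)) ** 2 for ys in clusters.values() for y in ys
    ) / (n_obs - k)
    # Adjusted average cluster size for unbalanced designs.
    n0 = (n_obs - sum(n * n for n in sizes) / n_obs) / (k - 1)
    denom = msb + (n0 - 1) * msw
    icc = (msb - msw) / denom if denom > 0 else float("nan")

    return {
        "n_obs": n_obs,
        "n_clusters": k,
        "min_cluster_size": min(sizes),
        "median_cluster_size": median(sizes),
        "max_cluster_size": max(sizes),
        "prevalence": prevalence,
        "icc_anova": icc,
    }


# Toy data (invented for illustration): three clusters of four observations.
toy_rows = (
    [("A", 1)] * 4
    + [("B", 0)] * 4
    + [("C", 1), ("C", 0), ("C", 1), ("C", 0)]
)
summary = summarize_clustered(toy_rows)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The ANOVA estimator is only one of several ICC estimators for binary outcomes and can be negative in small samples; a model-based estimate from a fitted generalized linear mixed model would be a reasonable alternative.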