Domagalski Marcin J, Lu Yin, Pilozzi Alexander, Williamson Alicia, Chilappagari Padmini, Luker Emma, Shelley Courtney D, Dabic Anya, Keller Michael A, Rodriguez Rebecca M, Lawlor Sharon, Thangudu Ratna R
Health Analytics, Research and Technology (HART), ICF, Rockville, MD 20850, United States.
Health and Life Sciences, Booz Allen Hamilton, Inc., McLean, VA 22102, United States.
J Am Med Inform Assoc. 2025 Oct 1;32(10):1609-1616. doi: 10.1093/jamia/ocaf114.
The success of artificial intelligence (AI) and machine learning (ML) approaches in biomedical research depends on the quality of the underlying data. The National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Data Centric Challenge was designed to make raw clinical research data AI-ready, with a focus on type 1 diabetes studies available in the NIDDK Central Repository (NIDDK-CR). This paper presents a structured methodology for enhancing the AI readiness of clinical datasets.
We detail a systematic approach to data aggregation and preprocessing, including binning continuous data, processing text features, managing missing values, and encoding categorical variables, while maintaining data integrity and compatibility with ML algorithms.
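The four preprocessing steps named above (binning, text processing, missing-value handling, categorical encoding) can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`age`, `hba1c`, `regimen`), the age bin edges, and the median-imputation choice are all hypothetical, chosen only to show how the steps fit together. The sketch assumes `age` is always present.

```python
from statistics import median

# Hypothetical bin edges and labels for age in years; not taken from the paper.
AGE_BINS = [(0, 13, "child"), (13, 18, "adolescent"), (18, 200, "adult")]


def bin_age(age):
    """Binning: map a continuous age to a labeled bin (lower bound inclusive)."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    return "unknown"


def clean_text(value):
    """Text processing: trim, lowercase, and collapse whitespace."""
    if value is None:
        return "unknown"
    cleaned = " ".join(value.split()).lower()
    return cleaned or "unknown"


def preprocess(records):
    """Return rows that are imputed, binned, text-normalized, and one-hot encoded."""
    # Missing values: impute with the median of observed values (an assumed
    # strategy) and keep an indicator so the imputation stays visible.
    observed = [r["hba1c"] for r in records if r["hba1c"] is not None]
    hba1c_median = median(observed)

    # Categorical encoding: derive the one-hot columns from the cleaned data.
    regimens = sorted({clean_text(r["regimen"]) for r in records})
    age_labels = [label for _, _, label in AGE_BINS]

    rows = []
    for r in records:
        row = {
            "hba1c": r["hba1c"] if r["hba1c"] is not None else hba1c_median,
            "hba1c_missing": int(r["hba1c"] is None),
        }
        age_label = bin_age(r["age"])
        for label in age_labels:
            row[f"age_{label}"] = int(age_label == label)
        regimen = clean_text(r["regimen"])
        for name in regimens:
            row[f"regimen_{name}"] = int(regimen == name)
        rows.append(row)
    return rows


# Illustrative records only; these values are made up for the example.
records = [
    {"age": 9, "hba1c": 7.2, "regimen": " Pump "},
    {"age": 15, "hba1c": None, "regimen": "MDI"},
    {"age": 34, "hba1c": 8.0, "regimen": None},
]
rows = preprocess(records)
```

Keeping a `hba1c_missing` indicator alongside the imputed value is one way to preserve data integrity, since a downstream model or analyst can still tell which values were observed and which were filled in.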
We applied the proposed methodology to transform raw clinical data from type 1 diabetes studies in the NIDDK-CR into a structured, AI-ready dataset. The evaluation process validated the effectiveness of our AI-readiness enhancement steps and explored the potential use cases in type 1 diabetes research.
The methodology discussed in this paper can serve as guidance for preparing data for AI-driven clinical research, and the resulting AI-ready data can serve as a training resource for building AI/ML models and improving their performance.
We present a generalizable framework for preparing clinical research data for AI applications. The resulting datasets lay a foundation for downstream AI/ML applications and data-driven discovery.