Yang Doris, Zhou Doudou, Cai Steven, Gan Ziming, Pencina Michael, Avillach Paul, Cai Tianxi, Hong Chuan
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.
Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore.
JMIR Med Inform. 2025 Jan 22;13:e54133. doi: 10.2196/54133.
Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large-scale cohort studies are both time- and resource-intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, because cohorts encode their variables differently, accurate variable harmonization is difficult.
We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies.
SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was developed using 3 National Institutes of Health cohorts: the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner.
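As a minimal sketch of the scoring step described above, the following computes pairwise cosine similarity between variable embeddings. The function name `pairwise_cosine` and the toy 3-dimensional vectors are illustrative assumptions; the abstract does not specify how SONAR's embeddings are produced or what their dimensions are.

```python
import numpy as np

def pairwise_cosine(emb):
    """Return the matrix of cosine similarities between all rows of emb."""
    # Normalize each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is the cosine of the angle between them.
    return unit @ unit.T

# Hypothetical embeddings for three variables (one row per variable).
embeddings = np.array([
    [1.0, 0.0, 0.0],   # variable A
    [2.0, 0.0, 0.0],   # variable B: same direction as A, different magnitude
    [0.0, 1.0, 0.0],   # variable C: orthogonal to A and B
])

sim = pairwise_cosine(embeddings)
print(np.round(sim, 3))
```

Because cosine similarity ignores vector magnitude, variables A and B score 1 against each other, while C scores 0 against both. In a harmonization setting, candidate matches for a variable would be ranked by these scores.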
The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons, as measured by area under the curve and top-k accuracy. Notably, SONAR significantly improved harmonization of concepts that were difficult for existing semantic methods to harmonize.
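The two metrics named above can be illustrated with a small sketch. It assumes a score matrix with one row per query variable and one column per candidate variable, plus a single gold standard match per query. The function names, the toy scores, and the one-match-per-query setup are assumptions for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def top_k_accuracy(scores, true_idx, k):
    """Fraction of queries whose gold match is among the k highest-scoring candidates."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

def auc(scores, labels):
    """Probability that a random matching pair outscores a random non-matching pair.

    Ties count as one half (the Mann-Whitney formulation of ROC AUC).
    """
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Hypothetical similarity scores: 3 query variables x 3 candidate variables.
scores = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.5, 0.6, 0.1],
])
true_idx = [0, 2, 0]  # gold standard match for each query (assumed)

# One-hot label matrix marking the gold pairs.
labels = np.zeros_like(scores, dtype=bool)
labels[np.arange(len(true_idx)), true_idx] = True

top1 = top_k_accuracy(scores, true_idx, k=1)
top2 = top_k_accuracy(scores, true_idx, k=2)
pair_auc = auc(scores, labels)
print(top1, top2, pair_auc)
```

In this toy case the third query ranks its gold match second, so top-1 accuracy is 2/3 while top-2 accuracy is 1. The AUC is 17/18 because only one matching pair (score 0.5) falls below one non-matching pair (score 0.6).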
SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning.