Unified Clinical Vocabulary Embeddings for Advancing Precision Medicine.

Author information

Johnson Ruth, Gottlieb Uri, Shaham Galit, Eisen Lihi, Waxman Jacob, Devons-Sberro Stav, Ginder Curtis R, Hong Peter, Sayeed Raheel, Reis Ben Y, Balicer Ran D, Dagan Noa, Zitnik Marinka

Affiliations

The Ivan and Francesca Berkowitz Family Living Laboratory Collaboration at Harvard Medical School and Clalit Research Institute, Boston, MA, USA.

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Publication information

medRxiv. 2024 Dec 10:2024.12.03.24318322. doi: 10.1101/2024.12.03.24318322.

DOI:10.1101/2024.12.03.24318322

PMID:39677476

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11643188/

Abstract

Integrating clinical knowledge into AI remains challenging despite numerous medical guidelines and vocabularies. Medical codes, central to healthcare systems, often reflect operational patterns shaped by geographic factors, national policies, insurance frameworks, and physician practices rather than the precise representation of clinical knowledge. This disconnect hampers AI in representing clinical relationships, raising concerns about bias, transparency, and generalizability. Here, we developed a resource of 67,124 clinical vocabulary embeddings derived from a clinical knowledge graph tailored to electronic health record vocabularies, spanning over 1.3 million edges. Using graph transformer neural networks, we generated clinical vocabulary embeddings that provide a new representation of clinical knowledge by unifying seven medical vocabularies. These embeddings were validated through a phenotype risk score analysis involving 4.57 million patients from Clalit Healthcare Services, effectively stratifying individuals based on survival outcomes. Inter-institutional panels of clinicians evaluated the embeddings for alignment with clinical knowledge across 90 diseases and 3,000 clinical codes, confirming their robustness and transferability. This resource addresses gaps in integrating clinical vocabularies into AI models and training datasets, paving the way for knowledge-grounded population and patient-level models.
