Yazdani Shahram, Henry Ronald Claude, Byrne Avery, Henry Isaac Claude
Department of Pediatrics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States.
Department of Civil Engineering, University of Southern California, Los Angeles, CA 90089, United States.
J Am Med Inform Assoc. 2025 Mar 1;32(3):526-534. doi: 10.1093/jamia/ocae314.
This study evaluates the utility of word embeddings, generated by large language models (LLMs), for medical diagnosis by comparing the semantic proximity of symptoms to their eponymic disease embedding ("eponymic condition") and the mean of all symptom embeddings associated with a disease ("ensemble mean").
Symptom data for 5 diagnostically challenging pediatric diseases (CHARGE syndrome, Cowden disease, POEMS syndrome, Rheumatic fever, and Tuberous sclerosis) were collected from PubMed. Using the Ada-002 embedding model, disease names and symptoms were translated into vector representations in a high-dimensional space. Euclidean and Chebyshev distance metrics were then used to classify each symptom by its proximity to two kinds of reference point: the eponymic condition and the ensemble mean of the condition's symptoms.
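The classification step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-dimensional toy vectors stand in for Ada-002 embeddings, the disease labels and all function names are hypothetical, and the sketch assumes a symptom is assigned to whichever disease's reference point is nearest. Whether the study excluded a symptom from its own ensemble mean is not stated, so no such exclusion is applied here.

```python
import numpy as np


def euclidean(a, b):
    # Straight-line (L2) distance between two vectors.
    return float(np.linalg.norm(a - b))


def chebyshev(a, b):
    # Largest absolute difference along any single dimension (L-infinity).
    return float(np.max(np.abs(a - b)))


def ensemble_means(symptom_vectors):
    # One reference point per disease: the mean of its symptom vectors.
    return {d: np.mean(np.stack(v), axis=0) for d, v in symptom_vectors.items()}


def nearest_label(vec, references, metric):
    # Assumption: a symptom is assigned to the disease whose reference
    # point is closest under the chosen metric.
    return min(references, key=lambda label: metric(vec, references[label]))


def accuracy(symptom_vectors, references, metric):
    # Fraction of each disease's symptoms assigned back to that disease.
    result = {}
    for disease, vecs in symptom_vectors.items():
        hits = sum(nearest_label(v, references, metric) == disease for v in vecs)
        result[disease] = hits / len(vecs)
    return result


# Toy stand-ins for embeddings (hypothetical values, not study data).
toy_symptoms = {
    "disease_A": [np.array([1.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.0])],
    "disease_B": [np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.9, 0.2])],
}
# Hypothetical "eponymic" vectors, i.e. embeddings of the disease names.
toy_name_vectors = {
    "disease_A": np.array([0.0, 0.0, 1.0]),
    "disease_B": np.array([0.1, 0.1, 0.9]),
}

if __name__ == "__main__":
    means = ensemble_means(toy_symptoms)
    print("ensemble mean:", accuracy(toy_symptoms, means, euclidean))
    print("eponymic:", accuracy(toy_symptoms, toy_name_vectors, euclidean))
```

Passing `chebyshev` instead of `euclidean` switches the metric without changing anything else, which mirrors the comparison across the two distance metrics.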
The ensemble mean approach showed significantly higher classification accuracy, correctly classifying between 80% (Cowden disease) and 100% (Tuberous sclerosis) of the sample disease symptoms using the Euclidean distance metric. In contrast, the eponymic condition approach, under both the Euclidean and Chebyshev distance metrics, generally showed poor symptom classification performance: results were erratic (0%-100% accuracy) and largely fell between 0% and 3% accuracy.
The ensemble mean captures a disease's collective symptom profile, providing a more nuanced representation than the disease name alone. However, some misclassifications were due to superficial semantic similarities, highlighting the need for LLMs trained on medical corpora.
The ensemble mean of symptom embeddings improves classification accuracy over the eponymic condition approach. Future efforts should focus on medical-specific training of LLMs to enhance their diagnostic accuracy and clinical utility.