Hillis Ethan, Bhattarai Kriti, Abrams Zachary
Institute for Informatics, Data Science and Biostatistics, Washington University School of Medicine in St. Louis, St. Louis, MO 63110, USA.
Department of Computer Science, Washington University in St. Louis, St. Louis, MO 63130, USA.
J Pers Med. 2024 Sep 25;14(10):1022. doi: 10.3390/jpm14101022.
Genetic data play a crucial role in diagnosing and treating various diseases, reflecting a growing imperative to integrate these data into clinical care. However, significant barriers such as the structure of electronic health records (EHRs), insurance costs for genetic testing, and the interpretability of genetic results impede this integration.
This paper explores solutions to these challenges by combining recent technological advances with informatics and data science, focusing on the diagnostic potential of artificial intelligence (AI) in cancer research. AI has historically been applied in medical research with limited success, but recent developments have led to the emergence of large language models (LLMs). These transformer-based generative AI models, trained on vast datasets, offer significant potential for genetic and genomic analyses. However, their effectiveness is constrained by their training on predominantly human-written text rather than comprehensive, structured genetic datasets.
This study reevaluates the capabilities of LLMs, specifically GPT models, in performing supervised prediction tasks using structured gene expression data. By comparing GPT models with traditional machine learning approaches, we assess their effectiveness in predicting cancer subtypes, demonstrating the potential of AI models to analyze real-world genetic data for generating real-world evidence.