Bryan Jordan G, Niu Hongqian, Li Didong
Department of Biostatistics, The University of North Carolina at Chapel Hill.
bioRxiv. 2025 May 5:2025.04.30.651464. doi: 10.1101/2025.04.30.651464.
We propose strategies for incorporating the information in large language models (LLMs) into statistical hypothesis tests in genomics studies. Using gene embeddings derived from text inputs to OpenAI's GPT-3.5 model, we show that biological signals in a variety of genomics datasets reside near the principal subspace spanned by the embeddings. We then use a frequentist and Bayesian (FAB) framework to propose three hypothesis tests that are optimal with respect to prior information based on the gene embedding subspace. In three separate real-world genomics examples, the FAB tests guided by the LLM-derived information achieve more power than classical counterparts.
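To illustrate the claim that biological signals lie near the principal subspace of the gene embeddings, the following is a minimal sketch, not the authors' code. It assumes a hypothetical embedding matrix with one row per gene, takes the leading left singular vectors as the principal subspace, and measures what fraction of a per-gene signal vector's squared norm falls inside that subspace. The function names, the dimensions, and the synthetic data are all assumptions made for illustration; the abstract does not specify how the subspace was computed or how closeness was quantified.

```python
import numpy as np


def principal_subspace(embeddings, k):
    """Orthonormal basis (genes x k) for the span of the top-k left
    singular vectors of a genes x dims embedding matrix."""
    u, _, _ = np.linalg.svd(embeddings, full_matrices=False)
    return u[:, :k]


def fraction_in_subspace(signal, basis):
    """Share of the signal's squared norm captured by its orthogonal
    projection onto the column space of `basis` (between 0 and 1)."""
    coords = basis.T @ signal
    return float(coords @ coords / (signal @ signal))


# Synthetic stand-ins (assumption): 200 genes, 32 embedding dimensions.
rng = np.random.default_rng(0)
n_genes, n_dims, k = 200, 32, 5
embeddings = rng.normal(size=(n_genes, n_dims))
basis = principal_subspace(embeddings, k)

# A signal built inside the subspace versus an unstructured one.
aligned_signal = basis @ rng.normal(size=k)
random_signal = rng.normal(size=n_genes)

aligned_fraction = fraction_in_subspace(aligned_signal, basis)
random_fraction = fraction_in_subspace(random_signal, basis)
print(f"aligned: {aligned_fraction:.3f}, random: {random_fraction:.3f}")
```

A signal constructed inside the subspace has a fraction of essentially 1, while an unstructured signal captures on average only about k / n_genes of its squared norm. A fraction well above that baseline on real data would be the kind of evidence the abstract describes, and it is this alignment that a prior built on the embedding subspace could exploit.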