Human-augmented large language model-driven selection of glutathione peroxidase 4 as a candidate blood transcriptional biomarker for circulating erythroid cells.

Affiliations

The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA.

Williams College, Williamstown, MA, USA.

Publication information

Sci Rep. 2024 Oct 5;14(1):23225. doi: 10.1038/s41598-024-73916-5.

DOI:10.1038/s41598-024-73916-5

PMID:39369090

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11455862/

Abstract

The identification of optimal candidate genes from large-scale blood transcriptomic data is crucial for developing targeted assays to monitor immune responses. Here, we introduce a novel, optimized large language model (LLM)-based approach for prioritizing candidate biomarkers from blood transcriptional modules. Focusing on module M14.51 from the BloodGen3 repertoire, we implemented a multi-step LLM-driven workflow. Initial high-throughput screening used GPT-4, Claude 3, and Claude 3.5 Sonnet to score and rank the module's constituent genes across six criteria. Top candidates then underwent high-resolution scoring using Consensus GPT, with concurrent manual fact-checking and, when needed, iterative refinement of the scores based on user feedback. Qualitative assessment of literature-based narratives and analysis of reference transcriptome data further refined the selection process. This novel multi-tiered approach consistently identified Glutathione Peroxidase 4 (GPX4) as the top candidate gene for module M14.51. GPX4's role in oxidative stress regulation, its potential as a future drug target, and its expression pattern across diverse cell types supported its selection. The incorporation of reference transcriptome data further validated GPX4 as the most suitable candidate for this module. This study presents an advanced LLM-driven workflow with a novel optimized scoring strategy for candidate gene prioritization, incorporating human-in-the-loop augmentation. The approach identified GPX4 as a key gene in the erythroid cell-associated module M14.51, suggesting its potential utility for biomarker discovery and targeted assay development. By combining AI-driven literature analysis with iterative human expert validation, this method leverages the strengths of both artificial and human intelligence, potentially contributing to the development of biologically relevant and clinically informative targeted assays. Further validation studies are needed to confirm the broader applicability of this human-augmented AI approach.
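
The initial screening step, in which several LLMs score each gene across six criteria and the genes are then ranked, can be illustrated with a minimal sketch. The abstract does not state the criterion names or how scores were combined, so everything here is an assumption: the labels `criterion_1` to `criterion_6`, the function `rank_genes`, and the rule of averaging across models and summing across criteria are hypothetical, and the gene names and scores are toy placeholders rather than results from the study.

```python
from statistics import mean

# Hypothetical labels: the abstract mentions six criteria but does not name them.
CRITERIA = [f"criterion_{i}" for i in range(1, 7)]


def rank_genes(scores):
    """Rank genes from per-model, per-criterion scores.

    scores: {model_name: {gene: {criterion: numeric score}}}
    Returns a list of (gene, total) pairs, highest total first.
    Assumed aggregation (not taken from the paper): average each
    criterion across the models that scored the gene, then sum
    the six averages.
    """
    genes = set()
    for per_gene in scores.values():
        genes.update(per_gene)

    totals = {}
    for gene in genes:
        criterion_means = []
        for criterion in CRITERIA:
            values = [
                per_gene[gene][criterion]
                for per_gene in scores.values()
                if gene in per_gene
            ]
            criterion_means.append(mean(values))
        totals[gene] = sum(criterion_means)

    # Sort by descending total; break ties alphabetically for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy input with placeholder models, genes, and scores (illustrative only).
toy_scores = {
    "model_x": {
        "GENE_A": {c: 8 for c in CRITERIA},
        "GENE_B": {c: 5 for c in CRITERIA},
    },
    "model_y": {
        "GENE_A": {c: 6 for c in CRITERIA},
        "GENE_B": {c: 5 for c in CRITERIA},
    },
}

ranking = rank_genes(toy_scores)
top_candidates = [gene for gene, _ in ranking[:1]]
```

In a workflow like the one described, the top-ranked genes from such a screen would be passed on to the higher-resolution scoring and manual fact-checking stage; that human-in-the-loop stage is not modeled here.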
