Key Laboratory of Cancer Prevention and Therapy, Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.
Department of Epidemiology and Biostatistics, Key Laboratory of Molecular Cancer Epidemiology of Tianjin, Key Laboratory of Cancer Prevention and Therapy, Tianjin's Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Medical University, Tianjin, 300060, China.
Brief Bioinform. 2024 Jul 25;25(5). doi: 10.1093/bib/bbae430.
Instruction-tuned large language models (LLMs) demonstrate exceptional ability to align with human intentions. We present an LLM-based model, the instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved an area under the receiver operating characteristic curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773-0.959] for cancer diagnosis and 0.924 (95% CI, 0.841-1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs: with 64 end-motifs, the AUROC reached 0.886 (95% CI, 0.794-0.977) for cancer diagnosis and 0.956 (95% CI, 0.89-1.0) for HCC detection. On an external testing set, iLLMAC with 64 end-motifs achieved an AUROC of 0.912 (95% CI, 0.849-0.976) for cancer diagnosis and 0.938 (95% CI, 0.885-0.992) for HCC detection, significantly outperforming the benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets generated with bisulfite sequencing and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.
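As an illustration of what a cfDNA end-motif profile can look like, the following minimal Python sketch tallies the first k bases at the 5' end of each fragment and normalizes the counts to frequencies. It is not the authors' pipeline: the function name `end_motif_profile`, the toy sequences, and the choice of motif length are all assumptions made for this example. The abstract does not state how the 16 or 64 end-motifs were defined; the sketch simply assumes all possible k-mers, which gives 16 motifs for k = 2 and 64 for k = 3.

```python
from collections import Counter
from itertools import product


def end_motif_profile(fragments, k=2):
    """Return the relative frequency of each k-mer found at fragment 5' ends.

    Hypothetical helper for illustration; `fragments` is an iterable of
    fragment sequences written 5' to 3'.
    """
    # Enumerate every possible k-mer so that absent motifs get frequency 0.
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for seq in fragments:
        motif = seq[:k].upper()
        # Skip fragments that are too short or carry ambiguous bases (e.g. N).
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in motifs}


# Toy fragment sequences, made up for illustration (not real cfDNA reads).
toy_fragments = ["CCCATTG", "CCAGGTA", "ccTTAGC", "NNACGTA", "AAAGCTT"]
profile = end_motif_profile(toy_fragments, k=2)
```

A profile of this kind is a fixed-length vector of frequencies, one per motif, which could then be serialized into a text prompt or passed to a conventional classifier; how iLLMAC encodes the profile for instruction-tuning is not described in the abstract.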