Li Shumin, Wang Yiding, Liu Chi-Man, Huang Yuanhua, Lam Tak-Wah, Luo Ruibang
Department of Computer Science, School of Computing and Data Science, University of Hong Kong, Hong Kong, 999077, China.
School of Biomedical Sciences, University of Hong Kong, Hong Kong, 999077, China.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf382.
Rare diseases affect over 300 million people worldwide and are often caused by genetic variants. While variant detection has become cost-effective, interpreting these variants, particularly collecting literature-based evidence such as the ACMG/AMP PM3 criterion, remains complex and time-consuming.
We present AutoPM3, a method that automates PM3 evidence extraction from the literature using open-source large language models (LLMs). AutoPM3 processes tables and text separately: a Text2SQL-based variant extractor handles tables, and a retrieval-augmented generation (RAG) module, enhanced by a variant-specific retriever and a fine-tuned LLM, handles text. We curated PM3-Bench, a dataset of 1027 variant-publication evidence pairs from ClinGen. On openly accessible pairs, AutoPM3 achieved 86.1% accuracy for variant hits and 72.5% recall for in trans variants, outperforming other methods, including those using larger models. A sequential ablation study showed the effectiveness of AutoPM3's key modules, especially the variant-specific retriever and Text2SQL. AutoPM3 located evidence in 76 s, demonstrating that open-source LLMs can offer an efficient, cost-effective solution for rare disease diagnosis.
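The table-handling idea can be illustrated with a minimal sketch. This is not AutoPM3's implementation: it only assumes that a publication table has already been parsed into rows with two allele columns per patient, loads them into an in-memory SQLite database, and runs a query of the kind a Text2SQL model might emit. The table schema, the example variants, and the helper names (build_table, generate_sql, find_in_trans_candidates) are all hypothetical, and generate_sql returns a fixed string where a real system would call an LLM.

```python
import sqlite3

# Hypothetical rows standing in for a table parsed from a publication.
# Columns: patient id, first allele, second allele (all values invented).
ROWS = [
    ("P1", "c.100A>G", "c.250del"),
    ("P2", "c.100A>G", "c.100A>G"),
    ("P3", "c.77C>T", "c.250del"),
]


def build_table(rows):
    """Load parsed table rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patients (patient_id TEXT, allele1 TEXT, allele2 TEXT)"
    )
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    return conn


def generate_sql(question):
    """Placeholder for the Text2SQL step (assumption: an LLM would
    translate the question into SQL; here the query is hard-coded)."""
    return (
        "SELECT patient_id, allele1, allele2 FROM patients "
        "WHERE allele1 = ? OR allele2 = ? ORDER BY patient_id"
    )


def find_in_trans_candidates(conn, variant):
    """Return (patient_id, partner_variant) pairs where the query variant
    is listed next to a different variant. These are only candidates:
    co-occurrence in a table does not by itself confirm phase."""
    sql = generate_sql(f"Which patients carry {variant}?")
    partners = []
    for patient_id, allele1, allele2 in conn.execute(sql, (variant, variant)):
        other = allele2 if allele1 == variant else allele1
        if other != variant:  # skip homozygous rows
            partners.append((patient_id, other))
    return partners


if __name__ == "__main__":
    conn = build_table(ROWS)
    print(find_in_trans_candidates(conn, "c.100A>G"))
```

Querying the table through SQL rather than passing it to the model as raw text keeps the lookup exact, which matters when variant names differ by a single character. The query parameters are bound rather than interpolated, so a variant string cannot alter the SQL.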
AutoPM3 is implemented and freely available under the MIT license at https://github.com/HKU-BAL/AutoPM3.