School of Software Engineering, South China University of Technology, Guangzhou, China.
School of Data Science and Information Engineering, Guizhou Minzu University, Guiyang, China.
BMC Genomics. 2024 Nov 9;25(1):1062. doi: 10.1186/s12864-024-10978-9.
Sequencing-based genetic testing is widely used in biomedical research, including the detection of pathogenic microorganisms with metagenomic next-generation sequencing (mNGS). Applying sequencing results to clinical diagnosis and treatment relies on various interpretation knowledge bases. Existing knowledge bases are built primarily through manual knowledge extraction, which requires professionals to read extensive literature and extract the relevant knowledge, a time-consuming and costly process. Manual extraction also unavoidably introduces subjective bias. In this study, we aimed to automatically extract knowledge for interpreting mNGS results.
We propose a novel approach that automatically extracts pathogenic microorganism knowledge using a question-answering (QA) model. First, because no pathogenic microorganism QA dataset is available for training the model, we construct the MicrobeDB dataset. It contains 3,161 samples from 618 published papers covering 224 pathogenic microorganisms. Then, we fine-tune the selected baseline model on MicrobeDB. Finally, we use ChatGPT to enhance the diversity of the training data and employ data expansion to increase its volume.
Our method achieves an Exact Match (EM) score of 88.39% and an F1 score of 93.18% on the MicrobeDB test set. We also conduct ablation studies on the proposed data augmentation method. In addition, we perform comparative experiments with the ChatPDF tool, which is based on the ChatGPT API, to demonstrate the effectiveness of the proposed method.
Our method is effective and valuable for extracting pathogenic microorganism knowledge.
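For readers unfamiliar with the EM and F1 scores reported above, the following is a minimal sketch of how these two metrics are commonly computed for extractive QA. It assumes the SQuAD-style convention (lowercasing, stripping punctuation and English articles, token-level overlap); the abstract does not state the exact evaluation script used, and the function names and example strings here are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average EM and F1 over (prediction, reference) pairs, as percentages."""
    n = len(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / n
    f1 = sum(f1_score(p, r) for p, r in pairs) / n
    return {"EM": 100.0 * em, "F1": 100.0 * f1}
```

EM rewards only answers that match the reference exactly after normalization, whereas F1 gives partial credit for overlapping tokens, which is why a reported F1 is typically higher than the corresponding EM.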