MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction.

Author information

Gu Wenhao, Yang Xiao, Yang Minhao, Han Kun, Pan Wenying, Zhu Zexuan

Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

GeneGenieDx Corp, San Jose, CA 95134, USA.

Publication information

Bioinform Adv. 2022 May 13;2(1):vbac035. doi: 10.1093/bioadv/vbac035. eCollection 2022.

DOI:10.1093/bioadv/vbac035

PMID:36699388

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9710573/

Abstract

MOTIVATION

Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) into structured information. In recent years, we have witnessed fundamental advances in NLP techniques, which have been widely used in applications such as financial text mining, news recommendation and machine translation. However, their application in the biomedical space remains challenging owing to a lack of labeled data and to the ambiguities and inconsistencies of biological terminology. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations between biomedical entities are valuable because they can survey the available literature more thoroughly and thus give a less biased result than manual curation. In addition, the speed of machine reading helps orient research and development quickly.

RESULTS

To address these needs, we built automatic training data labeling, rule-based biological terminology cleaning and a more accurate NLP model for binary associative and multi-relation prediction into the program. We demonstrated the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies.

AVAILABILITY AND IMPLEMENTATION

MarkerGenie is available at https://www.genegeniedx.com/markergenie/. Data for model training and evaluation, term lists of biomedical entities, details of the case studies and all trained models are provided at https://drive.google.com/drive/folders/14RypiIfIr3W_K-mNIAx9BNtObHSZoAyn?usp=sharing.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics Advances online.

Similar articles

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.

STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs.

Bioinformatics. 2022 Mar 4;38(6):1648-1656. doi: 10.1093/bioinformatics/btac001.

A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

Bioformer: an efficient transformer language model for biomedical text mining.

ArXiv. 2023 Feb 3:arXiv:2302.01588v1.

Biomedical named entity recognition and linking datasets: survey and our recent development.

Brief Bioinform. 2020 Dec 1;21(6):2219-2238. doi: 10.1093/bib/bbaa054.

Automatically Detecting Failures in Natural Language Processing Tools for Online Community Text.

J Med Internet Res. 2015 Aug 31;17(8):e212. doi: 10.2196/jmir.4612.

An Ontology-Enabled Natural Language Processing Pipeline for Provenance Metadata Extraction from Biomedical Text (Short Paper).

On Move Meaningful Internet Syst. 2016 Oct;10033:699-708. doi: 10.1007/978-3-319-48472-3_43. Epub 2016 Oct 18.

BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets.

J Biomed Inform. 2023 Oct;146:104487. doi: 10.1016/j.jbi.2023.104487. Epub 2023 Sep 4.

BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.

PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.

Cited by

PuMA: PubMed gene/cell type-relation Atlas.

BMC Bioinformatics. 2025 Jul 29;26(1):201. doi: 10.1186/s12859-025-06236-8.

Bridging artificial intelligence and biological sciences: a comprehensive review of large language models in bioinformatics.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf357.

Current Bioinformatics Tools in Precision Oncology.

MedComm (2020). 2025 Jul 9;6(7):e70243. doi: 10.1002/mco2.70243. eCollection 2025 Jul.

A natural language processing system for the efficient extraction of cell markers.

Sci Rep. 2024 Sep 11;14(1):21183. doi: 10.1038/s41598-024-72204-6.

Investigating Cross-Domain Binary Relation Classification in Biomedical Natural Language Processing.

AMIA Jt Summits Transl Sci Proc. 2024 May 31;2024:384-390. eCollection 2024.
