MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction.

Author information

Gu Wenhao, Yang Xiao, Yang Minhao, Han Kun, Pan Wenying, Zhu Zexuan

Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

GeneGenieDx Corp, San Jose, CA 95134, USA.

Publication information

Bioinform Adv. 2022 May 13;2(1):vbac035. doi: 10.1093/bioadv/vbac035. eCollection 2022.

Abstract

MOTIVATION

Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) into structured information. Recent years have seen fundamental advances in NLP techniques, which are now widely used in applications such as financial text mining, news recommendation and machine translation. However, applying NLP in the biomedical domain remains challenging because labeled data are scarce and biological terminology is ambiguous and inconsistent. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations between biomedical entities are valuable: they can survey the available literature more thoroughly and therefore yield less biased results than manual curation. In addition, the speed of machine reading helps to orient research and development quickly.

RESULTS

To address these needs, we built three components into MarkerGenie: automatic training data labeling, rule-based cleaning of biological terminology, and a more accurate NLP model for binary associative and multi-relation prediction. We demonstrated the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies.
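
The abstract does not say how the automatic labeling or the term cleaning is implemented. As a rough illustration only, the sketch below shows one common way such a pipeline can be approached: dictionary matching with simple normalization rules, followed by distant supervision against a list of known pairs. It is not MarkerGenie's method. Every name in it (`clean_term`, `find_terms`, `label_sentence`, the term lists and the `known_pairs` set) is hypothetical and chosen for this example.

```python
import re
from itertools import product
from typing import Iterable, List, Set, Tuple


def clean_term(term: str) -> str:
    """Illustrative cleaning rules: lowercase, unify dashes, collapse spaces."""
    term = term.lower().strip()
    term = re.sub(r"[\u2010-\u2015]", "-", term)  # Unicode dashes -> ASCII hyphen
    term = re.sub(r"\s*-\s*", "-", term)          # "il - 6" -> "il-6"
    term = re.sub(r"\s+", " ", term)              # collapse runs of whitespace
    return term


def find_terms(sentence: str, terms: Iterable[str]) -> List[str]:
    """Return the cleaned terms that occur in the sentence as whole tokens."""
    text = clean_term(sentence)
    found = []
    for term in sorted({clean_term(t) for t in terms}):
        # Lookarounds stop a term from matching inside a longer alphanumeric token.
        pattern = rf"(?<![a-z0-9]){re.escape(term)}(?![a-z0-9])"
        if re.search(pattern, text):
            found.append(term)
    return found


def label_sentence(
    sentence: str,
    markers: Iterable[str],
    diseases: Iterable[str],
    known_pairs: Set[Tuple[str, str]],
) -> List[Tuple[str, str, int]]:
    """Label each co-occurring (marker, disease) pair: 1 if the pair is in
    known_pairs (assumed to hold already-cleaned terms), otherwise 0."""
    found_markers = find_terms(sentence, markers)
    found_diseases = find_terms(sentence, diseases)
    return [
        (m, d, int((m, d) in known_pairs))
        for m, d in product(found_markers, found_diseases)
    ]


example = label_sentence(
    "Elevated IL - 6 was associated with colorectal cancer, but not with asthma.",
    markers={"IL-6"},
    diseases={"Colorectal Cancer", "asthma"},
    known_pairs={("il-6", "colorectal cancer")},
)
print(example)
```

Labels produced this way are noisy, because co-occurrence in a sentence does not guarantee that the sentence asserts the relation. A trained relation-extraction model, such as the one the abstract describes, would be needed to tell real associations from incidental mentions.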

AVAILABILITY AND IMPLEMENTATION

MarkerGenie is available at https://www.genegeniedx.com/markergenie/. Data for model training and evaluation, term lists of biomedical entities, details of the case studies and all trained models are provided at https://drive.google.com/drive/folders/14RypiIfIr3W_K-mNIAx9BNtObHSZoAyn?usp=sharing.

SUPPLEMENTARY INFORMATION

Supplementary data are available online.
