MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition.

Authors

Yeung Cheng S, Beck Tim, Posma Joram M

Affiliations

Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Faculty of Medicine, Imperial College London, London SW7 2AZ, UK.

Department of Genetics and Genome Biology, University of Leicester, Leicester LE1 7RH, UK.

Publication information

Metabolites. 2022 Mar 22;12(4):276. doi: 10.3390/metabo12040276.

Abstract

Reviewing the metabolomics literature is becoming increasingly difficult because of the rapid expansion of relevant journal literature. Text-mining technologies are therefore needed to make literature reviews more efficient. Here we contribute a standardised corpus of full-text publications from metabolomics studies and describe the development of two metabolite named entity recognition (NER) methods. Both methods are based on Bidirectional Long Short-Term Memory (BiLSTM) networks, and each incorporates different transfer learning techniques (for tokenisation and word embedding). Our first model (MetaboListem) follows prior methodology using GloVe word embeddings. Our second model exploits BERT and BioBERT for embedding and is named TABoLiSTM (Transformer-Affixed BiLSTM). The methods are trained on a novel corpus annotated using rule-based methods, and evaluated on manually annotated metabolomics articles. MetaboListem (F1-score 0.890, precision 0.892, recall 0.888) and TABoLiSTM (BioBERT version: F1-score 0.909, precision 0.926, recall 0.893) achieve state-of-the-art performance on metabolite NER. A training corpus of sentences from more than 1000 full-text Open Access metabolomics publications, containing 105,335 annotated metabolites, was created, as well as a manually annotated test corpus (19,138 annotations). This work demonstrates that deep learning algorithms can identify metabolite names in text accurately and efficiently. The proposed corpus and NER algorithms can be used for metabolomics text-mining tasks such as information retrieval, document classification and literature-based discovery, and are available from the omicsNLP GitHub repository.
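The F1-score reported above is conventionally the harmonic mean of precision and recall. The following is a minimal sketch of that relationship, not the authors' evaluation code: the helper name `f1_score` and the `reported` dictionary are illustrative assumptions, and only the precision, recall and F1 values are taken from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the abstract.
# The dictionary layout is an illustrative assumption.
reported = {
    "MetaboListem": (0.892, 0.888, 0.890),
    "TABoLiSTM (BioBERT)": (0.926, 0.893, 0.909),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: computed F1 = {f1:.3f}, reported F1 = {f1_reported:.3f}")
```

Rounded to three decimals, the computed values match the reported F1-scores for both models, so the three figures given for each model are mutually consistent.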
