Recognizing software names in biomedical literature using machine learning.

Author affiliations

The University of Texas Health Science Center at Houston, USA.

Johns Hopkins University, USA.

Publication information

Health Informatics J. 2020 Mar;26(1):21-33. doi: 10.1177/1460458219869490. Epub 2019 Sep 30.

DOI:10.1177/1460458219869490

PMID:31566474

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7334865/

Abstract

Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are built mainly through manual curation, which is time-consuming and does not scale. In this study, we manually annotated software names in 1,120 MEDLINE abstracts and titles and used the resulting corpus to develop and evaluate machine learning-based named entity recognition systems for biomedical software. We proposed two feature engineering strategies: (1) domain knowledge features and (2) unsupervised word representation features based on clustered and binarized word embeddings. Under inexact matching criteria, our best system achieved an F-measure of 91.79% for recognizing software in titles and 86.35% for recognizing software in both titles and abstracts. We then used the system to create a biomedical software catalog with 19,557 entries. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
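The abstract reports F-measures under inexact matching without defining the criterion. The sketch below illustrates one common reading, in which a predicted span counts as correct if it overlaps a gold span by at least one character. That definition, the function names `overlaps` and `prf`, and the example span offsets are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def prf(gold: List[Span], pred: List[Span], exact: bool) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure under exact or inexact matching.

    Assumed criterion: exact requires identical boundaries,
    inexact requires any character overlap.
    """
    if exact:
        match = lambda g, p: g == p
    else:
        match = overlaps
    # A prediction is correct if it matches some gold span;
    # a gold span is found if some prediction matches it.
    correct_pred = sum(any(match(g, p) for g in gold) for p in pred)
    found_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = correct_pred / len(pred) if pred else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical title text: "ToolX 2: a made-up aligner"
gold = [(0, 7)]  # gold annotation covers "ToolX 2"
pred = [(0, 5)]  # system output covers only "ToolX" (boundary error)

exact_scores = prf(gold, pred, exact=True)
inexact_scores = prf(gold, pred, exact=False)
print("exact:  ", exact_scores)
print("inexact:", inexact_scores)
```

In this toy case the boundary error is scored as a miss under exact matching and as a hit under inexact matching, which is why inexact scores are typically the higher of the two.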
