Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database.

Author information

Smalheiser Neil R, Cohen Aaron M

Affiliations

Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, 1601 West Taylor Street, MC912, Chicago, IL 60612

Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA 97239.

Publication information

Data Inf Manag. 2018 Jun;2(1):27-36. doi: 10.2478/dim-2018-0004. Epub 2018 May 22.

Abstract

Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and employ machine learning algorithms. At present, each research group tackles each problem from scratch and in isolation from other projects, which causes redundancy and a great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve both as a shared resource for machine learning projects and as a public repository for their outputs. We will initially focus on a specific goal, namely, classifying articles according to Publication Type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning-based goals and projects, and how it can also be used as a public platform for disseminating the results of NLP tools to end users.
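The idea of using multiple, heterogeneous similarity measures as model input can be illustrated with a minimal Python sketch. It is not the authors' implementation: the metadata fields (`title`, `mesh`, `journal`), the toy records, and the choice of cosine, Jaccard, and exact-match similarity are all assumptions made for illustration. Each article is scored by its mean similarity to a set of positive training examples, giving one feature per measure.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap between two collections of terms (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similarity_features(article, positives):
    """Return [title, MeSH, journal] mean similarities to the positive examples.

    The field names are hypothetical; a real system would map them to
    whatever metadata it has available.
    """
    if not positives:
        raise ValueError("at least one positive example is required")
    n = len(positives)
    tokens = article["title"].lower().split()
    title_sim = sum(cosine(tokens, p["title"].lower().split()) for p in positives) / n
    mesh_sim = sum(jaccard(article["mesh"], p["mesh"]) for p in positives) / n
    journal_sim = sum(article["journal"] == p["journal"] for p in positives) / n
    return [title_sim, mesh_sim, journal_sim]


# Toy records, invented for illustration only.
positives = [
    {"title": "A randomized controlled trial of drug A",
     "mesh": ["Humans", "Double-Blind Method"], "journal": "Trial Journal"},
    {"title": "Randomized trial of drug B versus placebo",
     "mesh": ["Humans", "Placebos", "Double-Blind Method"], "journal": "Trial Journal"},
]
candidate = {"title": "Randomized placebo controlled trial of drug C",
             "mesh": ["Humans", "Placebos"], "journal": "Trial Journal"}
review = {"title": "Review of gene expression in yeast",
          "mesh": ["Saccharomyces cerevisiae"], "journal": "Yeast Reports"}

candidate_features = similarity_features(candidate, positives)
review_features = similarity_features(review, positives)
```

In this sketch, the resulting three-element vectors would be the input to whatever classifier is chosen. Because each measure draws on a different metadata field, a weak signal in one field (for example, a short title) can be offset by another.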
