Rae Alastair R, Savery Max E, Mork James G, Demner-Fushman Dina
Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD.
AMIA Annu Symp Proc. 2020 Mar 4;2019:727-734. eCollection 2019.
MEDLINE is the National Library of Medicine's premier bibliographic database for biomedical literature. A highly valuable feature of the database is that each record is manually indexed with a controlled vocabulary called MeSH. Most MEDLINE journals are indexed cover-to-cover, but about 200 journals are indexed selectively: only their articles related to biomedicine and the life sciences are indexed. In recent years, this selection process has become a growing burden for indexing staff. This paper presents a machine learning-based system that semi-automates the task and offers substantial time savings. At the core of the system is a high-recall classifier that identifies journal articles that are in scope for MEDLINE. The system is shown to reduce the number of articles requiring manual review by 54%, equivalent to approximately 40,000 articles per year.
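One way to picture how a high-recall classifier turns into a reduction in manual review is sketched below. This is a minimal illustration, not the authors' method: the abstract does not say how the decision threshold is chosen, so the sketch assumes the classifier outputs a score per article and that the threshold is set on labeled validation data to meet a target recall. The function names, scores, and labels are all hypothetical.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold that keeps at least `target_recall`
    of the in-scope (label 1) articles at or above it."""
    positives = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one in-scope example")
    # Number of in-scope articles that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def workload_reduction(scores, threshold):
    """Fraction of articles scoring below the threshold, i.e. the share
    that would no longer be sent to manual review."""
    return sum(s < threshold for s in scores) / len(scores)


# Hypothetical validation data: classifier scores and in-scope labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

threshold = threshold_for_recall(scores, labels, target_recall=0.75)
reduction = workload_reduction(scores, threshold)
```

With this toy data, three of the four in-scope articles score at or above the chosen threshold, and half of all articles fall below it. Raising the target recall lowers the threshold and shrinks the savings, which is the trade-off a figure such as the reported 54% reduction reflects.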