Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection.

Affiliation

Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.

Publication information

AMIA Annu Symp Proc. 2022 Feb 21;2021:677-686. eCollection 2021.

Abstract

Sentence boundary detection (SBD) is a fundamental building block in the Natural Language Processing (NLP) pipeline. Incorrect SBD may degrade the performance of subsequent processing stages. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient for detecting sentence boundaries. However, a corpus like MEDLINE citations presents challenges for SBD due to several syntactic ambiguities, e.g., periods that end abbreviations and capital letters in the first words of sentences. In this manuscript we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
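The abstract does not spell out how the votes are combined, so the following is only a minimal sketch of one plausible scheme: each engine proposes sentence-end character offsets, and an offset is kept when a majority of engines agree on it. The three rule-based splitters are hypothetical stand-ins, not NLTK, pySBD, or Syntok, and the names `boundary_offsets` and `majority_vote_split` are invented for this illustration. The paper's post-processing rules are not modeled here.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Set

Splitter = Callable[[str], List[str]]


def capital_rule(text: str) -> List[str]:
    # Toy engine: split after . ! ? when an uppercase letter follows.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def abbrev_rule(text: str) -> List[str]:
    # Toy engine: split after . ! ? unless the period belongs to "Dr.".
    return re.split(r"(?<!Dr\.)(?<=[.!?])\s+", text)


def short_token_rule(text: str) -> List[str]:
    # Toy engine: like capital_rule, but treats any two-letter token
    # before a period as an abbreviation (so it tends to under-split).
    return re.split(r"(?<!\b\w\w\.)(?<=[.!?])\s+(?=[A-Z])", text)


def boundary_offsets(text: str, sentences: List[str]) -> Set[int]:
    """Map an engine's sentence strings back to end offsets in `text`."""
    offsets, cursor = set(), 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        start = text.find(sentence, cursor)
        if start == -1:
            continue  # the engine altered the text; ignore this sentence
        cursor = start + len(sentence)
        offsets.add(cursor)
    return offsets


def majority_vote_split(
    text: str, engines: List[Splitter], min_votes: Optional[int] = None
) -> List[str]:
    """Keep only the boundaries proposed by at least `min_votes` engines."""
    if min_votes is None:
        min_votes = len(engines) // 2 + 1  # strict majority
    votes = Counter()
    for engine in engines:
        votes.update(boundary_offsets(text, engine(text)))
    accepted = sorted(o for o, n in votes.items() if n >= min_votes)

    sentences, start = [], 0
    for end in accepted:
        piece = text[start:end].strip()
        if piece:
            sentences.append(piece)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


TEXT = "Dr. Smith et al. measured it. Results held."
ENGINES = [capital_rule, abbrev_rule, short_token_rule]
print(majority_vote_split(TEXT, ENGINES))
# ['Dr. Smith et al. measured it.', 'Results held.']
```

In this toy input each stand-in engine makes a different mistake (splitting after "Dr.", splitting after "al.", or missing the split after "it."), yet only the boundary that two of the three engines agree on survives the vote. Real engines could be plugged in by wrapping each one in a function that takes a string and returns a list of sentence strings.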
