Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.
AMIA Annu Symp Proc. 2022 Feb 21;2021:677-686. eCollection 2021.
Sentence boundary detection (SBD) is a fundamental building block of the Natural Language Processing (NLP) pipeline. Incorrect SBD can degrade the performance of subsequent processing stages. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient to detect sentence boundaries. However, a corpus such as MEDLINE citations presents challenges for SBD because of several syntactic ambiguities, e.g., periods that end abbreviations and capital letters in the first words of sentences. In this manuscript we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
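The majority-voting step named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three engines' outputs are aligned, so this sketch assumes one plausible scheme: each engine's sentences are mapped to character end offsets in the original text, and an offset is kept as a boundary when at least two engines agree on it. The three splitters below are hypothetical regex stand-ins so that the example runs without third-party packages; they are not NLTK, pySBD, or Syntok, and the post-processing rules are not shown.

```python
import re
from collections import Counter
from typing import Callable, List, Sequence, Set

# Any callable that maps raw text to a list of sentence strings.
Splitter = Callable[[str], List[str]]

# Tiny illustrative abbreviation list (an assumption, not from the paper).
ABBREVIATIONS = {"i.v.", "E."}


def naive_splitter(text: str) -> List[str]:
    """Stand-in engine: split after every ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def capital_aware_splitter(text: str) -> List[str]:
    """Stand-in engine: split only when the next word starts with a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def abbreviation_aware_splitter(text: str) -> List[str]:
    """Stand-in engine: naive split, then re-join pieces ending in an abbreviation."""
    merged, carry = [], ""
    for piece in naive_splitter(text):
        candidate = f"{carry} {piece}".strip()
        if not candidate:
            continue
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate  # not a real boundary; keep accumulating
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        merged.append(carry)
    return merged


def boundary_offsets(text: str, sentences: Sequence[str]) -> Set[int]:
    """Map one engine's sentences to the set of their end offsets in `text`."""
    offsets, cursor = set(), 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        start = text.find(sentence, cursor)
        if start == -1:
            continue  # engine altered the text; this sketch simply skips it
        cursor = start + len(sentence)
        offsets.add(cursor)
    return offsets


def majority_vote_split(
    text: str, engines: Sequence[Splitter], min_votes: int = 2
) -> List[str]:
    """Keep a boundary only if at least `min_votes` engines propose it."""
    votes = Counter()
    for engine in engines:
        votes.update(boundary_offsets(text, engine(text)))
    ends = sorted(offset for offset, n in votes.items() if n >= min_votes)
    text_end = len(text.rstrip())
    if not ends or ends[-1] < text_end:
        ends.append(text_end)  # the end of the text always closes a sentence
    sentences, start = [], 0
    for end in ends:
        chunk = text[start:end].strip()
        if chunk:
            sentences.append(chunk)
        start = end
    return sentences


example = (
    "Patients received 5 mg i.v. daily. "
    "E. coli was isolated in 3 cases. Outcomes improved."
)
engines = [naive_splitter, capital_aware_splitter, abbreviation_aware_splitter]
for sentence in majority_vote_split(example, engines):
    print(sentence)
```

On this example the naive stand-in proposes spurious boundaries after "i.v." and "E.", but each of those receives only one vote, so the voted output is three sentences. To try the same scheme with the engines named in the abstract, each library would be wrapped in a callable that returns a list of sentence strings.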