Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection.

Affiliation

Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894.

Publication information

AMIA Annu Symp Proc. 2022 Feb 21;2021:677-686. eCollection 2021.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8861722/

Abstract

Sentence boundary detection (SBD) is a fundamental building block in the Natural Language Processing (NLP) pipeline. Incorrect SBD may impact subsequent processing stages, resulting in decreased performance. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient for detecting sentence boundaries. However, a corpus like MEDLINE citations presents challenges for SBD due to several syntactic ambiguities, e.g., abbreviation periods and capital letters in first words of sentences. In this manuscript we present an algorithm that addresses these challenges through majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that our proposed approach of combining multiple SBD engines with post-processing rules performs better than each individual engine.
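The majority-voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the three "engines" are simple regex stand-ins (not NLTK, pySBD, or Syntok), each returning the character offsets at which it believes a sentence ends, and the names `majority_vote_boundaries`, `split_at`, and `ABBREVIATIONS` are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

# An "engine" here is any function mapping text to sentence-end character offsets.
# These are hypothetical stand-ins, NOT the engines used in the paper.
Engine = Callable[[str], Iterable[int]]

ABBREVIATIONS = {"vs", "e.g", "i.e", "dr", "al"}  # illustrative list only


def naive_engine(text: str) -> List[int]:
    """Boundary after every '.', '!' or '?' that is followed by whitespace."""
    return [m.end() for m in re.finditer(r"[.!?](?=\s)", text)]


def capital_engine(text: str) -> List[int]:
    """Boundary only when the next word starts with an uppercase letter."""
    return [m.end() for m in re.finditer(r"[.!?](?=\s+[A-Z])", text)]


def abbrev_engine(text: str) -> List[int]:
    """Like naive_engine, but skips periods that end a known abbreviation."""
    offsets = []
    for m in re.finditer(r"[.!?](?=\s)", text):
        preceding = text[:m.start()].split()
        last_token = preceding[-1].lower() if preceding else ""
        if m.group() == "." and last_token in ABBREVIATIONS:
            continue
        offsets.append(m.end())
    return offsets


ENGINES: List[Engine] = [naive_engine, capital_engine, abbrev_engine]


def majority_vote_boundaries(text: str, engines: List[Engine],
                             min_votes: int = 2) -> List[int]:
    """Keep offsets proposed by at least `min_votes` engines."""
    votes: Counter = Counter()
    for engine in engines:
        votes.update(set(engine(text)))  # each engine votes once per offset
    return sorted(off for off, n in votes.items() if n >= min_votes)


def split_at(text: str, boundaries: List[int]) -> List[str]:
    """Cut the text at the given offsets and drop empty segments."""
    sentences, start = [], 0
    for end in boundaries:
        segment = text[start:end].strip()
        if segment:
            sentences.append(segment)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "We compared drug A vs. placebo. Results were stable."
print(split_at(text, majority_vote_boundaries(text, ENGINES)))
# prints ['We compared drug A vs. placebo.', 'Results were stable.']
```

In this toy input only one stand-in engine splits after "vs.", so that boundary is outvoted. Voting alone cannot help when two engines share the same mistake, which is consistent with the abstract's addition of post-processing rules after the vote.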
