iSentenizer-μ: multilingual sentence boundary detection model.

Author information

Wong Derek F, Chao Lidia S, Zeng Xiaodong

Affiliation

NLPCT Laboratory, Department of Computer and Information Science, University of Macau, Macau.

Publication information

ScientificWorldJournal. 2014;2014:196574. doi: 10.1155/2014/196574. Epub 2014 Apr 15.

DOI:10.1155/2014/196574

PMID:24883358

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4030568/

Abstract

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here covers shifts in text topic as well as new language domains. Although a new detection model can be trained for a different language or a new text genre, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of different text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under the incremental learning framework, iSentenizer-μ adapts to text on different topics and in Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
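
The idea of merging new data into an existing model, rather than retraining it, can be illustrated with a toy sketch. The Python below is not the authors' i+Learning algorithm, which the abstract describes as an incremental tree learner. It swaps in a plain count table over a few hand-picked context features, only to show how a second batch of labelled text (here in another language) can revise a model built from a first batch. The class name `IncrementalSBD`, the `features` function, and the feature set itself are assumptions made for illustration.

```python
import re
from collections import defaultdict

CANDIDATE = re.compile(r"[.!?]")  # marks that may end a sentence


def features(text, i):
    """Describe the context of the candidate punctuation mark at index i."""
    left = re.search(r"(\S*)$", text[:i]).group(1)        # token before the mark
    right = re.match(r"\s*(\S*)", text[i + 1:]).group(1)  # token after the mark
    followed_by_gap = text[i + 1:i + 2].isspace() or i + 1 == len(text)
    return (
        text[i],              # the punctuation character itself
        len(left) <= 3,       # short left token: possibly an abbreviation
        left[:1].isupper(),   # left token starts with a capital
        right[:1].isupper(),  # right token starts with a capital
        followed_by_gap,      # mark is followed by whitespace or end of text
    )


class IncrementalSBD:
    """Toy count-based detector (an assumption, not the paper's model)."""

    def __init__(self):
        # counts[feature_tuple] = [non_boundary_count, boundary_count]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, sentences):
        """Merge a batch of gold sentences into the existing counts."""
        text = " ".join(sentences)
        gold, offset = set(), 0
        for s in sentences:
            gold.add(offset + len(s) - 1)  # index of the sentence-final mark
            offset += len(s) + 1           # skip the joining space
        for m in CANDIDATE.finditer(text):
            label = int(m.start() in gold)
            self.counts[features(text, m.start())][label] += 1

    def is_boundary(self, text, i):
        non_b, b = self.counts.get(features(text, i), (0, 0))
        if non_b == b:
            # Unseen or tied context: fall back to "mark followed by a gap".
            return text[i + 1:i + 2].isspace() or i + 1 == len(text)
        return b > non_b

    def split(self, text):
        sentences, start = [], 0
        for m in CANDIDATE.finditer(text):
            if self.is_boundary(text, m.start()):
                sentences.append(text[start:m.end()].strip())
                start = m.end()
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


model = IncrementalSBD()
model.update(["Dr. Smith arrived.", "He was late."])               # first batch
model.update(["Das kostet z.B. zwei Euro.", "Er kam am 3. Mai."])  # later batch
print(model.split("Mr. Jones left. Nobody spoke."))
print(model.split("Sie kam am 5. Juni."))
```

An untrained instance falls back to splitting at every mark followed by whitespace, so it would break after "Mr." and after "5."; once the two batches have been merged, those abbreviation and ordinal contexts are no longer split, and the second call to `update` leaves what was learned from the first batch in place. A real system would need far richer features and a proper learner; the count table is only the simplest structure that supports this kind of revision.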
