iSentenizer-μ: multilingual sentence boundary detection model.

Author information

Wong Derek F, Chao Lidia S, Zeng Xiaodong

Affiliation

NLPCT Laboratory, Department of Computer and Information Science, University of Macau, Macau.

Publication information

ScientificWorldJournal. 2014;2014:196574. doi: 10.1155/2014/196574. Epub 2014 Apr 15.

Abstract

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data they are trained on. Genre here covers shifts in text topic as well as new language domains. Although a new detection model can be trained for a different language or a new text genre, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under the incremental learning framework, iSentenizer-μ adapts to text of different topics and to Roman-alphabet languages by merging new data into the existing model, so that new knowledge is learned incrementally by revision instead of by retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
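The idea of merging new data into an existing model, rather than retraining from scratch, can be illustrated with a minimal sketch. This is not the i+Learning algorithm and not the authors' implementation: it swaps in a much simpler count-based voter, and the feature set, the class name `IncrementalBoundaryModel`, and the scoring rule are all assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Candidate boundaries: '.', '!' or '?' followed by whitespace or end of text.
CANDIDATE = re.compile(r"[.!?](?=\s|$)")


def features(text, pos):
    """Context features for the punctuation mark at index pos (illustrative choice)."""
    before = text[:pos].split()
    prev_tok = before[-1] if before else ""
    after = text[pos + 1:].split()
    next_tok = after[0] if after else ""
    return (
        "prev=" + prev_tok.lower(),
        "prev_len_le_3=" + str(len(prev_tok) <= 3),
        "next_upper=" + str(next_tok[:1].isupper()),
    )


class IncrementalBoundaryModel:
    """Hypothetical count-based stand-in for an incrementally revised model."""

    def __init__(self):
        # counts[feature] = [times seen at a non-boundary, times seen at a boundary]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, text, boundary_positions):
        """Fold one labelled text into the existing counts; nothing is discarded."""
        gold = set(boundary_positions)
        for m in CANDIDATE.finditer(text):
            label = int(m.start() in gold)
            for f in features(text, m.start()):
                self.counts[f][label] += 1

    def is_boundary(self, text, pos):
        score = 0.0
        for f in features(text, pos):
            non_boundary, boundary = self.counts.get(f, (0, 0))
            # Each feature casts a smoothed vote strictly between -1 and 1.
            score += (boundary - non_boundary) / (boundary + non_boundary + 1)
        return score > 0

    def split(self, text):
        sentences, start = [], 0
        for m in CANDIDATE.finditer(text):
            if self.is_boundary(text, m.start()):
                sentences.append(text[start:m.end()].strip())
                start = m.end()
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences
```

In this sketch, calling `update` first on English text and later on German text revises the same set of counts, so what was learned from the earlier data is retained while the new language is absorbed.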

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9d60/4030568/e594c4e92e26/TSWJ2014-196574.001.jpg
