Morpheme matching based text tokenization for a scarce resourced language.

Affiliation

Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan.

Publication information

PLoS One. 2013 Aug 21;8(8):e68178. doi: 10.1371/journal.pone.0068178. eCollection 2013.

Abstract

Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme-matching approach to Urdu text tokenization, along with additional algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The approach achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list of 6,400 entries.
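To make the idea of morpheme matching concrete, the following is a minimal sketch, not the authors' algorithm. The abstract does not state the matching strategy, so greedy forward longest-match is an assumption here, as are the function names `segment` and `f1_score` and the toy English morpheme list (used in place of Urdu for readability). The second function only restates the standard definition of the F1-measure as the harmonic mean of precision and recall.

```python
from typing import List, Set


def segment(text: str, morphemes: Set[str]) -> List[str]:
    """Greedy forward longest-match segmentation of a space-free string.

    Assumption: this strategy is illustrative only; the paper's actual
    matching procedure is not described in the abstract.
    """
    max_len = max((len(m) for m in morphemes), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a morpheme matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No listed morpheme starts here: emit one character as unknown.
            tokens.append(text[i])
            i += 1
    return tokens


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy morpheme list, for illustration only.
toy_morphemes = {"text", "token", "ization", "iz", "ation"}
tokens = segment("texttokenization", toy_morphemes)

# The reported precision and recall are consistent with the reported F1.
reported_f1 = round(f1_score(97.28, 93.71), 2)
```

With the toy list above, `tokens` is `["text", "token", "ization"]`, and `reported_f1` evaluates to `95.46`, matching the figure in the abstract. A greedy matcher of this kind can mis-segment when a long morpheme overlaps a word boundary, which is one reason the paper adds separate handling for compounds, affixes, reduplication, names, and abbreviations.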
