Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan.
PLoS One. 2013 Aug 21;8(8):e68178. doi: 10.1371/journal.pone.0068178. eCollection 2013.
Text tokenization is a fundamental pre-processing step for almost all information processing applications. This task is nontrivial for resource-scarce languages such as Urdu, where spaces between words are used inconsistently. In this paper, a morpheme-matching-based approach is proposed for Urdu text tokenization, along with additional algorithms to address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list with 6,400 entries.
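To make the core idea concrete, below is a minimal sketch of morpheme-list-based segmentation using a greedy longest-match strategy, a common baseline for this family of tokenizers. The morpheme list, the Latin-script placeholder string, and the greedy strategy are illustrative assumptions; the paper's actual system additionally handles compound words, affixes, reduplication, names, and abbreviations, which this sketch omits.

```python
# Sketch of longest-match morpheme segmentation (assumed baseline, not the
# paper's exact algorithm). Matches the longest known morpheme at each
# position; falls back to a single character when nothing matches.

def tokenize(text: str, morphemes: set[str]) -> list[str]:
    max_len = max((len(m) for m in morphemes), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate window first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in morphemes:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens

if __name__ == "__main__":
    # Hypothetical entries, Latin placeholders standing in for Urdu script.
    morphemes = {"kitab", "khana", "mein"}
    print(tokenize("kitabkhanamein", morphemes))
    # -> ['kitab', 'khana', 'mein']
```

Greedy longest match recovers word boundaries even when spaces are missing or misplaced, which is the central difficulty in Urdu text that the paper targets; its known weakness is over-segmenting when a long spurious match shadows the correct shorter one, which is where the paper's extra rules for compounds and affixation come in.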