


Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language.

Author information

Nazir Shahzad, Asif Muhammad, Rehman Mariam, Ahmad Shahbaz

Affiliations

Department of Computer Science, National Textile University, Faisalabad, Pakistan.

Department of Information Technology, Government College University, Faisalabad, Faisalabad, Pakistan.

Publication information

PeerJ Comput Sci. 2024 Jan 31;10:e1704. doi: 10.7717/peerj-cs.1704. eCollection 2024.

DOI:10.7717/peerj-cs.1704
PMID:39669469
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11636738/
Abstract

In text applications, pre-processing is a significant factor in improving the outcomes of natural language processing (NLP) tasks. Text normalization and tokenization are two pivotal pre-processing procedures whose importance cannot be overstated. Text normalization transforms raw text into scripturally standardized text, while word tokenization splits the text into tokens or words. Well-defined normalization and tokenization approaches exist for most widely spoken languages in the world; however, Urdu, the world's 10th most widely spoken language, has been overlooked by the research community. This research presents improved text normalization and tokenization techniques for the Urdu language. For Urdu text normalization, multiple regular expressions and rules are proposed, including removing diacritics, normalizing single characters, and separating digits. For word tokenization, core features are defined and extracted for each character of the text, and a machine learning model combined with specified handcrafted rules predicts spaces to tokenize the text. The experiments are performed on a newly created human-annotated dataset, the largest composed in Urdu script, covering five different domains. The results are evaluated using precision, recall, F-measure, and accuracy, and compared with the state of the art: the normalization approach yields a 20% improvement and the tokenization approach a 6% improvement.
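The normalization rule types named in the abstract (diacritic removal, single-character normalization, digit separation) can be sketched with regular expressions. This is a minimal illustrative sketch, not the authors' exact rule set: the diacritic range, the character map, and the `normalize` function are assumptions.

```python
import re

# Arabic-script diacritics (harakat) occupy U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

# Example single-character normalization: map Arabic code points to their
# standard Urdu counterparts (Arabic Yeh U+064A -> Urdu Yeh U+06CC,
# Arabic Kaf U+0643 -> Urdu Kaf U+06A9). Illustrative subset only.
CHAR_MAP = {"\u064A": "\u06CC", "\u0643": "\u06A9"}

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)            # remove diacritics
    for src, dst in CHAR_MAP.items():          # normalize single characters
        text = text.replace(src, dst)
    # separate digits glued to letters: insert a space at digit/non-digit edges
    text = re.sub(r"(?<=\d)(?=\D)|(?<=\D)(?=\d)", " ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```

For example, `normalize("\u0643\u062A\u0627\u0628\u064E2023")` strips the fatha, rewrites the Arabic Kaf, and splits the trailing year off the word.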

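The tokenization scheme described in the abstract (core features extracted per character, fed to a classifier that predicts whether a space follows) can be sketched as a feature-extraction step. The feature names and the non-joiner letter set below are illustrative assumptions, not the paper's exact feature design.

```python
# Urdu letters that do not join to the following letter — a natural
# word-boundary cue in cursive Arabic script (illustrative subset):
# alef, dal, ddal, re, rre, ze, zhe, waw, bari ye.
NON_JOINERS = set("\u0627\u062F\u0688\u0631\u0691\u0632\u0698\u0648\u06D2")

def char_features(text: str, i: int) -> dict:
    """Features for character i, to feed a space/no-space classifier."""
    ch = text[i]
    return {
        "char": ch,
        "is_non_joiner": ch in NON_JOINERS,   # cannot connect forward
        "is_digit": ch.isdigit(),
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i + 1 < len(text) else "</s>",
        "at_end": i == len(text) - 1,
    }

# Each feature dict would be vectorized and passed to the trained model,
# whose insert-space / no-space predictions are then overridden by the
# handcrafted rules in known-ambiguous contexts.
feats = [char_features("\u0627\u0631\u062F\u0648", i) for i in range(4)]
```

The example extracts features for the four characters of the word اردو ("Urdu"); a real pipeline would train on boundary-labeled text from the annotated corpus.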

Figures (PMC full-size images):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/f6e9c330cca9/peerj-cs-10-1704-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/68b4edbc4759/peerj-cs-10-1704-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/8f4e704789f4/peerj-cs-10-1704-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/c8842fccab78/peerj-cs-10-1704-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/5bab6d69d21b/peerj-cs-10-1704-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/b9bf32c69550/peerj-cs-10-1704-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6012/11636738/67fb2ff48029/peerj-cs-10-1704-g007.jpg

Similar articles

1. Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language. PeerJ Comput Sci. 2024 Jan 31;10:e1704. doi: 10.7717/peerj-cs.1704. eCollection 2024.
2. Morpheme matching based text tokenization for a scarce resourced language. PLoS One. 2013 Aug 21;8(8):e68178. doi: 10.1371/journal.pone.0068178. eCollection 2013.
3. Paraphrase detection for Urdu language text using fine-tune BiLSTM framework. Sci Rep. 2025 May 2;15(1):15383. doi: 10.1038/s41598-025-93260-6.
4. Effect of tokenization on transformers for biological sequences. Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
5. A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features. PeerJ Comput Sci. 2024 Dec 11;10:e2577. doi: 10.7717/peerj-cs.2577. eCollection 2024.
6. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. J Cheminform. 2015 Jan 19;7(Suppl 1):S14. doi: 10.1186/1758-2946-7-S1-S14. eCollection 2015.
7. Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization. Sci Rep. 2024 Nov 19;14(1):28590. doi: 10.1038/s41598-024-79106-7.
8. Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images. Data Brief. 2020 May 21;31:105749. doi: 10.1016/j.dib.2020.105749. eCollection 2020 Aug.
9. An unsupervised machine learning approach to segmentation of clinician-entered free text. AMIA Annu Symp Proc. 2007 Oct 11;2007:811-5.
10. Abstractive text summarization of low-resourced languages using deep learning. PeerJ Comput Sci. 2023 Jan 13;9:e1176. doi: 10.7717/peerj-cs.1176. eCollection 2023.

Cited by

1. The usage of a transformer based and artificial intelligence driven multidimensional feedback system in english writing instruction. Sci Rep. 2025 Jun 2;15(1):19268. doi: 10.1038/s41598-025-05026-9.
2. Comparative analysis of text-based plagiarism detection techniques. PLoS One. 2025 Apr 8;20(4):e0319551. doi: 10.1371/journal.pone.0319551. eCollection 2025.

References

1. Important citation identification by exploiting content and section-wise in-text citation count. PLoS One. 2020 Mar 5;15(3):e0228885. doi: 10.1371/journal.pone.0228885. eCollection 2020.