Correcting spelling mistakes in Persian texts with rules and deep learning methods.

Authors

Kasmaiee Sa, Kasmaiee Si, Homayounpour M

Affiliation

Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran.

Publication

Sci Rep. 2023 Nov 15;13(1):19945. doi: 10.1038/s41598-023-47295-2.

DOI: 10.1038/s41598-023-47295-2
PMID: 37968293
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10652024/
Abstract

This study aims to develop a system for automatically correcting spelling errors in Persian texts using two approaches: one that relies on rules and a list of common spelling mistakes, and another that uses a deep neural network. In the rule-based approach, a list of 700 common misspellings was compiled, and a database of 55,000 common Persian words was used to identify spelling errors. A total of 112 rules were implemented for spelling correction, each providing suggested words for misspelled words. The approach was evaluated on 2,500 sentences, with the suggested word that had the shortest Levenshtein distance selected for evaluation. In the deep learning approach, the base network was a deep encoder-decoder built on long short-term memory (LSTM) with a word embedding layer, for which FastText was chosen. The base network was then extended with convolutional and capsule layers. A database of 1.2 million sentences was created, with 800,000 for training, 200,000 for testing, and 200,000 for evaluation. The results showed that the network with capsule and convolutional layers performed similarly to the base network. The network performed well in evaluation, achieving accuracy, precision, recall, F-measure, and bilingual evaluation understudy (BLEU) scores of 87%, 70%, 89%, 78%, and 84%, respectively.
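
The selection step in the rule-based approach and the reported F-measure can be illustrated with a short Python sketch. This is not the authors' implementation. It assumes that the Levenshtein distance is measured between the misspelled word and each suggested word, that ties go to the first suggestion, and that the F-measure is the balanced F1 score. The function names and the English toy words are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def closest_suggestion(misspelled: str, suggestions: list[str]) -> str:
    """Pick the suggestion nearest to the misspelled word (assumed tie rule: first wins)."""
    return min(suggestions, key=lambda word: levenshtein(misspelled, word))


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy input; the paper works on Persian words.
print(closest_suggestion("speling", ["spelling", "spoiling", "sapling"]))

# Precision and recall as reported in the abstract (70% and 89%).
print(round(f_measure(0.70, 0.89), 2))
```

Under the F1 assumption, the reported precision and recall give about 0.78, which agrees with the 78% F-measure stated in the abstract.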

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/7cf2ab350a75/41598_2023_47295_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/6a16856e43a4/41598_2023_47295_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/c3f9bafbe7c6/41598_2023_47295_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/90e71a19999b/41598_2023_47295_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/9a8cd4a5123d/41598_2023_47295_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/4a24d5db5092/41598_2023_47295_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/c16602fc1a3a/41598_2023_47295_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/cda6935fc096/41598_2023_47295_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/7723294d9c16/41598_2023_47295_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/915e37734026/41598_2023_47295_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/ba8399382819/41598_2023_47295_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/9f52fa016371/41598_2023_47295_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4fdb/10652024/9a5ecee998aa/41598_2023_47295_Fig13_HTML.jpg

Similar articles

1. Correcting spelling mistakes in Persian texts with rules and deep learning methods.
   Sci Rep. 2023 Nov 15;13(1):19945. doi: 10.1038/s41598-023-47295-2.
2. Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports.
   JMIR Med Inform. 2021 Feb 22;9(2):e25530. doi: 10.2196/25530.
3. Improving the quality of Persian clinical text with a novel spelling correction system.
   BMC Med Inform Decis Mak. 2024 Aug 5;24(1):220. doi: 10.1186/s12911-024-02613-0.
4. Persian sentiment analysis of an online store independent of pre-processing using convolutional neural network with fastText embeddings.
   PeerJ Comput Sci. 2021 Mar 5;7:e422. doi: 10.7717/peerj-cs.422. eCollection 2021.
5. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance.
   BMC Med Res Methodol. 2022 Jul 2;22(1):181. doi: 10.1186/s12874-022-01665-y.
6. Hybrid of Deep Learning and Word Embedding in Generating Captions: Image-Captioning Solution for Geological Rock Images.
   J Imaging. 2022 Oct 22;8(11):294. doi: 10.3390/jimaging8110294.
7. Context-Sensitive Spelling Correction of Consumer-Generated Content on Health Care.
   JMIR Med Inform. 2015 Jul 31;3(3):e27. doi: 10.2196/medinform.4211.
8. An efficient prototype method to identify and correct misspellings in clinical text.
   BMC Res Notes. 2019 Jan 18;12(1):42. doi: 10.1186/s13104-019-4073-y.
9. RadioBERT: A deep learning-based system for medical report generation from chest X-ray images using contextual embeddings.
   J Biomed Inform. 2022 Nov;135:104220. doi: 10.1016/j.jbi.2022.104220. Epub 2022 Oct 10.
10. A deep learning approach in predicting products' sentiment ratings: a comparative analysis.
   J Supercomput. 2022;78(5):7206-7226. doi: 10.1007/s11227-021-04169-6. Epub 2021 Nov 5.

Cited by

1. Unsteady CFD simulation of a rotor blade under various wind conditions.
   Sci Rep. 2024 Aug 19;14(1):19176. doi: 10.1038/s41598-024-70350-5.
2. Use of sentiment analysis for capturing hospitalized cancer patients' experience from free-text comments in the Persian language.
   BMC Med Inform Decis Mak. 2023 Nov 29;23(1):275. doi: 10.1186/s12911-023-02358-2.
