
Paraphrase detection for Urdu language text using fine-tune BiLSTM framework.

Authors

Aslam Muhammad Ali, Khan Khairullah, Khan Wahab, Khan Sajid Ullah, Albanyan Abdullah, Algamdi Shabbab Ali

Affiliations

Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.

Department of Information Systems, College of Computer Engineering and Sciences, Prince Sattam Bin Abdul Aziz University, Al-Kharj, Kingdom of Saudi Arabia.

Publication information

Sci Rep. 2025 May 2;15(1):15383. doi: 10.1038/s41598-025-93260-6.

DOI:10.1038/s41598-025-93260-6
PMID:40316633
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12048677/
Abstract

Automated paraphrase detection is crucial for natural language processing (NLP) applications like text summarization, plagiarism detection, and question-answering systems. Detecting paraphrases in Urdu text remains challenging due to the language's complex morphology, distinctive script, and lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a novel bidirectional long short-term memory (BiLSTM) framework to address Urdu paraphrase detection's intricacies. Our approach employs word embeddings and text preprocessing techniques like tokenization, stop-word removal, and label encoding to effectively handle Urdu's morphological variations. The BiLSTM network sequentially processes the input, leveraging both forward and backward contextual information to encode the complex syntactic and semantic patterns inherent in Urdu text. An essential contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential sentence pair duplicates, with 150,000 pairs manually identified as paraphrases. Our findings reveal a significant improvement in paraphrase detection performance compared to existing methods. We provide insights into the underlying linguistic features and patterns that contribute to the robustness of our framework. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the custom UPC dataset demonstrate our BiLSTM model's superiority, achieving 94.14% accuracy and outperforming state-of-the-art methods like CNN (83.43%) and LSTM (88.09%). Our model attains an impressive 95.34% accuracy on the benchmark Quora dataset. Furthermore, we incorporate a comprehensive linguistic rule engine to handle exceptional cases during paraphrase analysis, ensuring robust performance across diverse contexts.
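
The preprocessing steps named in the abstract (tokenization, stop-word removal, label encoding) can be illustrated with a minimal sketch. This is not the authors' code: the stop-word list, the punctuation set, the whitespace tokenizer, the padding scheme, the sample sentences, and every function name below are assumptions chosen for illustration only.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a curated Urdu list.
URDU_STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور", "سے"}

# Assumed punctuation set: common Urdu marks plus a few ASCII ones.
PUNCTUATION = re.compile(r"[۔،؟!?.,;:()]")


def tokenize(sentence):
    """Replace punctuation with spaces, then split on whitespace (a simplistic scheme)."""
    return PUNCTUATION.sub(" ", sentence).split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in URDU_STOP_WORDS]


def build_vocab(token_lists):
    """Map each token to an integer id; 0 is reserved for padding, 1 for unknown tokens."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with the padding id."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def encode_labels(labels):
    """Label encoding: assign each distinct class name an integer in sorted order."""
    classes = sorted(set(labels))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


# Illustrative sentence pairs (invented for this sketch, not taken from the UPC corpus).
pairs = [
    ("وہ اسکول میں پڑھتا ہے۔", "وہ اسکول کا طالب علم ہے۔", "paraphrase"),
    ("وہ اسکول میں پڑھتا ہے۔", "آج موسم اچھا ہے۔", "non-paraphrase"),
]

MAX_LEN = 6
token_pairs = [
    (remove_stop_words(tokenize(a)), remove_stop_words(tokenize(b)))
    for a, b, _ in pairs
]
vocab = build_vocab([tokens for pair in token_pairs for tokens in pair])
encoded_pairs = [
    (encode(a, vocab, MAX_LEN), encode(b, vocab, MAX_LEN)) for a, b in token_pairs
]
label_ids, label_map = encode_labels([label for _, _, label in pairs])
```

In a setup like the one the abstract describes, the padded id sequences would then be passed through a word-embedding layer and a BiLSTM encoder, with the integer labels as training targets; those stages are omitted here because the abstract does not specify them.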


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/a2d8ac813843/41598_2025_93260_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/0afd522692fc/41598_2025_93260_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/469187aa936a/41598_2025_93260_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/3d7cb3ac150e/41598_2025_93260_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/c5ee9c841e96/41598_2025_93260_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/08e2658073db/41598_2025_93260_Fig6_HTML.jpg

Similar articles

1
Paraphrase detection for Urdu language text using fine-tune BiLSTM framework.
Sci Rep. 2025 May 2;15(1):15383. doi: 10.1038/s41598-025-93260-6.
2
A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu.
Data Brief. 2023 Nov 26;52:109857. doi: 10.1016/j.dib.2023.109857. eCollection 2024 Feb.
3
A deep learning approach for Named Entity Recognition in Urdu language.
PLoS One. 2024 Mar 28;19(3):e0300725. doi: 10.1371/journal.pone.0300725. eCollection 2024.
4
Siamese-Based Architecture for Cross-Lingual Plagiarism Detection in English-Hindi Language Pairs.
Big Data. 2023 Feb;11(1):48-58. doi: 10.1089/big.2020.0243. Epub 2022 Oct 18.
5
Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications.
Sensors (Basel). 2023 Apr 12;23(8):3909. doi: 10.3390/s23083909.
6
A hybrid dependency-based approach for Urdu sentiment analysis.
Sci Rep. 2023 Dec 12;13(1):22075. doi: 10.1038/s41598-023-48817-8.
7
Cyberbullying detection: advanced preprocessing techniques & deep learning architecture for Roman Urdu data.
J Big Data. 2021;8(1):160. doi: 10.1186/s40537-021-00550-7. Epub 2021 Dec 22.
8
Improving data augmentation for low resource speech-to-text translation with diverse paraphrasing.
Neural Netw. 2022 Apr;148:194-205. doi: 10.1016/j.neunet.2022.01.016. Epub 2022 Feb 1.
9
Multi-label emotion classification of Urdu tweets.
PeerJ Comput Sci. 2022 Apr 22;8:e896. doi: 10.7717/peerj-cs.896. eCollection 2022.
10
DeBERTa-BiLSTM: A multi-label classification model of Arabic medical questions using pre-trained models and deep learning.
Comput Biol Med. 2024 Mar;170:107921. doi: 10.1016/j.compbiomed.2024.107921. Epub 2024 Jan 4.
