

Paraphrase detection for Urdu language text using fine-tune BiLSTM framework.

Authors

Aslam Muhammad Ali, Khan Khairullah, Khan Wahab, Khan Sajid Ullah, Albanyan Abdullah, Algamdi Shabbab Ali

Affiliations

Department of Computer Science, University of Science and Technology, Bannu, 28100, Pakistan.

Department of Information Systems, College of Computer Engineering and Sciences, Prince Sattam Bin Abdul Aziz University, Al-Kharj, Kingdom of Saudi Arabia.

Publication information

Sci Rep. 2025 May 2;15(1):15383. doi: 10.1038/s41598-025-93260-6.

Abstract

Automated paraphrase detection is crucial for natural language processing (NLP) applications such as text summarization, plagiarism detection, and question answering. Detecting paraphrases in Urdu text remains challenging because of the language's complex morphology, its distinctive script, and a lack of resources such as labelled datasets, pre-trained models, and tailored NLP tools. This research proposes a bidirectional long short-term memory (BiLSTM) framework to address the intricacies of Urdu paraphrase detection. Our approach employs word embeddings and text preprocessing techniques (tokenization, stop-word removal, and label encoding) to handle Urdu's morphological variations. The BiLSTM network processes the input sequentially, using both forward and backward context to encode the syntactic and semantic patterns of Urdu text. A key contribution of this work is the creation of a large-scale Urdu Paraphrased Corpus (UPC) comprising 400,000 potential duplicate sentence pairs, of which 150,000 were manually identified as paraphrases. This resource facilitates training and evaluating Urdu paraphrase detection models. Experimental evaluations on the custom UPC dataset show that our BiLSTM model achieves 94.14% accuracy, outperforming CNN (83.43%) and LSTM (88.09%) baselines. On the benchmark Quora dataset, the model attains 95.34% accuracy. We also provide insights into the linguistic features and patterns that contribute to the robustness of the framework. Furthermore, we incorporate a linguistic rule engine to handle exceptional cases during paraphrase analysis, supporting robust performance across diverse contexts.
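The preprocessing steps named in the abstract (tokenization, stop-word removal, and label encoding) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the stop-word list, the punctuation set, the function names, and the padding scheme are hypothetical and are not taken from the paper, which does not specify these details in the abstract.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only (not the paper's list).
STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "سے", "اور"}

# Urdu and Latin punctuation to strip before splitting (assumed set).
PUNCTUATION = re.compile(r'[۔،؟!?.,:;"()]')


def tokenize(sentence):
    """Replace punctuation with spaces, then split on whitespace."""
    return PUNCTUATION.sub(" ", sentence).split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Map tokens to integer ids; 0 is padding and 1 is unknown."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(t, 1) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


def encode_labels(labels):
    """Label encoding: map each distinct class name to an integer."""
    mapping = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Example: one sentence pair, prepared as two fixed-length id sequences.
pair = ("یہ کتاب اچھی ہے۔", "یہ کتاب بہت اچھی ہے۔")
tokens_a, tokens_b = (remove_stop_words(tokenize(s)) for s in pair)
vocab = build_vocab([tokens_a, tokens_b])
ids_a = encode(tokens_a, vocab, max_len=6)
ids_b = encode(tokens_b, vocab, max_len=6)
```

In a setup like this, the two padded id sequences would typically be passed through an embedding layer and a bidirectional LSTM encoder before classification; that part is omitted here because the abstract does not describe the network configuration.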


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f4a/12048677/a2d8ac813843/41598_2025_93260_Fig1_HTML.jpg
