Eliminating Data Duplication in CQA Platforms Using Deep Neural Model.

Affiliation

School of Computing Science & Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India.

Publication information

Comput Intell Neurosci. 2022 Aug 25;2022:2067449. doi: 10.1155/2022/2067449. eCollection 2022.

DOI:10.1155/2022/2067449

PMID:36059414

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9436542/

Abstract

Prior research on detecting duplicate question pairs in community-based question answering systems relies on datasets of English questions only. This research puts forward a solution to duplicate question detection that matches semantically identical questions in transliterated bilingual data. Deep learning is used to analyze informal languages on Community Question Answering (CQA) platforms, such as Hinglish, a bilingual mix of Hindi and English, in order to identify duplicate questions. The proposed model works in two sequential modules. The first is a language transliteration module, which converts the input questions into monolingual text. The second takes the transliterated text and applies a multi-layer hybrid deep learning model to detect duplicate questions in the monolingual data. Similarity between question pairs is computed by this hybrid model, which combines a Siamese neural network, with identical capsule networks as its subnetworks, and a decision tree classifier. The Manhattan distance function is used with the Siamese network to compute the similarity between questions. The proposed model was validated on 150 question pairs scraped from social media platforms such as Tripadvisor and Quora, achieving an accuracy of 87.0885% and an AUC-ROC value of 0.86.
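To make the similarity step concrete, the following is a minimal sketch of a Manhattan-distance comparison between two question encodings, as a Siamese network might apply it to the outputs of its twin subnetworks. The abstract states only that the Manhattan distance is used; the exponential mapping to a score in (0, 1], the function names, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance between two equal-length encodings."""
    if len(a) != len(b):
        raise ValueError("encodings must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def manhattan_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the L1 distance to (0, 1]; identical encodings score 1.0.

    Assumption: exp(-distance) is one common choice for Siamese models;
    the paper's abstract does not say which mapping it uses.
    """
    return math.exp(-manhattan_distance(a, b))


# Hypothetical encodings standing in for the subnetwork outputs.
q1 = [0.2, 0.7, 0.1]
q2 = [0.2, 0.7, 0.1]  # same encoding as q1
q3 = [0.9, 0.1, 0.5]  # a different question

print(manhattan_similarity(q1, q2))
print(manhattan_similarity(q1, q3))
```

In a pipeline like the one described, a score of this kind could be passed to the downstream classifier (a decision tree in the paper) rather than compared against a fixed threshold.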
