Part-of-Speech tagging enhancement to natural language processing for Thai wh-question classification with deep learning.

Authors

Chotirat Saranlita, Meesad Phayung

Affiliations

Department of Information Technology, Faculty of Information Technology and Digital Innovation, King Mongkut's University of Technology North Bangkok, Thailand.

Department of Information Technology Management, Faculty of Information Technology and Digital Innovation, King Mongkut's University of Technology North Bangkok, Thailand.

Publication information

Heliyon. 2021 Oct 19;7(10):e08216. doi: 10.1016/j.heliyon.2021.e08216. eCollection 2021 Oct.

Abstract

Question classification is a crucial task for answer selection. It helps define the structure of a question sentence through features extracted from the sentence, such as who, when, where, and how. In this paper, we propose a methodology to improve question classification from text using feature selection and word embedding techniques. We conducted several experiments to evaluate the proposed methodology on two datasets (the TREC-6 dataset and a Thai sentence dataset), using term frequency and combined term frequency-inverse document frequency with Unigram, Unigram+Bigram, and Unigram+Trigram features. Both traditional and deep learning classifiers were used. The traditional models were Multinomial Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). The deep learning models were Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Networks (CNN), and a hybrid model combining CNN and BiLSTM. The experimental results showed that the methodology based on Part-of-Speech (POS) tagging was the most effective at improving question classification accuracy. On the TREC-6 dataset, the SVM model achieved an average micro F1-score of 0.98 when all POS tags were added. On the Thai sentence dataset, the highest average micro F1-score was 0.8, achieved by the CNN model with GloVe embeddings when focusing tags were added.
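The feature pipeline named in the abstract (POS-augmented tokens, n-gram term frequencies, TF-IDF weighting, micro F1 evaluation) can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the abstract does not say how POS tags are attached to words, so the `token_TAG` joining, the helper names (`add_pos_tags`, `term_frequencies`, `tfidf`, `micro_f1`), the hand-assigned Penn Treebank style tags, and the unsmoothed `log(N / df)` IDF are all hypothetical choices.

```python
import math
from collections import Counter


def add_pos_tags(tokens, tags):
    """Attach a POS tag to each token (assumed scheme: 'who' + 'WP' -> 'who_WP')."""
    return [f"{tok}_{tag}" for tok, tag in zip(tokens, tags)]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(tokens, orders=(1, 2)):
    """Count n-grams for each order, e.g. (1, 2) for Unigram+Bigram features."""
    feats = Counter()
    for n in orders:
        feats.update(ngrams(tokens, n))
    return feats


def tfidf(docs_tf):
    """Weight raw term counts by an unsmoothed IDF, log(N / document frequency)."""
    n_docs = len(docs_tf)
    df = Counter()
    for tf in docs_tf:
        df.update(tf.keys())  # each document counts once per term
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in docs_tf
    ]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over labels."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example question with hand-assigned tags (not from the paper).
    tokens = ["who", "wrote", "hamlet"]
    tags = ["WP", "VBD", "NNP"]
    tagged = add_pos_tags(tokens, tags)
    print(term_frequencies(tagged, orders=(1, 2)))  # Unigram+Bigram counts
    print(term_frequencies(tagged, orders=(1, 3)))  # Unigram+Trigram counts
    # Toy labels drawn from the TREC-6 coarse classes.
    print(micro_f1(["HUM", "LOC", "NUM", "HUM"], ["HUM", "LOC", "HUM", "HUM"]))
```

In a single-label setting such as TREC-6, every misclassification counts once as a false positive and once as a false negative, so micro F1 coincides with plain accuracy. The resulting sparse feature dictionaries would then be passed to a classifier such as SVM, which is outside the scope of this sketch.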

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/867d/8554172/7c47a5c8cf9e/gr001.jpg
