Chotirat Saranlita, Meesad Phayung
Department of Information Technology, Faculty of Information Technology and Digital Innovation, King Mongkut's University of Technology North Bangkok, Thailand.
Department of Information Technology Management, Faculty of Information Technology and Digital Innovation, King Mongkut's University of Technology North Bangkok, Thailand.
Heliyon. 2021 Oct 19;7(10):e08216. doi: 10.1016/j.heliyon.2021.e08216. eCollection 2021 Oct.
Question classification is a crucial task for answer selection. It helps define the structure of a question sentence through features extracted from the sentence, such as who, when, where, and how. In this paper, we propose a methodology to improve question classification from text using feature selection and word embedding techniques. We conducted several experiments to evaluate the proposed methodology on two datasets (the TREC-6 dataset and a Thai sentence dataset), with term frequency and combined term frequency-inverse document frequency (TF-IDF) as features, over Unigram, Unigram+Bigram, and Unigram+Trigram representations. Both traditional and deep learning classifiers were used. The traditional classification models were Multinomial Naïve Bayes, Logistic Regression, and Support Vector Machine (SVM). The deep learning models were Bidirectional Long Short-Term Memory (BiLSTM), Convolutional Neural Networks (CNN), and a hybrid model combining CNN and BiLSTM. The experimental results showed that our methodology based on Part-of-Speech (POS) tagging performed best at improving question classification accuracy. On the TREC-6 dataset, the SVM model achieved an average micro F1-score of 0.98 when all POS tags were added. On the Thai sentence dataset, the highest average micro F1-score was 0.8, achieved by the CNN model with GloVe embeddings when focusing tags were added.
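To make the feature and metric terminology concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace-tokenized English input and uses only the Python standard library. The function names `ngram_counts` and `micro_f1`, the example question, and the gold and predicted labels are all hypothetical and chosen for illustration.

```python
from collections import Counter


def ngram_counts(tokens, orders=(1, 2)):
    """Term-frequency features over the given n-gram orders.

    orders=(1, 2) corresponds to a Unigram+Bigram representation;
    orders=(1, 3) would correspond to Unigram+Trigram.
    """
    counts = Counter()
    for n in orders:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1 once."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, but the true label differs
            elif t == label:
                fn += 1  # true label missed by the prediction
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical question, tokenized on whitespace (an assumption of this sketch).
features = ngram_counts("who wrote the book".split(), orders=(1, 2))

# Hypothetical gold and predicted coarse question classes, for illustration only.
gold = ["HUM", "LOC", "NUM", "HUM", "DESC"]
pred = ["HUM", "LOC", "HUM", "HUM", "DESC"]
score = micro_f1(gold, pred)
```

When every question receives exactly one label, the micro-averaged F1 computed this way equals plain accuracy, because each misclassification contributes one false positive and one false negative.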