Dept. of Computer Science & Engineering, UET, Lahore, Pakistan; Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; Department of Computer Science, UMT Lahore, Sialkot Campus, Pakistan.
Al-Khawarizmi Institute of Computer Science, UET, Lahore, Pakistan; German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany.
J Biomed Inform. 2019 May;93:103143. doi: 10.1016/j.jbi.2019.103143. Epub 2019 Mar 12.
Question classification is considered one of the most significant phases of a typical Question Answering (QA) system. It assigns answer types to each question, which narrows the search space of possible answers for factoid and list-type questions. This process is also known as Lexical Answer Type (LAT) prediction. Although much work has been done to improve question classification into coarse and fine classes in diverse domains, it remains a challenging task in the biomedical field. The difficulty in biomedical question classification stems from the fact that one question may have more than one label, or expected answer type, associated with it (also referred to as multi-label classification). In the biomedical domain, only preliminary work has been done on classifying multi-label questions, by transforming them into single-label instances through the copy transformation technique. In this paper, we generate a multi-labeled corpus (MLBioMedLAT) by exploring the process of the Open Advancement of Question Answering (OAQA) system for the task of biomedical question classification. We use 780 biomedical questions from the BioASQ challenge and assign them appropriate labels. To annotate these labels, we use the answers to each question and assign semantic type labels to the question by leveraging an existing corpus and the OAQA system. The paper introduces a data transformation approach, Label Power Set with logistic regression (LPLR), for multi-label biomedical question classification and compares its performance with Structured SVM (SSVM), Restricted Boltzmann Machine (RBM), and copy transformation based logistic regression (CLR), which was previously used for a similar task in the OAQA system. To evaluate the introduced data transformation technique, we use three prominent evaluation measures: MicroF, Accuracy, and Hamming Loss.
In terms of MicroF, the introduced technique, coupled with a new feature set, surpasses CLR, SSVM, and RBM by margins of 7%, 8%, and 22%, respectively.
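The contrast between copy transformation and Label Power Set, and the three evaluation measures named above, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the example labels are hypothetical, Accuracy is assumed to be the example-based Jaccard form common in multi-label work, and the classifier itself (logistic regression on the transformed targets) is left out.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def copy_transform(questions: Sequence[str],
                   label_sets: Sequence[Set[str]]) -> List[Tuple[str, str]]:
    """Copy transformation: repeat each question once per label,
    producing single-label (question, label) training pairs."""
    return [(q, label)
            for q, labels in zip(questions, label_sets)
            for label in sorted(labels)]


def label_powerset_transform(
        label_sets: Sequence[Set[str]]) -> Tuple[List[int], List[FrozenSet[str]]]:
    """Label Power Set: treat each distinct label combination as one class.
    Returns the class id per question and the id -> label-set lookup."""
    class_of: Dict[FrozenSet[str], int] = {}
    classes: List[FrozenSet[str]] = []
    y: List[int] = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(classes)
            classes.append(key)
        y.append(class_of[key])
    return y, classes


def hamming_loss(gold: Sequence[Set[str]], pred: Sequence[Set[str]],
                 all_labels: Set[str]) -> float:
    """Fraction of (question, label) decisions that are wrong."""
    wrong = sum(len(set(g) ^ set(p)) for g, p in zip(gold, pred))
    return wrong / (len(gold) * len(all_labels))


def accuracy(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Example-based accuracy (assumed here): mean Jaccard overlap."""
    scores = []
    for g, p in zip(gold, pred):
        union = set(g) | set(p)
        # Two empty label sets count as a perfect match.
        scores.append(len(set(g) & set(p)) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, FN over all labels and questions."""
    tp = sum(len(set(g) & set(p)) for g, p in zip(gold, pred))
    fp = sum(len(set(p) - set(g)) for g, p in zip(gold, pred))
    fn = sum(len(set(g) - set(p)) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Under Label Power Set, a single-label classifier is trained on the class ids and each predicted id is mapped back to its full label set, so label co-occurrence is preserved. Under copy transformation, each copy of a question carries only one of its labels.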