Department of Health Sciences, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, United States.
J Biomed Inform. 2011 Dec;44(6):1032-8. doi: 10.1016/j.jbi.2011.08.008. Epub 2011 Aug 12.
Both healthcare professionals and healthcare consumers have information needs that can be met through the use of computers, specifically via medical question answering systems. However, the information needs of both groups are different in terms of literacy levels and technical expertise, and an effective question answering system must be able to account for these differences if it is to formulate the most relevant responses for users from each group. In this paper, we propose that a first step toward answering the queries of different users is automatically classifying questions according to whether they were asked by healthcare professionals or consumers.
We obtained two sets of consumer questions (~10,000 questions in total) from Yahoo! Answers. The professional questions consist of two collections: 4654 point-of-care questions (denoted PointCare) obtained from interviews with a group of family doctors following patient visits, and 5378 questions submitted by physicians in practice through professional online services (denoted OnlinePractice). Using the more than 20,000 questions combined, we developed supervised machine-learning models for automatic classification between consumer questions and professional questions. To evaluate the robustness of our models, we tested the model trained on the Consumer-PointCare dataset against the Consumer-OnlinePractice dataset. We evaluated both linguistic and statistical features and examined how the characteristics of the two types of professional questions (PointCare vs. OnlinePractice) may affect classification performance. We also explored information gain for feature reduction and back-off linguistic-category features.
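The pipeline described above can be sketched as follows. This is a minimal illustration, not the study's actual implementation: the toy questions, the bag-of-words features, the choice of Multinomial Naive Bayes as the learner, and the feature count k are all assumptions for demonstration; information gain is approximated here by scikit-learn's mutual-information scorer.

```python
# Hypothetical sketch of the abstract's setup: text features, information-gain
# feature selection, a supervised classifier, and 10-fold cross-validation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real datasets: 0 = consumer, 1 = professional
consumer = [
    "what does this red rash on my arm mean",
    "is it normal to feel dizzy after standing up",
    "can i take ibuprofen with my cold medicine",
    "why does my stomach hurt after eating dairy",
    "how do i know if my headache is serious",
    "should i worry about a mole that changed color",
    "what home remedies help a sore throat",
    "is my child's fever too high for school",
    "why am i always tired even after sleeping",
    "can stress cause chest pain in young people",
]
professional = [
    "what is the recommended metformin dose in renal impairment",
    "which antibiotic covers mrsa in outpatient cellulitis",
    "is warfarin bridging indicated before this procedure",
    "what are the contraindications to thrombolysis in stroke",
    "how should subclinical hypothyroidism be managed in pregnancy",
    "what is first line therapy for stage two hypertension",
    "when is colonoscopy surveillance indicated after polypectomy",
    "which statin intensity is appropriate for secondary prevention",
    "what anticoagulation is preferred in nonvalvular atrial fibrillation",
    "how is community acquired pneumonia severity scored",
]
questions = consumer + professional
labels = [0] * len(consumer) + [1] * len(professional)

pipeline = Pipeline([
    ("bow", CountVectorizer()),                      # unigram bag-of-words features
    ("ig", SelectKBest(mutual_info_classif, k=30)),  # information gain ~ mutual information
    ("clf", MultinomialNB()),                        # assumed learner; the paper may use another
])

# Stratified 10-fold cross-validation, mirroring the evaluation design
scores = cross_val_score(pipeline, questions, labels, cv=10, scoring="f1")
print(f"mean F1 over 10 folds: {scores.mean():.3f}")
```

Wrapping the vectorizer, selector, and classifier in a single `Pipeline` ensures that feature selection is re-fit inside each fold, so no test-fold vocabulary leaks into training.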
The 10-fold cross-validation results showed best F1-measures of 0.936 and 0.946 on Consumer-PointCare and Consumer-OnlinePractice, respectively, and a best F1-measure of 0.891 when testing the Consumer-PointCare model on the Consumer-OnlinePractice dataset.
Healthcare consumer questions posted in Yahoo! online communities can be reliably distinguished from professional questions posed by point-of-care clinicians and online physicians. The supervised machine-learning models are robust for this task. Our study will significantly benefit the further development of automated consumer question answering.