He Zhe, Chen Zhiwei, Oh Sanghee, Hou Jinghui, Bian Jiang
School of Information, Florida State University, Tallahassee, FL 32306, USA; Institute for Successful Longevity, Florida State University, Tallahassee, FL 32306, USA.
Department of Computer Science, Florida State University, Tallahassee, FL 32306, USA.
J Biomed Inform. 2017 May;69:75-85. doi: 10.1016/j.jbi.2017.03.016. Epub 2017 Mar 27.
The well-known vocabulary gap between health consumers and healthcare professionals hinders consumers' information seeking and health dialogue in end-user health applications. The Open Access and Collaborative Consumer Health Vocabulary (OAC CHV), which contains health-related terms used by lay consumers, was created to bridge this gap. Specifically, the OAC CHV facilitates consumers' health information retrieval by enabling consumer-facing health applications to translate between professional language and consumer-friendly language. To keep up with constantly evolving medical knowledge and language use, new terms need to be identified and added to the OAC CHV. User-generated content on social media, including social question and answer (social Q&A) sites, affords an enormous opportunity for mining consumer health terms. Existing methods of identifying new consumer terms from text typically rely on ad hoc lexico-syntactic patterns and human review. Our study extends an existing method by extracting n-grams from a social Q&A textual corpus and representing them with a rich set of contextual and syntactic features. Using K-means clustering, our method, simiTerm, was able to identify terms that are both contextually and syntactically similar to existing OAC CHV terms. We tested our method on social Q&A corpora in two disease domains: diabetes and cancer. Our method outperformed three baseline ranking methods. A post hoc qualitative evaluation by human experts further validated that our method can effectively identify meaningful new consumer terms in social Q&A text.
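The pipeline described in the abstract (extract n-grams, represent them as feature vectors, cluster with K-means, then look for unknown n-grams that share a cluster with known vocabulary terms) can be illustrated with a minimal sketch. This is not the simiTerm implementation: the feature set here is only a bag of neighbouring words, whereas the paper uses a richer set of contextual and syntactic features, and every function name, the window size, and the toy K-means routine below are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def extract_ngrams(tokens, max_n=2):
    """Return (ngram, start, end) for every n-gram of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append((" ".join(tokens[i:i + n]), i, i + n))
    return grams


def context_vectors(sentences, max_n=2, window=2):
    """Map each n-gram to a bag of the words found within `window` tokens.

    Assumption: whitespace tokenisation and context words only; the
    paper's syntactic features are not modelled here.
    """
    contexts = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for gram, start, end in extract_ngrams(tokens, max_n):
            ctx = tokens[max(0, start - window):start] + tokens[end:end + window]
            contexts.setdefault(gram, Counter()).update(ctx)
    return contexts


def to_dense(contexts):
    """Turn the context bags into L2-normalised dense vectors."""
    vocab = sorted({w for bag in contexts.values() for w in bag})
    terms = sorted(contexts)
    vectors = []
    for term in terms:
        vec = [float(contexts[term][w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return terms, vectors


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-means with seeded random initialisation."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, vec in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def candidate_terms(sentences, known_terms, k=3):
    """Return unknown n-grams that share a cluster with a known term."""
    terms, vectors = to_dense(context_vectors(sentences))
    labels = kmeans(vectors, min(k, len(vectors)))
    seed_clusters = {lab for t, lab in zip(terms, labels) if t in known_terms}
    return sorted(
        t for t, lab in zip(terms, labels)
        if lab in seed_clusters and t not in known_terms
    )
```

A call such as `candidate_terms(posts, {"glucose"}, k=5)`, where `posts` is a hypothetical list of question or answer sentences, returns the n-grams that fall in the same cluster as a known term. In practice these candidates would still need ranking and human review, as the abstract notes.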