Vydiswaran V G Vinod, Mei Qiaozhu, Hanauer David A, Zheng Kai
School of Information, University of Michigan, Ann Arbor, MI.
School of Information, University of Michigan, Ann Arbor, MI ; Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI.
AMIA Annu Symp Proc. 2014 Nov 14;2014:1150-9. eCollection 2014.
Community-generated text corpora can be a valuable resource for extracting consumer health vocabulary (CHV) and linking it to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV terms and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, based on the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MEDLINE abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label each term as either a consumer term or a professional term with high accuracy. We conclude that the proposed approach has great potential to produce a high-quality CHV that improves the performance of computational applications in processing consumer-generated health text.
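The abstract does not give the exact form of the frequency-ratio measure, so the following is only a minimal sketch of how such a measure might work, not the authors' implementation. It assumes that a term's relative frequency in a consumer corpus (such as forum posts) is divided by its relative frequency in a professional corpus (such as abstracts), and that within a synonymous pair the term with the higher ratio is labeled the consumer term. The function names, the add-one smoothing, and the toy counts are all hypothetical.

```python
from collections import Counter


def relative_freq(term, counts, total, smoothing=1.0):
    """Smoothed relative frequency of a term (smoothing is an assumption)."""
    return (counts.get(term, 0) + smoothing) / (total + smoothing)


def consumer_ratio(term, consumer_counts, professional_counts):
    """Ratio of consumer-corpus frequency to professional-corpus frequency."""
    c_total = sum(consumer_counts.values())
    p_total = sum(professional_counts.values())
    c_freq = relative_freq(term, consumer_counts, c_total)
    p_freq = relative_freq(term, professional_counts, p_total)
    return c_freq / p_freq


def label_pair(term_a, term_b, consumer_counts, professional_counts):
    """Label the term with the higher ratio as the consumer term."""
    ratio_a = consumer_ratio(term_a, consumer_counts, professional_counts)
    ratio_b = consumer_ratio(term_b, consumer_counts, professional_counts)
    if ratio_a >= ratio_b:
        return {term_a: "consumer", term_b: "professional"}
    return {term_a: "professional", term_b: "consumer"}


# Toy counts invented for illustration; they are not data from the study.
forum_counts = Counter({"heart attack": 90, "myocardial infarction": 10})
abstract_counts = Counter({"heart attack": 20, "myocardial infarction": 80})

labels = label_pair(
    "heart attack", "myocardial infarction", forum_counts, abstract_counts
)
print(labels)
```

Comparing the two ratios within a pair, rather than against a fixed threshold, is one possible design choice; it avoids having to calibrate a cutoff across corpora of different sizes.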