Gero Zelalem, Ho Joyce
Emory University, Department of Computer Science, Atlanta, USA.
J Biomed Inform. 2019;100S:100047. doi: 10.1016/j.yjbinx.2019.100047. Epub 2019 Jul 20.
Distributed semantic representations of biomedical text can benefit text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Although vector space models such as Word2Vec and GloVe have been highly successful, they provide only single-word (unigram) representations, so the semantics of multi-word phrases can only be approximated by composing the vectors of their constituent words. This is problematic in biomedical text processing, where technical phrases for diseases, symptoms, and drugs should be represented as single entities to capture their correct meaning. In this paper, we introduce PMCVec, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously. Evaluations on benchmark datasets show significant performance gains, both qualitatively and quantitatively.
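To illustrate what treating a multi-word phrase as a single entity can look like in practice, here is a minimal Python sketch. It is not PMCVec's phrase-generation method, which the abstract does not detail. It assumes a phrase inventory is already available (the `PHRASES` set below is hypothetical) and uses a simple greedy longest-match to rewrite each known phrase as one token, so that any downstream embedding trainer would learn a single vector for it alongside ordinary words. The function name `merge_phrases` and the `max_len` parameter are likewise illustrative.

```python
from typing import List, Set, Tuple


def merge_phrases(
    tokens: List[str],
    phrases: Set[Tuple[str, ...]],
    max_len: int = 4,
) -> List[str]:
    """Replace known multi-word phrases with one underscore-joined token.

    Uses greedy longest-match from left to right; this is an assumed,
    simplified strategy, not the phrase extraction described in the paper.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        match_len = 1  # default: keep the single word as its own token
        # Try the longest candidate window first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                match_len = n
                break
        merged.append("_".join(tokens[i:i + match_len]))
        i += match_len
    return merged


# Hypothetical phrase inventory, for illustration only.
PHRASES = {
    ("myocardial", "infarction"),
    ("diabetes", "mellitus"),
    ("type", "2", "diabetes", "mellitus"),
}

sentence = "patients with type 2 diabetes mellitus had myocardial infarction".split()
print(merge_phrases(sentence, PHRASES))
# ['patients', 'with', 'type_2_diabetes_mellitus', 'had', 'myocardial_infarction']
```

After this preprocessing, `type_2_diabetes_mellitus` is one vocabulary entry rather than four, so its vector would be learned directly from its contexts instead of being approximated by composing the vectors of its parts.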