Nguyen Nhung T H, Miwa Makoto, Tsuruoka Yoshimasa, Tojo Satoshi
University of Science, Vietnam National University, Ho Chi Minh City, 227 Nguyen Van Cu St., Ward 4, Dist. 5, Ho Chi Minh City, Viet Nam; Japan Advanced Institute of Science and Technology, 1-8 Asahidai, Nomi-shi, Ishikawa 923-1292, Japan.
Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya 468-8511, Japan.
J Biomed Inform. 2015 Aug;56:94-102. doi: 10.1016/j.jbi.2015.05.010. Epub 2015 May 22.
Many text mining applications in the biomedical domain benefit from automatic clustering of relational phrases into synonymous groups, since it alleviates the problem of spurious mismatches caused by the diversity of natural language expressions. Most of the previous work that has addressed this task of synonymy resolution uses similarity metrics between relational phrases based on textual strings or dependency paths, which, for the most part, ignore the context around the relations. To overcome this shortcoming, we employ a word embedding technique to encode relational phrases. We then apply the k-means algorithm on top of the distributional representations to cluster the phrases. Our experimental results show that this approach outperforms state-of-the-art statistical models including latent Dirichlet allocation and Markov logic networks.
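The pipeline described above (encode each relational phrase as a vector, then cluster the vectors with k-means) can be illustrated with a minimal sketch. The abstract does not say how word embeddings are trained or how they are composed into a phrase vector, so the averaging in `embed_phrase`, the hand-written 2-dimensional `TOY_VECTORS`, and the deterministic farthest-point initialization in `kmeans` are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Hypothetical 2-dimensional word vectors, hand-written for illustration only.
# Real embeddings would be learned from a large biomedical corpus.
TOY_VECTORS = {
    "activates": [0.9, 0.1],
    "induces": [0.8, 0.2],
    "stimulates": [0.85, 0.1],
    "inhibits": [0.1, 0.9],
    "suppresses": [0.2, 0.8],
    "blocks": [0.15, 0.9],
    "expression": [0.5, 0.5],
    "of": [0.5, 0.5],
}


def embed_phrase(phrase, vectors):
    """Encode a phrase as the mean of its known word vectors (an assumed composition)."""
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, max_iter=100):
    """Plain k-means with Euclidean distance; returns one cluster label per point."""
    # Deterministic farthest-point initialization (assumed here for reproducibility).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


phrases = [
    "activates",
    "induces expression of",
    "stimulates",
    "inhibits",
    "suppresses expression of",
    "blocks",
]
points = [embed_phrase(p, TOY_VECTORS) for p in phrases]
labels = kmeans(points, k=2)
for phrase, label in zip(phrases, labels):
    print(label, phrase)
```

On this toy input the activation-like phrases and the inhibition-like phrases fall into separate clusters. With real embeddings, the number of clusters and the initialization strategy would need to be chosen for the data at hand.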