1 Rensselaer Polytechnic Institute, Troy, New York.
2 Information Sciences Institute, Marina del Rey, California.
Big Data. 2017 Mar;5(1):19-31. doi: 10.1089/big.2017.0012.
Automatically recognizing and typing entities in natural language text without prior knowledge (e.g., predefined entity types) is a major challenge in processing such text. Most existing entity typing systems are limited to certain domains, genres, and languages. In this article, we propose a novel unsupervised entity-typing framework that combines symbolic and distributional semantics. We start by learning three types of representations for each entity mention: a general semantic representation, a specific context representation, and a knowledge representation based on knowledge bases. We then develop a novel joint hierarchical clustering and linking algorithm that types all mentions using these representations. This framework does not rely on any annotated data, predefined typing schema, or handcrafted features; therefore, it can be quickly adapted to a new domain, genre, and/or language. Experiments on two genres (news and discussion forums) show performance comparable to state-of-the-art supervised typing systems trained on a large amount of labeled data. Results on various languages (English, Chinese, Japanese, Hausa, and Yoruba) and domains (general and biomedical) demonstrate the portability of our framework.
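The pipeline described in the abstract (three representations per mention, followed by hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits the linking step entirely, replaces learned representations with hand-written toy vectors, and assumes plain average-linkage agglomerative clustering over cosine distance. The function names (`combine`, `type_mentions`), the `mentions` data, and the `threshold` parameter are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(general, context, knowledge):
    """Concatenate the three representations of one mention.

    Assumption: each part is L2-normalized first so that all three
    contribute equally to the combined vector.
    """
    return normalize(general) + normalize(context) + normalize(knowledge)


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def type_mentions(mentions, threshold):
    """Group mentions by average-linkage agglomerative clustering.

    `mentions` maps a mention string to a (general, context, knowledge)
    triple of vectors. Merging stops once the closest pair of clusters
    is farther apart than `threshold`. Each resulting cluster stands in
    for one induced (unnamed) entity type.
    """
    names = list(mentions)
    vectors = [combine(*mentions[name]) for name in names]
    clusters = [[i] for i in range(len(names))]

    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
            dist = sum(cosine_distance(vectors[p], vectors[q])
                       for p, q in pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, i, j)
        dist, i, j = best
        if dist > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    # Sort for a deterministic, readable result.
    return sorted(sorted(names[i] for i in c) for c in clusters)


# Toy 2-dimensional vectors, invented for illustration only.
mentions = {
    "Obama":  ([1.0, 0.1], [0.9, 0.2], [1.0, 0.0]),
    "Merkel": ([0.9, 0.2], [1.0, 0.1], [1.0, 0.0]),
    "Paris":  ([0.1, 1.0], [0.2, 0.9], [0.0, 1.0]),
    "Berlin": ([0.2, 0.9], [0.1, 1.0], [0.0, 1.0]),
}

if __name__ == "__main__":
    print(type_mentions(mentions, threshold=0.5))
```

With the toy data and a threshold of 0.5, the person-like mentions and the city-like mentions end up in two separate clusters. In a setting like the one the abstract describes, the representations would be learned from text and knowledge bases rather than written by hand, and the stopping criterion would presumably interact with the linking step, which this sketch does not model.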