Siddiqui Tarique, Ren Xiang, Parameswaran Aditya, Han Jiawei
University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Proc ACM Int Conf Inf Knowl Manag. 2016 Oct;2016:871-880. doi: 10.1145/2983323.2983828.
Given the large volume of technical documents available, it is crucial to organize and categorize these documents automatically in order to understand and extract value from them. Toward this end, we introduce a new research problem called Facet Extraction. Given a collection of technical documents, the goal of Facet Extraction is to automatically label each document with a set of concepts for the key facets (e.g., application, technique, evaluation metrics, and dataset) that people may be interested in. Facet Extraction has numerous applications, including document summarization, literature search, patent search, and business intelligence. The major challenges in performing Facet Extraction arise from multiple sources: concept extraction, concept-to-facet matching, and facet disambiguation. To tackle these challenges, we develop FacetGist, a framework for facet extraction. Facet Extraction involves constructing a graph-based heterogeneous network to capture information available across multiple sentence-level features, as well as context features. We then formulate a joint optimization problem and propose an efficient algorithm for graph-based label propagation to estimate the facet of each concept mention. Experimental results on technical corpora from two domains demonstrate that Facet Extraction can lead to an improvement of over 25% in both precision and recall over competing schemes.
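To make the idea of graph-based label propagation concrete, the following is a minimal sketch of a generic propagation scheme on a toy graph of concept mentions. It is not the FacetGist joint optimization, which the abstract does not spell out. It assumes a standard iterative update F <- alpha * S * F + (1 - alpha) * Y over a symmetrically normalized adjacency matrix S. The function name `propagate_labels`, the `FACETS` list, the edge weights, and the example mentions are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: generic label propagation, NOT the FacetGist algorithm.
# All names, weights, and mentions below are hypothetical.

FACETS = ["application", "technique", "metric", "dataset"]


def propagate_labels(edges, seeds, num_nodes, num_labels, alpha=0.8, iters=50):
    """Spread seed labels over an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples
    seeds: dict mapping node index -> known label index
    Returns the highest-scoring label index for every node.
    """
    # Dense symmetric adjacency matrix built from the edge list.
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v, w in edges:
        adj[u][v] += w
        adj[v][u] += w
    deg = [sum(row) for row in adj]

    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    norm = [
        [adj[i][j] / (deg[i] * deg[j]) ** 0.5 if adj[i][j] else 0.0
         for j in range(num_nodes)]
        for i in range(num_nodes)
    ]

    # Y holds the seed labels as one-hot rows; unlabeled nodes stay all-zero.
    y = [[0.0] * num_labels for _ in range(num_nodes)]
    for node, label in seeds.items():
        y[node][label] = 1.0

    # Iterate F <- alpha * S * F + (1 - alpha) * Y.
    f = [row[:] for row in y]
    for _ in range(iters):
        f = [
            [alpha * sum(norm[i][k] * f[k][c] for k in range(num_nodes))
             + (1 - alpha) * y[i][c]
             for c in range(num_labels)]
            for i in range(num_nodes)
        ]
    return [max(range(num_labels), key=lambda c: f[i][c]) for i in range(num_nodes)]


# Toy graph of four concept mentions (hypothetical):
# 0 "SVM" (seed: technique), 1 "logistic regression",
# 2 "F1 score" (seed: metric), 3 "precision".
mentions = ["SVM", "logistic regression", "F1 score", "precision"]
edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 3, 0.1)]  # strong within-facet, weak cross-facet links
seeds = {0: FACETS.index("technique"), 2: FACETS.index("metric")}

predicted = propagate_labels(edges, seeds, num_nodes=4, num_labels=len(FACETS))
for name, label in zip(mentions, predicted):
    print(name, "->", FACETS[label])
```

In this toy setup the two unlabeled mentions inherit the facet of the seed they are most strongly connected to. In a heterogeneous network such as the one the abstract describes, edges would instead link mentions to sentence-level and context features.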