Institute for Visualization and Interactive Systems, Universität Stuttgart.
IEEE Trans Vis Comput Graph. 2012 Dec;18(12):2839-48. doi: 10.1109/TVCG.2012.277.
Performing exhaustive searches over a large number of text documents can be tedious, since it is very hard to formulate search queries or define filter criteria that adequately capture an analyst's information need. Classification through machine learning has the potential to improve search and filter tasks for individual information needs that are either complex or very specific. Unfortunately, analysts who are knowledgeable in their field are typically not machine learning specialists, and most classification methods require a certain expertise in their parametrization to achieve good results. Supervised machine learning algorithms, in contrast, rely on labeled data, which analysts can provide. However, the labeling effort can be very high, which shifts the problem from composing complex queries or defining accurate filters to another laborious task, on top of the need to judge the trained classifier's quality. We therefore compare three approaches for interactive classifier training in a user study. All of the approaches are potential candidates for integration into a larger retrieval system. They incorporate active learning to various degrees in order to reduce the labeling effort and to increase effectiveness. Two of them include interactive visualization that lets users explore the status of the classifier in the context of the labeled documents and judge the quality of the classifier in iterative feedback loops. We see our work as a step towards introducing user-controlled classification methods in addition to text search and filtering for increasing recall in analytics scenarios involving large corpora.
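To illustrate how active learning can reduce labeling effort, the following is a minimal sketch of pool-based uncertainty sampling, one common active learning strategy. The abstract does not state which classifier or query strategy the three compared approaches use, so everything here is an assumption made for illustration: the toy naive Bayes model, and the hypothetical names `TinyNaiveBayes`, `most_uncertain`, `active_learning`, and `oracle` (a stand-in for the analyst who supplies labels).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing.

    Illustrative stand-in only; not the classifier used in the paper.
    Labels are 1 (relevant) and 0 (not relevant).
    """

    def fit(self, docs, labels):
        # Rebuild all counts from scratch on every call.
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = {0: 0, 1: 0}
        for doc, y in zip(docs, labels):
            self.counts[y].update(tokenize(doc))
            self.n_docs[y] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_relevant(self, doc):
        """Return the estimated probability that doc belongs to class 1."""
        total_docs = sum(self.n_docs.values())
        log_p = {}
        for y in (0, 1):
            total_tokens = sum(self.counts[y].values())
            lp = math.log((self.n_docs[y] + 1) / (total_docs + 2))
            for tok in tokenize(doc):
                lp += math.log(
                    (self.counts[y][tok] + 1)
                    / (total_tokens + len(self.vocab) + 1)
                )
            log_p[y] = lp
        # Normalize in a numerically stable way.
        m = max(log_p.values())
        e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
        return e1 / (e0 + e1)


def most_uncertain(model, pool):
    """Index of the pool document whose score is closest to 0.5."""
    return min(
        range(len(pool)),
        key=lambda i: abs(model.prob_relevant(pool[i]) - 0.5),
    )


def active_learning(seed_docs, seed_labels, pool, oracle, budget):
    """Ask the oracle to label at most `budget` documents, most uncertain first."""
    docs, labels = list(seed_docs), list(seed_labels)
    pool = list(pool)
    model = TinyNaiveBayes().fit(docs, labels)
    for _ in range(min(budget, len(pool))):
        doc = pool.pop(most_uncertain(model, pool))
        docs.append(doc)
        labels.append(oracle(doc))  # the analyst provides the label
        model.fit(docs, labels)     # retrain after each new label
    return model, docs, labels
```

In this sketch the analyst labels only the documents the current model is least sure about, rather than the whole pool. An interactive system would typically replace `oracle` with a user interface and could let the user override which document is labeled next.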