Dai Ben, Shen Xiaotong, Wong Wing
School of Statistics, University of Minnesota, Minneapolis, MN 55455.
Department of Statistics and Biomedical Data Science, Stanford University, CA 94305.
J Am Stat Assoc. 2022;117(539):1243-1253. doi: 10.1080/01621459.2020.1844719. Epub 2021 Jan 4.
Instance generation creates representative examples to interpret a learning model, as in regression and classification. For example, representative sentences of a topic of interest describe the topic specifically for sentence categorization. In such a situation, a large number of unlabeled observations may be available in addition to labeled data; for example, many unclassified text corpora (unlabeled instances) are available with only a few classified sentences (labeled instances). In this article, we introduce a novel generative method, called a coupled generator, producing instances given a specific learning outcome, based on indirect and direct generators. The indirect generator uses the inverse principle to yield the corresponding inverse probability, making it possible to generate instances by leveraging unlabeled data. The direct generator learns the distribution of an instance given its learning outcome. The coupled generator then selects the better of the indirect and direct generators, and is designed to enjoy the benefits of both and deliver higher generation accuracy. For sentence generation given a topic, we develop an embedding-based regression/classification in conjunction with an unconditional recurrent neural network for the indirect generator, whereas a conditional recurrent neural network is natural for the corresponding direct generator. Moreover, we derive finite-sample generation error bounds for the indirect and direct generators to reveal the generative aspects of both methods, thus explaining the benefits of the coupled generator. Finally, we apply the proposed methods to a real benchmark of abstract classification and demonstrate that the coupled generator composes reasonably good sentences from a dictionary to describe a specific topic of interest.
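A minimal sketch of the coupled-generator idea on a toy discrete problem, assuming the paper's setup but with all distributions, sample sizes, and function names invented here for illustration. The indirect generator inverts a fitted predictive model via Bayes' rule, p(x|y) ∝ p(y|x)p(x), exploiting the unlabeled data through the marginal p(x); the direct generator estimates p(x|y) from labeled pairs alone; the coupled generator keeps whichever conditional fits the labeled data better (in practice a held-out split would be used for this comparison).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth: instances x in {0,1,2,3}, outcomes y in {0,1}.
true_p_x = np.array([0.4, 0.3, 0.2, 0.1])            # marginal p(x)
true_p_y_given_x = np.array([[0.9, 0.1],             # rows: x, cols: y
                             [0.7, 0.3],
                             [0.3, 0.7],
                             [0.1, 0.9]])

# Many unlabeled draws of x, only a few labeled (x, y) pairs.
x_unlab = rng.choice(4, size=5000, p=true_p_x)
x_lab = rng.choice(4, size=200, p=true_p_x)
y_lab = np.array([rng.choice(2, p=true_p_y_given_x[x]) for x in x_lab])

# Indirect generator: estimate p(x) from the unlabeled data and
# p(y|x) from the labeled data, then invert: p(x|y) ∝ p(y|x) p(x).
p_x_hat = np.bincount(x_unlab, minlength=4) / len(x_unlab)
p_y_given_x_hat = np.ones((4, 2))                    # Laplace smoothing
for x, y in zip(x_lab, y_lab):
    p_y_given_x_hat[x, y] += 1
p_y_given_x_hat /= p_y_given_x_hat.sum(axis=1, keepdims=True)

def indirect_p_x_given_y(y):
    unnorm = p_y_given_x_hat[:, y] * p_x_hat
    return unnorm / unnorm.sum()

# Direct generator: estimate p(x|y) from the labeled pairs only.
def direct_p_x_given_y(y):
    counts = np.bincount(x_lab[y_lab == y], minlength=4) + 1.0
    return counts / counts.sum()

# Coupled generator: keep whichever conditional assigns higher
# log-likelihood to the labeled pairs.
def loglik(cond):
    return sum(np.log(cond(y)[x]) for x, y in zip(x_lab, y_lab))

best = max([indirect_p_x_given_y, direct_p_x_given_y], key=loglik)
sample = rng.choice(4, p=best(1))   # generate an instance for outcome y=1
```

With abundant unlabeled data the indirect route benefits from a sharper estimate of p(x), while the direct route avoids inversion error; the coupled rule hedges between the two, mirroring the comparison formalized by the paper's finite-sample error bounds.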