Inouye David, Yang Eunho, Allen Genevera, Ravikumar Pradeep
University of Texas at Austin.
Korea Advanced Institute of Science and Technology.
Wiley Interdiscip Rev Comput Stat. 2017 May-Jun;9(3). doi: 10.1002/wics.1398. Epub 2017 Mar 28.
The Poisson distribution has been widely studied and used for modeling univariate count-valued data. Multivariate generalizations of the Poisson distribution that permit dependencies, however, have been far less popular. Yet real-world high-dimensional count-valued data, such as word counts, genomics data, and crime statistics, exhibit rich dependencies and motivate the need for multivariate distributions that can appropriately model these data. We review multivariate distributions derived from the univariate Poisson, categorizing these models into three main classes: (1) models whose marginal distributions are Poisson, (2) models whose joint distribution is a mixture of independent multivariate Poisson distributions, and (3) models whose node-conditional distributions are derived from the Poisson. We discuss the development of multiple instances of these classes and compare the models in terms of interpretability and theory. Then, we empirically compare multiple models from each class on three real-world datasets from different domains with varying data characteristics, namely traffic accident data, biological next-generation sequencing data, and text data. These empirical experiments develop intuition about the comparative advantages and disadvantages of each class of multivariate distribution derived from the Poisson. Finally, we suggest new research directions, which are explored in the discussion section.
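As a rough illustration of the first class (Poisson marginals with dependence), the sketch below uses the classical "common shock" construction: two counts share one latent Poisson term, so each marginal stays Poisson while the shared term induces positive covariance. This is only one possible instance of that class and is not taken from the abstract itself; the function names, rates, and the use of Knuth's sampling method are assumptions chosen for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate using Knuth's multiplication method.

    Suitable for small rates only; chosen here to avoid external dependencies.
    """
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_common_shock(rate_1, rate_2, rate_shared, n, seed=0):
    """Sample n pairs (x1, x2) with x1 = y1 + y0 and x2 = y2 + y0.

    y0, y1, y2 are independent Poisson variates, so
    x1 ~ Poisson(rate_1 + rate_shared), x2 ~ Poisson(rate_2 + rate_shared),
    and Cov(x1, x2) = rate_shared.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        shared = sample_poisson(rate_shared, rng)
        x1 = sample_poisson(rate_1, rng) + shared
        x2 = sample_poisson(rate_2, rng) + shared
        pairs.append((x1, x2))
    return pairs


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


if __name__ == "__main__":
    pairs = sample_common_shock(2.0, 3.0, 1.5, n=20000)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    print("mean x1:", mean(xs), "mean x2:", mean(ys))
    print("var x1:", covariance(xs, xs), "cov(x1, x2):", covariance(xs, ys))
```

With this construction the sample mean and variance of each count should both be close to its marginal rate, and the sample covariance close to the shared rate. Note that it can only express non-negative dependence, since the covariance equals the rate of the shared term.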