Statistical Inference for Data Adaptive Target Parameters

Author information

Hubbard Alan E, Kherad-Pajouh Sara, van der Laan Mark J

Publication information

Int J Biostat. 2016 May 1;12(1):3-19. doi: 10.1515/ijb-2015-0013.

Abstract

Suppose one observes n i.i.d. copies of a random variable whose probability distribution is known to be an element of a particular statistical model. To define our statistical target, we partition the sample into V equal-size subsamples and use this partitioning to define V splits of the data into an estimation sample (one of the V subsamples) and a corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V sample-specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data set. Given that more research is becoming "data-driven", the theory developed in this paper provides new impetus for greater involvement of statistical inference in problems that are increasingly addressed by clever, yet ad hoc, pattern-finding methods. To suggest this potential, and to verify the predictions of the theory, we present extensive simulation studies along with a data analysis based on adaptively determined intervention rules, which give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
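
The sample-splitting construction described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the parameter-generating algorithm (`choose_parameter`, which picks the covariate most correlated with the outcome and cuts it at its median), the fold-specific estimator (a difference in means), and the standard error (which treats the V fold-specific estimates as independent) are all hypothetical choices made here for illustration. The abstract fixes only the overall scheme: generate the parameter on the complementary sample, estimate it on the held-out subsample, and average over the V splits.

```python
import numpy as np

def choose_parameter(X, y):
    """Hypothetical parameter-generating algorithm (an assumption, not from the
    paper): pick the covariate most correlated with y and cut at its median."""
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    j = int(np.argmax(corr))
    return j, float(np.median(X[:, j]))

def estimate_parameter(X, y, j, cut):
    """Estimate the chosen parameter on the estimation sample: difference in
    mean outcome above vs. at-or-below the cut, plus its sampling variance."""
    hi = y[X[:, j] > cut]
    lo = y[X[:, j] <= cut]
    est = hi.mean() - lo.mean()
    var = hi.var(ddof=1) / len(hi) + lo.var(ddof=1) / len(lo)
    return float(est), float(var)

def sample_split_estimate(X, y, V=5, seed=0):
    """Average of V fold-specific estimates of data adaptive parameters."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    ests, variances = [], []
    for v in range(V):
        est_idx = folds[v]  # estimation sample: one of the V subsamples
        # parameter-generating sample: the complementary V - 1 subsamples
        gen_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        j, cut = choose_parameter(X[gen_idx], y[gen_idx])
        est, var = estimate_parameter(X[est_idx], y[est_idx], j, cut)
        ests.append(est)
        variances.append(var)
    psi = float(np.mean(ests))
    # Simplifying assumption: fold estimates are treated as independent,
    # so Var(mean) = sum(var_v) / V**2.
    se = float(np.sqrt(np.sum(variances)) / V)
    return psi, se

# Simulated example (hypothetical data): only the second covariate matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * (X[:, 1] > 0) + rng.normal(size=500)
psi, se = sample_split_estimate(X, y, V=5)
print(f"estimate = {psi:.3f}, 95% CI = ({psi - 1.96 * se:.3f}, {psi + 1.96 * se:.3f})")
```

The key design point is that the data used to choose which parameter to estimate never overlap with the data used to estimate it within a given split, which is what allows exploratory selection and confirmatory inference to coexist on the same data set.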
