Test of Significance for High-Dimensional Longitudinal Data

Author information

Fang Ethan X, Ning Yang, Li Runze

Affiliations

Department of Statistics, the Pennsylvania State University, University Park, PA 16802-2111, USA.

Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA.

Publication information

Ann Stat. 2020 Oct;48(5):2622-2645. doi: 10.1214/19-aos1900. Epub 2020 Sep 19.

DOI:10.1214/19-aos1900

PMID:34267407

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8277154/

Abstract

This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low dimensional parameter of interest. The major challenge is how to construct a powerful test statistic in the presence of high-dimensional nuisance parameters and complex within-subject correlation of longitudinal data. To deal with the challenge, we propose a new quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. When the parameter of interest is of fixed dimension, we prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. We further extend this result and establish the limiting distribution of the estimator in the setting where the dimension of the parameter of interest grows with the sample size at a polynomial rate. Finally, we study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for each regression parameter controls the FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results indicate that the newly proposed procedure can control both the Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. We also apply the proposed procedure to a real data example.
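As a rough illustration of the multiple testing step mentioned in the abstract, the sketch below applies a Storey-type FDR procedure to a generic vector of p-values. It is not the authors' implementation: in the paper the p-values would come from the proposed decorrelated test statistics, whereas here they are arbitrary inputs. The function name `storey_reject`, the tuning parameter `lam`, and the default level `alpha` are assumptions made for this example.

```python
import numpy as np

def storey_reject(pvalues, alpha=0.1, lam=0.5):
    """Storey-type FDR thresholding on a vector of p-values (illustrative).

    pvalues : one p-value per hypothesis (hypothetical input).
    alpha   : target FDR level (assumed default).
    lam     : tuning parameter in (0, 1) for the null-proportion estimate.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size

    # Estimate the proportion of true nulls from the p-values above lam,
    # capped at 1.
    pi0 = min(1.0, np.sum(p > lam) / (m * (1.0 - lam)))

    # Estimated FDR when rejecting the k smallest p-values:
    # pi0 * m * p_(k) / k.
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    fdr_hat = pi0 * m * p[order] / ranks

    # Reject up to the largest k whose estimated FDR is at most alpha.
    reject = np.zeros(m, dtype=bool)
    passing = np.nonzero(fdr_hat <= alpha)[0]
    if passing.size > 0:
        reject[order[: passing[-1] + 1]] = True
    return reject, pi0
```

With `pi0` fixed at 1 this reduces to the Benjamini-Hochberg step-up rule; estimating `pi0` from the data is what makes the procedure adaptive.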
