Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA.
Department of Bioinformatics and Biostatistics, SPHIS, University of Louisville, Louisville, Kentucky, USA.
Stat Med. 2018 Oct 15;37(23):3357-3372. doi: 10.1002/sim.7825. Epub 2018 Jun 19.
Multisample U-statistics encompass a wide class of test statistics that allow the comparison of 2 or more distributions. U-statistics are especially powerful because they can be applied to both numeric and nonnumeric data, eg, ordinal and categorical data where a pairwise similarity or distance-like measure between categories is available. However, when comparing the distribution of a variable across 2 or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually reweighted data (ie, using the stratification score for retrospective data or the propensity score for prospective data) to construct adjusted U-statistics that can test the equality of distributions across 2 (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed form expression of their asymptotic variance is presented. The utility of our approach is demonstrated through simulation studies, as well as in an analysis of data from a case-control study conducted among African-Americans, comparing whether the similarity in haplotypes (ie, sets of adjacent genetic loci inherited from the same parent) occurring in a case and a control participant differs from the similarity in haplotypes occurring in 2 control participants.
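As a rough illustration of the reweighting idea described above, the following Python sketch computes a weight-normalized two-sample U-statistic. It is not the authors' estimator: the function names, the choice of a Mann-Whitney-type kernel, the normalization by the sum of weight products, and the assumption that propensity (or stratification) scores have already been estimated elsewhere are all illustrative assumptions, and no variance estimate is attempted.

```python
from typing import Callable, List, Sequence


def mann_whitney_kernel(x: float, y: float) -> float:
    """Illustrative kernel h(x, y): 1 if x < y, 0.5 if tied, 0 otherwise."""
    if x < y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def inverse_probability_weights(
    scores: Sequence[float], in_group_one: Sequence[bool]
) -> List[float]:
    """Hypothetical helper: weight 1/e for group-one members and 1/(1 - e)
    for the others, where e is an already-estimated score in (0, 1)."""
    return [1.0 / e if g else 1.0 / (1.0 - e) for e, g in zip(scores, in_group_one)]


def weighted_two_sample_u(
    x: Sequence[float],
    y: Sequence[float],
    wx: Sequence[float],
    wy: Sequence[float],
    kernel: Callable[[float, float], float] = mann_whitney_kernel,
) -> float:
    """Weighted average of kernel(x_i, y_j) over all cross-group pairs,
    with each pair weighted by wx_i * wy_j (assumed normalization)."""
    numerator = 0.0
    denominator = 0.0
    for xi, wi in zip(x, wx):
        for yj, wj in zip(y, wy):
            numerator += wi * wj * kernel(xi, yj)
            denominator += wi * wj
    return numerator / denominator
```

With all weights equal to 1, this reduces to the ordinary two-sample U-statistic for the chosen kernel. Because the kernel is passed in as a function, a pairwise similarity measure on nonnumeric data (for example, haplotype similarity) could be substituted under the same assumptions.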