Marozzi Marco, Mukherjee Amitava, Kalina Jan
Ca' Foscari University of Venice, Venice, Italy.
XLRI-Xavier School of Management, Jamshedpur, India.
J Appl Stat. 2019 Jul 31;47(4):653-665. doi: 10.1080/02664763.2019.1649374. eCollection 2020.
Modern data collection techniques make it possible to analyze a very large number of endpoints. In biomedical research, for example, expressions of thousands of genes are commonly measured on only a small number of subjects. In these situations, traditional methods for comparison studies are not applicable. Moreover, the assumption of a normal distribution is often questionable for high-dimensional data, and some variables may at the same time be highly correlated with others. Hypothesis tests based on interpoint distances are very appealing for studies involving the comparison of means, because they do not assume that the data come from normally distributed populations, and they include tests that are distribution-free, unbiased, consistent, and computationally feasible even when the number of endpoints is much larger than the number of subjects. New tests based on interpoint distances are proposed for multivariate studies involving the simultaneous comparison of means and variability, or of the whole shape of the distributions. The tests are shown to perform well in terms of power when the endpoints have complex dependence relations, as in genomic and metabolomic studies. A practical application to a genetic cardiovascular case-control study is discussed.
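To illustrate the general idea of a test based on interpoint distances, the following is a minimal sketch and not the tests proposed in the paper. It assumes Euclidean distances, uses the well-known energy-distance statistic as a stand-in test statistic, and obtains the p-value by permuting group labels. The function names, the number of permutations, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_distances(z):
    """Euclidean distance matrix between all rows of z (shape n x p)."""
    sq = np.sum(z * z, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0))


def energy_statistic(dist, idx_x, idx_y):
    """Energy-distance statistic (an assumed stand-in, not the paper's statistic):
    2 * mean between-group distance minus the two mean within-group distances."""
    between = dist[np.ix_(idx_x, idx_y)].mean()
    within_x = dist[np.ix_(idx_x, idx_x)].mean()
    within_y = dist[np.ix_(idx_y, idx_y)].mean()
    return 2.0 * between - within_x - within_y


def interpoint_permutation_test(x, y, n_perm=999, seed=0):
    """Two-sample permutation test computed from interpoint distances only.

    x, y: arrays of shape (n, p) and (m, p); p may be much larger than n + m.
    Returns the observed statistic and the permutation p-value.
    """
    rng = np.random.default_rng(seed)
    n, m = len(x), len(y)
    # The distance matrix is computed once; permutations only reindex it.
    dist = pairwise_distances(np.vstack([x, y]))
    labels = np.arange(n + m)
    observed = energy_statistic(dist, labels[:n], labels[n:])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n + m)
        if energy_statistic(dist, perm[:n], perm[n:]) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value


# Simulated example: 10 subjects per group, 200 endpoints, shifted means.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(10, 200))
group_b = rng.normal(1.0, 1.0, size=(10, 200))
stat, p = interpoint_permutation_test(group_a, group_b, n_perm=199)
print(f"statistic = {stat:.3f}, p-value = {p:.3f}")
```

Because the test works on the (n + m) x (n + m) distance matrix rather than on a p x p covariance matrix, its cost per permutation does not grow with the number of endpoints, which is why this kind of approach stays feasible when endpoints far outnumber subjects.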