Tang Lu, Song Peter X K
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.
J Mach Learn Res. 2016;17.
As data sets from related studies become more accessible, they are often combined in practice to achieve a larger sample size and higher statistical power. A major challenge in such data integration is heterogeneity across studies in population, design, or coordination. Ignoring this heterogeneity in data analysis may result in biased estimation and misleading inference. Traditional remedies include interaction terms and random effects, which fall short in achieving desirable statistical power or providing a meaningful interpretation, especially when a large number of smaller data sets are combined. In this paper, we propose a regularized fusion method that identifies and merges inter-study homogeneous parameter clusters in regression analysis without resorting to hypothesis testing. Using the fused lasso, we establish a computationally efficient procedure for large-scale integrated data. Incorporating the estimated parameter ordering in the fused lasso improves computing speed with no loss of statistical power. We conduct extensive simulation studies and provide an application example to demonstrate the performance of the new method in comparison with conventional methods.
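To make the role of the estimated parameter ordering concrete, the following is a minimal sketch under stated assumptions, not the procedure from the paper. It assumes each study has its own coefficient vector, that initial estimates come from separate least-squares fits, and that the fusion penalty is the sum of absolute differences between coefficients adjacent in the order of those initial estimates. The function names (`per_study_ols`, `ordered_fusion_penalty`, `pairwise_fusion_penalty`) and the toy numbers are hypothetical, and the abstract does not specify penalty weights or the optimizer, so both are omitted here.

```python
import numpy as np


def per_study_ols(X_list, y_list):
    """Fit a separate least-squares coefficient vector for each study.

    Returns an array of shape (K, p): one row per study.
    """
    return np.array(
        [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(X_list, y_list)]
    )


def ordered_fusion_penalty(beta, beta_init):
    """Fused-lasso-style penalty on ordered adjacent differences.

    For each covariate j, studies are sorted by their initial estimate
    beta_init[:, j], and only the K - 1 adjacent differences of beta[:, j]
    in that order are penalized (assumption: unit weights).
    """
    total = 0.0
    for j in range(beta.shape[1]):
        order = np.argsort(beta_init[:, j])
        total += np.abs(np.diff(beta[order, j])).sum()
    return float(total)


def pairwise_fusion_penalty(beta):
    """Reference penalty over all K(K - 1)/2 study pairs, for comparison."""
    K = beta.shape[0]
    total = 0.0
    for a in range(K):
        for b in range(a + 1, K):
            total += np.abs(beta[a] - beta[b]).sum()
    return float(total)


# Toy example: four studies, one covariate, two true clusters (1.0 and 3.0).
beta_init = np.array([[0.9], [3.1], [1.1], [2.9]])  # hypothetical initial estimates
beta_fused = np.array([[1.0], [3.0], [1.0], [3.0]])  # hypothetical fused solution

ordered_value = ordered_fusion_penalty(beta_fused, beta_init)
pairwise_value = pairwise_fusion_penalty(beta_fused)
```

In this toy case the ordered penalty involves three differences and evaluates to 2.0 (the single gap between the two clusters), while the all-pairs penalty involves six differences and evaluates to 8.0. The number of penalty terms grows linearly rather than quadratically in the number of studies, which is one plausible reading of why the ordering helps computing speed.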