Chang Changgee, Oh Jihwan, Long Qi
Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania.
Proc SIAM Int Conf Data Min. 2020;2020:604-612. doi: 10.1137/1.9781611976236.68.
Integrative analysis jointly analyzes multiple data sets to overcome the curse of dimensionality. It can detect important but weak signals by jointly selecting features for all data sets, but unfortunately the sets of important features are not always the same across data sets. Variations that allow a heterogeneous sparsity structure, in which a subset of data sets can have a zero coefficient for a selected feature, have been proposed, but they compromise the benefit of integrative analysis and reintroduce the problem of losing weak important signals. We propose a new integrative analysis approach that not only aggregates weak important signals well in the homogeneity setting but also substantially alleviates the problem of losing weak important signals in the heterogeneity setting. Our approach exploits an a priori known graphical structure of features by forcing joint selection of adjacent features; integrating such information over multiple data sets can increase power while accounting for heterogeneity across data sets. We confirm the problem with existing approaches and demonstrate the superiority of our method through a simulation study and an application to gene expression data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).