Yu Guan, Li Quefeng, Shen Dinggang, Liu Yufeng
Department of Biostatistics, State University of New York at Buffalo.
Department of Biostatistics, University of North Carolina at Chapel Hill.
J Am Stat Assoc. 2020;115(531):1406-1419. doi: 10.1080/01621459.2019.1632079. Epub 2019 Jul 22.
In modern scientific research, data are often collected from multiple modalities. Because different modalities can provide complementary information, statistical prediction methods that use multi-modality data can deliver better prediction performance than methods that use single-modality data. However, a particular challenge in using multi-modality data is block-missing data. In practice, because of dropouts or the high cost of measurements, the observations of a given modality can be completely missing for some subjects. In this paper, we propose a new DIrect Sparse regression procedure using COvariance from Multi-modality data (DISCOM). The proposed DISCOM method uses two steps to find the optimal linear prediction of a continuous response variable from block-missing multi-modality predictors. In the first step, rather than deleting or imputing missing data, we use all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable. The proposed estimate of the covariance matrix is a linear combination of the identity matrix, the estimate of the intra-modality covariance matrix, and the estimate of the cross-modality covariance matrix. Flexible estimates are considered for both the sub-Gaussian and heavy-tailed cases. In the second step, based on the estimated covariance matrix and the estimated cross-covariance vector, an extended Lasso-type estimator delivers a sparse estimate of the coefficients in the optimal linear prediction. The number of samples effectively used by DISCOM is the minimum number of samples with observations available from any two modalities, which can be much larger than the number of samples with complete observations from all modalities. The effectiveness of the proposed method is demonstrated by theoretical studies, simulated examples, and a real application to data from the Alzheimer's Disease Neuroimaging Initiative. A comparison between DISCOM and several existing methods also indicates the advantages of the proposed method.
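The two steps described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The function names (`pairwise_cov`, `cross_cov`, `discom_style_cov`, `covariance_lasso`), the weight names `alpha_intra` and `alpha_cross`, and the particular weighting of the identity matrix are all hypothetical choices made here. The sketch uses plain available-case sample moments (the sub-Gaussian setting) and does not cover the robust estimates for heavy-tailed data. Missing entries are assumed to be coded as NaN, the response is assumed fully observed, and the Lasso-type problem is solved by a simple coordinate descent on the objective 0.5 * b' Sigma b - c' b + lam * ||b||_1.

```python
import numpy as np

def pairwise_cov(X):
    """Available-case covariance: entry (j, k) averages over rows where both are observed."""
    obs = ~np.isnan(X)
    Xc = np.where(obs, X - np.nanmean(X, axis=0), 0.0)  # center, zero out missing cells
    counts = obs.T.astype(float) @ obs.astype(float)    # pairwise sample sizes n_jk
    return (Xc.T @ Xc) / np.maximum(counts, 1.0)

def cross_cov(X, y):
    """Available-case cross-covariance between each predictor and a fully observed y."""
    obs = ~np.isnan(X)
    Xc = np.where(obs, X - np.nanmean(X, axis=0), 0.0)
    return (Xc.T @ (y - y.mean())) / np.maximum(obs.sum(axis=0), 1)

def discom_style_cov(S, modality, alpha_intra, alpha_cross):
    """Weighted mix of intra-modality blocks, cross-modality blocks, and the identity.
    The weights and their parameterization are assumptions of this sketch."""
    same = modality[:, None] == modality[None, :]
    intra = np.where(same, S, 0.0)   # block-diagonal part
    cross = np.where(same, 0.0, S)   # off-diagonal blocks
    return alpha_intra * intra + alpha_cross * cross + (1.0 - alpha_intra) * np.eye(len(S))

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def covariance_lasso(Sigma, c, lam, n_iter=200):
    """Coordinate descent for 0.5 * b' Sigma b - c' b + lam * ||b||_1 (needs Sigma_jj > 0)."""
    beta = np.zeros(len(c))
    for _ in range(n_iter):
        for j in range(len(c)):
            # partial residual: remove coordinate j's own contribution
            r = c[j] - Sigma[j] @ beta + Sigma[j, j] * beta[j]
            beta[j] = soft_threshold(r, lam) / Sigma[j, j]
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 6
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)
    modality = np.array([0, 0, 0, 1, 1, 1])
    X[:80, 3:] = np.nan    # first 80 subjects lack modality 1 entirely
    X[120:, :3] = np.nan   # last 80 subjects lack modality 0 entirely
    Sigma = discom_style_cov(pairwise_cov(X), modality, alpha_intra=0.9, alpha_cross=0.8)
    beta = covariance_lasso(Sigma, cross_cov(X, y), lam=0.1)
    print(np.round(beta, 2))
```

In the simulated example only 40 of the 200 subjects have both modalities, yet every covariance entry is estimated from at least 40 rows and the intra-modality entries from 120 rows, which is the sense in which all available information is used. The weights and the penalty `lam` are set arbitrarily here; in practice they would need to be tuned, for example on held-out data. Shrinking toward the identity can also help keep the combined matrix well conditioned, since a pairwise available-case covariance estimate is not guaranteed to be positive semidefinite.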