Li Quefeng, Li Lexin
University of North Carolina, Chapel Hill and University of California, Berkeley.
J Am Stat Assoc. 2022;117(540):2207-2221. doi: 10.1080/01621459.2021.1914635. Epub 2021 May 20.
Multimodal data, in which different types of data are collected from the same subjects, are rapidly emerging across a wide range of scientific applications. Factor analysis is commonly used in the integrative analysis of multimodal data and is particularly useful for overcoming the curse of high dimensionality and high correlations. However, there is little work on statistical inference for factor-analysis-based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by goodness-of-fit, of one data modality given the others. In answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. To our knowledge, these questions have not yet been addressed despite the wide use of factor analysis in integrative multimodal analysis, and our proposal bridges an important gap. We study the empirical performance of our methods through simulations and illustrate them with a multimodal neuroimaging analysis.
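To make the setting concrete, the following is a minimal illustrative sketch of a regression built on latent factors extracted from two modalities. It is not the authors' procedure: the use of principal components as factor estimates, the simulated data, the number of factors `k`, and all function names (`extract_factors`, `r_squared`, `simulate`, `modality_contribution`) are assumptions made here for illustration. The sketch reports only the raw gain in R-squared from adding the second modality's factors; it does not reproduce the inferential results the abstract describes.

```python
import numpy as np


def extract_factors(X, k):
    """Estimate k factors of X (n x p) by its top-k principal-component scores.

    Assumption: PCA stands in for the factor-extraction step.
    """
    Xc = X - X.mean(axis=0)  # center each variable
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return np.sqrt(X.shape[0]) * U[:, :k]  # unit-variance factor scores


def r_squared(y, Z):
    """R-squared of an ordinary least squares fit of y on Z plus an intercept."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    resid = y - Z1 @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss


def simulate(n=200, p1=50, p2=40, k=2, seed=0):
    """Hypothetical data: two modalities, each driven by its own k latent factors."""
    rng = np.random.default_rng(seed)
    F1 = rng.normal(size=(n, k))
    F2 = rng.normal(size=(n, k))
    X1 = F1 @ rng.normal(size=(k, p1)) + 0.5 * rng.normal(size=(n, p1))
    X2 = F2 @ rng.normal(size=(k, p2)) + 0.5 * rng.normal(size=(n, p2))
    # The response depends on the latent factors of both modalities.
    y = F1 @ np.ones(k) + 0.8 * (F2 @ np.ones(k)) + rng.normal(size=n)
    return X1, X2, y


def modality_contribution(X1, X2, y, k=2):
    """Return (R2 with modality 1 only, R2 with both, gain from modality 2)."""
    f1 = extract_factors(X1, k)
    f2 = extract_factors(X2, k)
    r2_reduced = r_squared(y, f1)
    r2_full = r_squared(y, np.column_stack([f1, f2]))
    return r2_reduced, r2_full, r2_full - r2_reduced


if __name__ == "__main__":
    X1, X2, y = simulate()
    r2_reduced, r2_full, gain = modality_contribution(X1, X2, y)
    print(f"R2 (modality 1 factors only): {r2_reduced:.3f}")
    print(f"R2 (both modalities):         {r2_full:.3f}")
    print(f"Gain from modality 2:         {gain:.3f}")
```

Because the two models are nested, the gain is nonnegative by construction; judging whether it is statistically significant, and accounting for the error introduced by estimating the factors, is the inferential problem the article addresses.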