Palzer Elise F, Wendt Christine H, Bowler Russell P, Hersh Craig P, Safo Sandra E, Lock Eric F
Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA.
Division of Pulmonary, Allergy and Critical Care, University of Minnesota, Minneapolis, 55455, USA.
Comput Stat Data Anal. 2022 Nov;175. doi: 10.1016/j.csda.2022.107547. Epub 2022 Jun 14.
Analyzing multi-source data, which are multiple views of data on the same subjects, has become increasingly common in molecular biomedical research. Recent methods have sought to uncover underlying structure and relationships within and/or between the data sources, and other methods have sought to build a predictive model for an outcome using all sources. However, existing methods that do both are limited because they either (1) consider only data structure shared by all datasets while ignoring structures unique to each source, or (2) extract underlying structures first without considering the outcome. The proposed method, supervised joint and individual variation explained (sJIVE), can simultaneously (1) identify shared (joint) and source-specific (individual) underlying structure and (2) build a linear prediction model for an outcome using these structures. These two components are weighted to compromise between explaining variation in the multi-source data and in the outcome. Simulations show that sJIVE outperforms existing methods when large amounts of noise are present in the multi-source data. An application to data from the COPDGene study explores gene expression and proteomic patterns associated with lung function.
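The weighting between the two components can be illustrated with a small numerical sketch. The abstract does not state the objective function, so everything below is an assumption made for illustration: the loss is taken to be a squared reconstruction error over all sources plus a squared prediction error, blended by a hypothetical weight `eta` in [0, 1]. The function name, argument names, and matrix layout (features in rows, subjects in columns) are likewise hypothetical and are not taken from any sJIVE software.

```python
import numpy as np


def weighted_joint_individual_loss(X_list, y, U_list, W_list, S_joint,
                                   S_indiv_list, theta_joint,
                                   theta_indiv_list, eta):
    """Illustrative weighted loss (assumed form, not the published objective).

    X_list[k]           : (p_k, n) data matrix for source k
    U_list[k]           : (p_k, r_J) joint loadings for source k
    W_list[k]           : (p_k, r_k) individual loadings for source k
    S_joint             : (r_J, n) joint scores shared by all sources
    S_indiv_list[k]     : (r_k, n) individual scores for source k
    theta_joint         : (r_J,) regression coefficients on joint scores
    theta_indiv_list[k] : (r_k,) coefficients on source k's individual scores
    eta                 : weight on the data-reconstruction term, in [0, 1]
    """
    reconstruction_error = 0.0
    # Linear prediction starts from the joint scores.
    y_hat = theta_joint @ S_joint
    for X, U, W, S_k, theta_k in zip(X_list, U_list, W_list,
                                     S_indiv_list, theta_indiv_list):
        # Each source is approximated by joint plus individual structure.
        residual = X - U @ S_joint - W @ S_k
        reconstruction_error += np.sum(residual ** 2)
        # Individual scores also contribute to the prediction.
        y_hat = y_hat + theta_k @ S_k
    prediction_error = np.sum((y - y_hat) ** 2)
    # eta near 1 favors explaining the data; eta near 0 favors the outcome.
    return eta * reconstruction_error + (1.0 - eta) * prediction_error
```

Under this assumed form, setting `eta` to 1 ignores the outcome entirely and reduces to an unsupervised decomposition, while smaller values let the outcome increasingly shape the estimated scores.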