Xiang Shuo, Yuan Lei, Fan Wei, Wang Yalin, Thompson Paul M, Ye Jieping
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA.
Huawei Noah's Ark Lab, Hong Kong.
Neuroimage. 2014 Nov 15;102 Pt 1:192-206. doi: 10.1016/j.neuroimage.2013.08.015. Epub 2013 Aug 27.
Bio-imaging technologies allow scientists to collect large amounts of high-dimensional data from multiple heterogeneous sources for many biomedical applications. In the study of Alzheimer's Disease (AD), neuroimaging data, gene/protein expression data, etc., are often analyzed together to improve predictive power. Joint learning from multiple complementary data sources is advantageous, but feature pruning and data source selection are critical for learning interpretable models from high-dimensional data. Often, the collected data have block-wise missing entries. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), most subjects have MRI and genetic information, but only half have cerebrospinal fluid (CSF) measures, a different half have FDG-PET, and only some have proteomic data. Here we propose a way to effectively integrate information from multiple heterogeneous data sources when data are block-wise missing. We present a unified "bi-level" learning model for complete multi-source data, and extend it to incomplete data. Our major contributions are: (1) our proposed models unify feature-level and source-level analysis, including several existing feature learning approaches as special cases; (2) the model for incomplete data avoids imputing missing data and offers superior performance; it generalizes to other applications with block-wise missing data sources; (3) we present efficient optimization algorithms for modeling complete and incomplete data. We comprehensively evaluate the proposed models including all ADNI subjects with at least one of four data types at baseline: MRI, FDG-PET, CSF and proteomics. Our proposed models compare favorably with existing approaches.
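To make "block-wise missing" concrete, the following is a minimal sketch of how subjects might be grouped by which data sources they have, so that each group can be handled without imputation. It is an illustration only, not the authors' model or optimization algorithm. The function name `group_by_available_sources`, the toy availability table, and the three-column layout (MRI, FDG-PET, CSF) are assumptions made for this example.

```python
def group_by_available_sources(available):
    """Group sample indices by their source-availability pattern.

    `available` is a list of rows, one per sample; each row holds one
    boolean per data source (True if the sample has that source).
    Returns a dict mapping each pattern (a tuple of bools) to the list
    of sample indices that share it.
    """
    groups = {}
    for index, row in enumerate(available):
        pattern = tuple(bool(flag) for flag in row)
        groups.setdefault(pattern, []).append(index)
    return groups


# Hypothetical toy table; columns stand for MRI, FDG-PET, CSF.
available = [
    [True, True, False],   # sample 0: MRI + FDG-PET
    [True, False, True],   # sample 1: MRI + CSF
    [True, True, False],   # sample 2: MRI + FDG-PET
    [True, True, True],    # sample 3: all three sources
    [True, False, True],   # sample 4: MRI + CSF
]

groups = group_by_available_sources(available)

# Samples with every source present form the "complete" block.
complete = groups.get((True, True, True), [])
```

Each group is a block of subjects whose present sources are fully observed, which is the structure that a model for block-wise missing data can exploit instead of filling in the absent sources.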