Yuan Lei, Wang Yalin, Thompson Paul M, Narayan Vaibhav A, Ye Jieping
Center for Evolutionary Medicine and Informatics, The Biodesign Institute, ASU, Tempe, AZ; Department of Computer Science and Engineering, ASU, Tempe, AZ.
KDD. 2012:1149-1157. doi: 10.1145/2339530.2339710.
Incomplete data present serious problems when integrating large-scale brain imaging data sets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; and many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. We address this problem by proposing two novel learning methods in which every sample with at least one available data source can be used. In the first method, we divide the samples according to the availability of data sources and learn shared sets of features with state-of-the-art sparse learning methods. The second method learns a base classifier for each data source independently and uses it to represent each source as a single column of prediction scores; we then estimate the missing prediction scores and combine them with the existing scores to build a multi-source fusion model. To illustrate the proposed approaches, we classify participants from the ADNI study into three groups, Alzheimer's disease (AD), mild cognitive impairment (MCI), and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172 AD, 397 MCI, 211 normal) have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF, and proteomics. These data are used to test our algorithms. Comprehensive experiments show that the proposed methods yield stable and promising results.
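The two data-handling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The source labels, the function names `group_by_availability` and `fill_missing_scores`, and the toy values are all hypothetical. Column-mean filling is only a placeholder assumption, because the abstract does not say how the missing prediction scores are estimated. The sparse learning step and the fusion model are omitted.

```python
from collections import defaultdict

# Hypothetical labels for the four data types named in the abstract.
SOURCES = ("MRI", "PET", "CSF", "PROT")


def group_by_availability(samples):
    """Method one, first step: partition samples by which sources they have.

    Returns a dict mapping each availability pattern (a frozenset of source
    names) to the indices of the samples with exactly that pattern.
    """
    groups = defaultdict(list)
    for i, sample in enumerate(samples):
        pattern = frozenset(s for s in SOURCES if sample.get(s) is not None)
        if pattern:  # every sample with at least one source is kept
            groups[pattern].append(i)
    return dict(groups)


def fill_missing_scores(scores):
    """Method two, middle step: fill in missing per-source prediction scores.

    `scores` is a samples x sources table with None for a missing score.
    ASSUMPTION: each gap is filled with the mean of the observed scores in
    its column; this stands in for the paper's unspecified estimator.
    """
    filled = [list(row) for row in scores]
    for j in range(len(scores[0])):
        observed = [row[j] for row in scores if row[j] is not None]
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled


# Toy data (made up for illustration, not ADNI values).
samples = [
    {"MRI": [0.1, 0.2], "PET": [1.0], "CSF": None, "PROT": None},
    {"MRI": [0.3, 0.1], "PET": None, "CSF": [2.0], "PROT": None},
    {"MRI": [0.2, 0.4], "PET": [0.5], "CSF": None, "PROT": None},
]
groups = group_by_availability(samples)

# One column of base-classifier scores per source; None marks a missing one.
scores = [[0.9, 0.7, None], [0.2, None, 0.4], [0.6, 0.3, None]]
filled = fill_missing_scores(scores)
```

In this sketch, the completed score table `filled` is what a fusion classifier would be trained on, and each group in `groups` would be passed to a shared feature-learning step.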