Lyu Tianmeng, Lock Eric F, Eberly Lynn E
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA.
Biostatistics. 2017 Jul 1;18(3):434-450. doi: 10.1093/biostatistics/kxw057.
High-dimensional linear classifiers, such as distance weighted discrimination (DWD) and versions of the support vector machine (SVM), are commonly used in biomedical research to distinguish groups of subjects based on a large number of features. However, their use is limited to applications where a single vector of features is measured for each subject. In practice, data are often multi-way, or measured over multiple dimensions. For example, metabolite abundance may be measured over multiple regions or tissues, or gene expression may be measured over multiple time points, for the same subjects. We propose a framework for linear classification of high-dimensional multi-way data, in which coefficients can be factorized into weights that are specific to each dimension. More generally, the coefficients for each measurement in a multi-way dataset are assumed to have low-rank structure. This framework extends existing classification techniques from single vector to multi-way features, and we have implemented multi-way versions of SVM and DWD. We describe informative simulation results, and apply multi-way DWD to data for two very different clinical research studies. The first study uses magnetic resonance spectroscopy metabolite data over multiple brain regions to compare participants with and without spinocerebellar ataxia; the second uses publicly available gene expression time-course data to compare degrees of treatment response among patients with multiple sclerosis. Our multi-way method can improve performance and simplify interpretation over naive applications of full rank linear and non-linear classification to multi-way data. The R package is available at https://github.com/lockEF/MultiwayClassification.
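As a rough illustration of the factorized-coefficient idea described above, the following minimal Python sketch scores one subject whose features form a two-way array (for example, metabolites by brain regions). It assumes the simplest rank-1 case, in which the coefficient matrix is the outer product of one weight vector per dimension. The function names, toy data, and weights are hypothetical and are not taken from the MultiwayClassification R package; the sketch shows only how a fitted rank-1 classifier would score a subject, not how DWD or SVM estimate the weights.

```python
# Illustrative sketch only: all names and numbers here are hypothetical and
# are not part of the MultiwayClassification R package.

def rank1_score(X, u, v, b=0.0):
    """Linear score for one subject whose features form a p x d matrix X.

    Assumes the coefficient matrix is the rank-1 outer product u v^T,
    so the score equals u^T X v + b.
    """
    return sum(u[i] * X[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v))) + b


def outer_coefficients(u, v):
    """Expand the factorized weights into the full p x d coefficient matrix."""
    return [[ui * vj for vj in v] for ui in u]


def full_score(X, B, b=0.0):
    """Ordinary linear score with an unrestricted p x d coefficient matrix B."""
    return sum(B[i][j] * X[i][j]
               for i in range(len(B))
               for j in range(len(B[0]))) + b


# Toy data: 3 features (rows) measured over 2 regions (columns); values are made up.
X = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.0]]
u = [0.5, -1.0, 0.25]  # hypothetical weights for the first dimension (features)
v = [2.0, 1.0]         # hypothetical weights for the second dimension (regions)
b = -0.5               # hypothetical intercept

score = rank1_score(X, u, v, b)
label = 1 if score >= 0 else -1   # classify by the sign of the score

# The factorized model needs p + d weights; the full-rank model needs p * d.
n_factorized = len(u) + len(v)
n_full = len(u) * len(v)
```

The factorized score agrees with the score obtained from the expanded coefficient matrix, `full_score(X, outer_coefficients(u, v), b)`. The saving in parameters (p + d rather than p × d) is small in this toy case but grows quickly when both dimensions are large, and each weight vector can be read as the contribution of one dimension on its own.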