Jung Sungkyu, Qiao Xingye
Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, U.S.A.
Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, New York 13902-6000, U.S.A.
Biometrics. 2014 Sep;70(3):536-45. doi: 10.1111/biom.12164. Epub 2014 Mar 3.
Set classification problems arise when classification tasks are based on sets of observations as opposed to individual observations. In set classification, a classification rule is trained with N sets of observations, where each set is labeled with class information, and the prediction of a class label is performed also with a set of observations. Data sets for set classification appear, for example, in diagnostics of disease based on multiple cell nucleus images from a single tissue. Relevant statistical models for set classification are introduced, which motivate a set classification framework based on context-free feature extraction. By understanding a set of observations as an empirical distribution, we employ a data-driven method to choose those features which contain information on location and major variation. In particular, the method of principal component analysis is used to extract the features of major variation. Multidimensional scaling is used to represent features as vector-valued points on which conventional classifiers can be applied. The proposed set classification approaches achieve better classification results than competing methods in a number of simulated data examples. The benefits of our method are demonstrated in an analysis of histopathology images of cell nuclei related to liver cancer.
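The pipeline described above (treat each set as an empirical distribution, extract location and major-variation features, embed them by multidimensional scaling, then apply a conventional classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `set_features`, `set_distance`, `classical_mds` and `simulate_set` are hypothetical, the combined mean-plus-projection distance and the nearest-centroid classifier are stand-ins chosen for brevity, the data are synthetic, and training and held-out sets are embedded jointly for simplicity.

```python
import numpy as np

def set_features(X, k=1):
    """Sample mean and top-k principal directions of one set (rows = observations)."""
    mu = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T  # shapes (d,) and (d, k)

def set_distance(f, g):
    """Assumed distance: location difference combined with subspace difference."""
    (mu_f, V_f), (mu_g, V_g) = f, g
    loc = np.linalg.norm(mu_f - mu_g)
    # Comparing projection matrices avoids the sign ambiguity of PC directions.
    var = np.linalg.norm(V_f @ V_f.T - V_g @ V_g.T)
    return np.sqrt(loc**2 + var**2)

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS: coordinates from a pairwise distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D**2) @ J                # double-centered Gram matrix
    w, U = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]          # largest eigenvalues first
    return U[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

# Synthetic example (assumption): both classes share the same mean and
# differ only in the direction of major variation.
rng = np.random.default_rng(0)
d, n_obs, n_sets = 5, 100, 20

def simulate_set(label):
    scale = np.ones(d)
    scale[label] = 3.0                       # class 0 varies along axis 0, class 1 along axis 1
    return rng.normal(size=(n_obs, d)) * scale

labels = np.repeat([0, 1], n_sets)
feats = [set_features(simulate_set(y)) for y in labels]

# Pairwise distances between sets, then a vector-valued representation.
D = np.array([[set_distance(f, g) for g in feats] for f in feats])
Z = classical_mds(D, dim=2)

# Conventional classifier on the embedded points: nearest class centroid,
# trained on the first 15 sets of each class and evaluated on the rest.
train = np.tile(np.arange(n_sets) < 15, 2)
centroids = np.stack([Z[train & (labels == c)].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(Z[~train][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == labels[~train]).mean()
print(f"held-out set accuracy: {accuracy:.2f}")
```

Because the two simulated classes have identical means, a classifier that looks only at set averages would have little to work with here; the principal-direction feature is what separates them in this toy setting.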