Landeros Alfonso, Ko Seyoon, Chang Jack Z, Wu Tong Tong, Lange Kenneth
Department of Statistics, University of California, Riverside, CA, 92521-0001, United States of America.
Departments of Mathematics and Biostatistics, University of California, Los Angeles, CA, 90095-1554, United States of America.
Comput Stat Data Anal. 2025 Jun;206. doi: 10.1016/j.csda.2025.108125. Epub 2025 Jan 7.
Modern biomedical datasets are often high-dimensional at multiple levels of biological organization. Practitioners must therefore grapple with data to estimate sparse or low-rank structures so as to adhere to the principle of parsimony. Further complicating matters is the presence of groups in data, each of which may have distinct associations with explanatory variables or be characterized by fundamentally different covariates. These themes in data analysis are explored in the context of classification. Vertex Discriminant Analysis (VDA) offers flexible linear and nonlinear models for classification that generalize the advantages of support vector machines to data with multiple classes. The proximal distance principle, which leverages projection and proximal operators in the design of practical algorithms, handily facilitates variable selection in VDA via nonconvex distance-to-set penalties directly controlling the number of active variables. Two flavors of sparse VDA are developed to address data in which instances may be homogeneous or heterogeneous with respect to predictors characterizing classes. Empirical studies illustrate how VDA is adapted to class-specific variable selection on simulated and real datasets, with an emphasis on applications to cancer classification via gene expression patterns.
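As a rough illustration of the distance-to-set idea mentioned in the abstract, the sketch below projects a coefficient vector onto the set of vectors with at most k nonzero entries and evaluates a squared-distance penalty against that set. This is not the authors' implementation. The function names `project_sparse` and `distance_penalty`, the penalty weight `rho`, and the restriction to a single 1-D coefficient vector are all assumptions made for this example.

```python
import numpy as np

def project_sparse(x, k):
    """Euclidean projection of a 1-D vector onto {z : z has at most k nonzeros}.

    Keeps the k entries of largest magnitude and zeroes the rest.
    (Hypothetical helper for illustration only.)
    """
    x = np.asarray(x, dtype=float)
    z = np.zeros_like(x)
    if k <= 0:
        return z
    k = min(k, x.size)
    keep = np.argsort(np.abs(x))[-k:]  # indices of the k largest magnitudes
    z[keep] = x[keep]
    return z

def distance_penalty(x, k, rho=1.0):
    """Squared-distance penalty (rho / 2) * dist(x, S_k)^2.

    rho is an assumed penalty weight; S_k is the sparsity set above.
    """
    x = np.asarray(x, dtype=float)
    r = x - project_sparse(x, k)  # residual to the nearest k-sparse vector
    return 0.5 * rho * float(r @ r)

beta = np.array([3.0, -1.0, 0.5, -4.0])
print(project_sparse(beta, 2))    # the two largest-magnitude entries survive
print(distance_penalty(beta, 2))  # 0.5 * (1.0**2 + 0.5**2) = 0.625
```

The penalty is zero exactly when the vector already has at most k nonzero entries, which is how a penalty of this form can control the number of active variables directly through k.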