Yu Ming, Natesan Ramamurthy Karthikeyan, Thompson Addie, Lozano Aurélie C
Booth School of Business, The University of Chicago, Chicago, IL, United States.
IBM Research, Yorktown Heights, NY, United States.
Front Big Data. 2019 Aug 14;2:27. doi: 10.3389/fdata.2019.00027. eCollection 2019.
We consider multi-response and multi-task regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, along features, or along both, with the last case yielding a bi-cluster or "checkerboard" structure. Discovering this grouping structure jointly with parameter inference is useful in several applications, such as multi-response Genome-Wide Association Studies (GWAS). By inferring this additional structure we can obtain valuable information on the underlying data mechanisms (e.g., relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations, based on convex regularization penalties, to simultaneously learn the parameter matrix and its group structures. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Extensive experiments demonstrate better clustering quality than competing methods, and our approaches are also validated on real datasets concerning phenotypes and genotypes of plant varieties.