Li Yujia, Liu Peng, Wang Wenjia, Zong Wei, Fang Yusi, Ren Zhao, Tang Lu, Celedón Juan C, Oesterreich Steffi, Tseng George C
University of Pittsburgh.
Ann Appl Stat. 2024 Sep;18(3):1947-1964. doi: 10.1214/23-aoas1865. Epub 2024 Aug 5.
With advances in high-throughput technology, molecular disease subtyping based on high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. Omics data, however, usually contain multi-faceted cluster structures that can be defined by different sets of genes. If a gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the clustering framework developed in this paper, which is guided by a pre-specified disease outcome such as a lung function measurement or survival. We propose two methods for outcome-guided disease subtyping from omics data, one based on a generative model and the other on a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model through a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter that balances the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computation. Extensive simulations and two real applications, in lung disease and triple-negative breast cancer, demonstrate that the outcome-guided clustering methods perform better in subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm that directly identifies patient subgroups with clinical association.
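To make the idea of a weighted joint likelihood concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian gene expression with cluster-specific means, a Gaussian outcome with cluster-specific means, and a weight `w` in [0, 1] that is passed in as a fixed number, whereas the abstract describes the weight as data-driven. The function name, the parameter names, and the exact form of the weighting are all hypothetical. The sketch computes each patient's posterior probability of belonging to each latent cluster.

```python
import numpy as np


def gaussian_logpdf(x, mean, sd):
    """Elementwise log density of a normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sd ** 2) - (x - mean) ** 2 / (2.0 * sd ** 2)


def weighted_responsibilities(X, y, pi, mu, sigma, beta, tau, w):
    """Posterior cluster probabilities under an ASSUMED weighted joint likelihood.

    X     : (n, G) gene expression matrix
    y     : (n,)   continuous outcome (e.g. a lung function measurement)
    pi    : (K,)   cluster proportions
    mu    : (K, G) cluster-specific gene means
    sigma : (G,)   gene-specific standard deviations
    beta  : (K,)   cluster-specific outcome means
    tau   : float  outcome standard deviation
    w     : float  weight in [0, 1]; 0 ignores the outcome, 1 ignores the genes
    """
    # Log-likelihood of each sample's genes under each cluster: shape (n, K).
    gene_ll = gaussian_logpdf(
        X[:, None, :], mu[None, :, :], sigma[None, None, :]
    ).sum(axis=2)
    # Log-likelihood of each sample's outcome under each cluster: shape (n, K).
    outcome_ll = gaussian_logpdf(y[:, None], beta[None, :], tau)
    # Hypothetical weighting: convex combination of the two log-likelihoods.
    log_post = np.log(pi)[None, :] + (1.0 - w) * gene_ll + w * outcome_ll
    # Normalize in log space for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

With `w = 0` the assignment reduces to ordinary model-based clustering on the genes alone. Increasing `w` lets the outcome pull patients toward clusters that also explain their clinical measurement, which is the trade-off the weight parameter is meant to balance.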