Chen Yuan, Shen Ronglai, Feng Xiwen, Panageas Katherine
Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY 10017, United States.
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States.
Biometrics. 2024 Oct 3;80(4). doi: 10.1093/biomtc/ujae146.
Cancer is a complex disease driven by genomic alterations, and tumor sequencing is becoming a mainstay of clinical care for cancer patients. The emergence of multi-institution sequencing data presents a powerful resource for generating real-world evidence to enhance precision oncology. GENIE BPC, led by the American Association for Cancer Research, establishes a unique database linking genomic data with clinical information for patients treated at multiple cancer centers. However, leveraging sequencing data from multiple institutions presents significant challenges. Variability in gene panels can lead to loss of information when analyses are restricted to the genes common to all panels. Additionally, differences in sequencing techniques and patient heterogeneity across institutions add complexity. High data dimensionality, sparse gene mutation patterns, and weak signals at the individual gene level further complicate matters. Motivated by these real-world challenges, we introduce the Bridge model. It uses a quantile-matched latent variable approach to derive integrated features that preserve information beyond the common genes and make use of all available data, while leveraging information sharing to enhance both learning efficiency and the model's capacity to generalize. By extracting harmonized, noise-reduced, lower-dimensional latent variables, the model captures the true mutation pattern unique to each individual. We assess the model's performance and parameter estimation through extensive simulation studies. The latent features extracted by the Bridge model consistently excel in predicting patient survival across six cancer types in GENIE BPC data.
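To make two ideas in the abstract concrete, here is a minimal sketch in Python. It is not the Bridge model, whose latent-variable procedure the abstract does not specify. The first part shows how restricting to genes shared by all panels discards information. The panel names and gene lists are hypothetical and are not the actual GENIE BPC panels. The second part shows generic rank-based quantile matching, swapped in only to illustrate what it means to place one institution's scores on a common reference distribution. The function name `quantile_match` is likewise an assumption made for illustration.

```python
import numpy as np

# Hypothetical gene panels for three institutions (illustrative only).
panels = {
    "center_A": {"TP53", "KRAS", "EGFR", "PIK3CA", "BRAF"},
    "center_B": {"TP53", "KRAS", "EGFR", "STK11"},
    "center_C": {"TP53", "KRAS", "ALK", "PIK3CA"},
}

# Genes sequenced at every center versus genes sequenced at any center.
common = set.intersection(*panels.values())
union = set.union(*panels.values())

# Genes that a common-genes-only analysis would ignore.
dropped = union - common


def quantile_match(values, reference):
    """Map each value to the reference quantile at its empirical rank.

    Generic quantile matching, not the Bridge model's procedure.
    Ties in `values` are broken by position, which is a simplification.
    """
    values = np.asarray(values, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ranks = values.argsort().argsort()        # rank 0..n-1 of each value
    probs = (ranks + 0.5) / len(values)       # mid-rank probabilities in (0, 1)
    return np.quantile(reference, probs)      # reference values at those quantiles


# Hypothetical per-patient scores from one center, matched to a reference scale.
matched = quantile_match([10.0, 30.0, 20.0], [1.0, 2.0, 3.0])
```

In this toy setup only two of the seven genes survive the intersection, which is the kind of information loss the abstract describes. The matched scores keep the original ordering of patients while taking values on the reference scale.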