Jee Justin, Fong Christopher, Pichotta Karl, Tran Thinh Ngoc, Luthra Anisha, Waters Michele, Fu Chenlian, Altoe Mirella, Liu Si-Yang, Maron Steven B, Ahmed Mehnaj, Kim Susie, Pirun Mono, Chatila Walid K, de Bruijn Ino, Pasha Arfath, Kundra Ritika, Gross Benjamin, Mastrogiacomo Brooke, Aprati Tyler J, Liu David, Gao JianJiong, Capelletti Marzia, Pekala Kelly, Loudon Lisa, Perry Maria, Bandlamudi Chaitanya, Donoghue Mark, Satravada Baby Anusha, Martin Axel, Shen Ronglai, Chen Yuan, Brannon A Rose, Chang Jason, Braunstein Lior, Li Anyi, Safonov Anton, Stonestrom Aaron, Sanchez-Vela Pablo, Wilhelm Clare, Robson Mark, Scher Howard, Ladanyi Marc, Reis-Filho Jorge S, Solit David B, Jones David R, Gomez Daniel, Yu Helena, Chakravarty Debyani, Yaeger Rona, Abida Wassim, Park Wungki, O'Reilly Eileen M, Garcia-Aguilar Julio, Socci Nicholas, Sanchez-Vega Francisco, Carrot-Zhang Jian, Stetson Peter D, Levine Ross, Rudin Charles M, Berger Michael F, Shah Sohrab P, Schrag Deborah, Razavi Pedram, Kehl Kenneth L, Li Bob T, Riely Gregory J, Schultz Nikolaus
Memorial Sloan Kettering Cancer Center, New York, NY, USA.
Dana Farber Cancer Institute, Boston, MA, USA.
Nature. 2024 Dec;636(8043):728-736. doi: 10.1038/s41586-024-08167-5. Epub 2024 Nov 6.
The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.
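The integration step described above, combining separately stored data sources into one record per patient, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function `harmonize`, the field names (`patient_id`, `liver_met`, `SETD2_mutated`) and the toy records are all hypothetical, chosen only to show the general idea of keying siloed tables on a shared patient identifier.

```python
from collections import defaultdict


def harmonize(*sources):
    """Merge per-patient records from several named sources.

    Each source is a (name, records) pair, where records is a list of
    dicts that all carry a "patient_id" key (a hypothetical schema).
    Returns one dict per patient, with fields prefixed by source name
    so that columns from different silos cannot collide.
    """
    merged = defaultdict(dict)
    for source_name, records in sources:
        for rec in records:
            pid = rec["patient_id"]
            for key, value in rec.items():
                if key != "patient_id":
                    merged[pid][f"{source_name}.{key}"] = value
    return dict(merged)


# Hypothetical toy inputs: one NLP-derived table, one genomic table.
nlp = [{"patient_id": "P1", "liver_met": True}]
genomic = [
    {"patient_id": "P1", "SETD2_mutated": False},
    {"patient_id": "P2", "SETD2_mutated": True},
]

rows = harmonize(("nlp", nlp), ("genomic", genomic))
```

Patients missing from one source (here `P2` has no NLP annotations) simply lack those fields, so a downstream model would need to handle missing values explicitly.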