Development and validation of machine learning models based on molecular features for estimating the probability of multiple primary lung carcinoma versus intrapulmonary metastasis in patients presenting multiple non-small cell lung cancers.

Author information

Liu Ning, Li Xue, Luo Xu, Liu Bin, Tang Jie, Xiao Fei, Wang Weiya, Tang Yuan, Shu Pei, Zhang Benxia, Chen Yue, Qin Diyuan, Ma Qizhi, Guo Fuchun, Tang Xiaojun, Zhu Daxing, Mei Jiandong, Chen Weizhi, Li Dan, Jiang Lili, Wang Yongsheng

Affiliations

Division of Thoracic Tumor Multimodality Treatment, Cancer Center, West China Hospital, Sichuan University, Chengdu, China.

Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, China.

Publication information

Transl Lung Cancer Res. 2025 Apr 30;14(4):1118-1137. doi: 10.21037/tlcr-24-875. Epub 2025 Apr 25.

Abstract

BACKGROUND

Classifying multiple non-small cell lung cancers (NSCLCs) as multiple primary lung cancers (MPLCs) or intrapulmonary metastases (IPMs) is critical but remains challenging. This study aimed to develop and validate machine learning (ML) models based on molecular features for estimating the probability of MPLC versus IPM in patients presenting with multiple NSCLCs.

METHODS

A total of 72 patients with multiple NSCLCs, comprising 157 surgically resected tumor lesions collected from January 2012 to January 2018 at two institutions, were included for model development and testing. Specifically, 46 patients with 103 tumors that were classified as definitive MPLC or IPM according to International Association for the Study of Lung Cancer (IASLC) criteria were used to develop the models. They were split into training and validation sets using stratified random sampling and five-fold cross-validation. The developed models were then tested in the other 26 patients, whose tumors could not be classified by traditional methods. Whole-exome sequencing (WES) was performed on all included tumor samples. Four molecular features were calculated to characterize tumor relatedness and served as model inputs: genetic divergence, shared mutation number, Pearson correlation coefficient, and early mutation number. Decision tree (DT), random forest (RF), and gradient boosting decision tree (GBDT) models were employed, with performance assessed by area under the curve (AUC), accuracy, precision, recall, and F1 score in the validation set. Disease-free survival (DFS) was used to evaluate model performance in the test cohort. Clinical and genetic characteristics were then compared between the MPLC and IPM populations.
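Two of the four features, shared mutation number and the Pearson correlation coefficient, can be illustrated with a short sketch. The abstract does not define how any feature was computed, so everything below is an assumption: the function names, the hypothetical mutation identifiers, the use of variant allele frequencies (VAFs) as the correlated quantity, and the choice to treat a mutation absent from one tumor as VAF 0. Genetic divergence and early mutation number are omitted because their definitions are not given here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant vector; 0.0 is a convention
        # chosen for this sketch only.
        return 0.0
    return cov / sqrt(var_x * var_y)


def pairwise_features(vaf_a, vaf_b):
    """Compare two tumors, each given as {mutation_id: VAF}.

    Assumption: a mutation missing from one tumor is treated as VAF 0 there.
    """
    shared = set(vaf_a) & set(vaf_b)
    all_mutations = sorted(set(vaf_a) | set(vaf_b))
    xs = [vaf_a.get(m, 0.0) for m in all_mutations]
    ys = [vaf_b.get(m, 0.0) for m in all_mutations]
    return {
        "shared_mutation_number": len(shared),
        "pearson_r": pearson(xs, ys),
    }


# Hypothetical tumor pairs (mutation IDs and VAFs are made up for illustration).
related = pairwise_features(
    {"m1": 0.40, "m2": 0.30, "m3": 0.10},
    {"m1": 0.35, "m2": 0.25, "m4": 0.10},
)
unrelated = pairwise_features(
    {"m1": 0.40, "m2": 0.30},
    {"m3": 0.50, "m4": 0.20},
)
print(related)
print(unrelated)
```

In this toy input the pair that shares mutations yields a positive correlation and the pair with no shared mutations yields a negative one, which matches the direction reported for IPM versus MPLC.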

RESULTS

All four molecular features differed significantly between MPLC and IPM patients in the development cohort: MPLC exhibited higher genetic divergence and lower shared mutation number, Pearson correlation coefficient, and early mutation number than IPM (P<0.001). The DT, RF, and GBDT models developed with these features achieved mean AUCs of 0.94 [standard deviation (SD) 0.09], 1.00 (SD 0.00), and 1.00 (SD 0.00) in the validation set, respectively. The three models consistently classified the undetermined multiple NSCLCs as MPLC (n=15) or IPM (n=11). Patients with MPLC identified by the ML models had significantly longer DFS than those with IPM [hazard ratio =0.21; 95% confidence interval (CI): 0.04-1.0; P=0.04]. MPLC patients had a relatively higher prevalence of cancer in first-degree relatives, and more than half of these patients reported a family history of lung cancer. EGFR remained the most commonly mutated driver gene in both the MPLC and IPM populations.
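How a figure such as "mean AUC (SD)" arises from stratified five-fold cross-validation can be sketched with the standard library alone. This is not the authors' pipeline: a single-feature threshold rule stands in for the DT, RF, and GBDT models, the data are synthetic, the label coding (1 = IPM, 0 = MPLC) is a convention chosen here, and the use of the sample SD across folds is an assumption, since the abstract does not say which SD was reported.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def rank_auc(scores, labels):
    """AUC as the chance a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_stump(xs, ys):
    """Stand-in model: threshold t maximizing training accuracy (1 if x > t)."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def accuracy(t):
        return sum((x > t) == (y == 1) for x, y in zip(xs, ys)) / len(xs)

    return max(candidates, key=accuracy)


def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, validate on the held-out fold, summarize over folds."""
    aucs, accs = [], []
    for held_out in stratified_folds(ys, k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        t = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        val_x = [xs[i] for i in held_out]
        val_y = [ys[i] for i in held_out]
        accs.append(sum((x > t) == (y == 1)
                        for x, y in zip(val_x, val_y)) / len(val_x))
        # The raw feature value serves as the ranking score for the AUC.
        aucs.append(rank_auc(val_x, val_y))
    return {"auc_mean": mean(aucs), "auc_sd": stdev(aucs),
            "acc_mean": mean(accs), "acc_sd": stdev(accs)}


# Synthetic shared-mutation counts: low for label 0 (MPLC), higher for 1 (IPM).
rng = random.Random(42)
labels = [0] * 20 + [1] * 20
shared_counts = [rng.randint(0, 3) if y == 0 else rng.randint(2, 30)
                 for y in labels]
summary = cross_validate(shared_counts, labels)
print(summary)
```

With only a handful of validation samples per fold, per-fold AUCs of exactly 1.00 and an SD of 0.00 are easy to reach when classes are well separated, so a summary like this says little about performance on an independent cohort.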

CONCLUSIONS

ML models based on molecular features effectively discriminate primary tumors from metastases in multiple NSCLCs, which improves the accuracy of multiple-NSCLC diagnosis and assists clinical decision-making, particularly in challenging cases.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be9c/12082235/83d7ad3ef7bb/tlcr-14-04-1118-f1.jpg
