Chambuso Ramadhani, Musarurwa Takudzwa Nyasha, Aldera Alessandro Pietro, Deffur Armin, Geffen Hayli, Perkins Douglas, Ramesar Raj
Department of Global Health and Population, Harvard T. Chan School of Public Health, Boston, MA, USA.
UCT/MRC Genomics and Precision Medicine Research Unit, Division of Human Genetics, Department of Pathology, University of Cape Town, Cape Town, South Africa.
BJC Rep. 2025 May 5;3(1):30. doi: 10.1038/s44276-025-00140-7.
Lynch syndrome (LS) screening methods include multistep molecular somatic tumor testing to distinguish likely-LS patients from sporadic cases, which can be costly and complex. In addition, direct germline testing for LS in every patient diagnosed with a solid cancer is challenging in resource-limited settings. We developed a machine learning scoring model to ascertain likely-LS cases from a cohort of colorectal cancer (CRC) patients.
We used CRC patients from the cBioPortal database (TCGA studies) with complete clinicopathologic and somatic genomics data. We determined the rate of pathogenic/likely pathogenic variants in five LS genes (MLH1, MSH2, MSH6, PMS2, EPCAM), and of BRAF mutations, using a pre-designed bioinformatic annotation pipeline. Annovar, Intervar, Variant Effect Predictor (VEP), and OncoKB software tools were used to functionally annotate and interpret the somatic variants detected. The OncoKB precision oncology knowledge base was used to provide information on the effects of the identified variants. We scored the clinicopathologic and somatic genomics data automatically using a machine learning model to discriminate between likely-LS and sporadic CRC cases. The training and testing datasets comprised 80% and 20% of the total CRC patients, respectively. Group regularisation methods in combination with 10-fold cross-validation were used for feature selection on the training data.
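The data partitioning described above (an 80%/20% train/test division, then 10-fold cross-validation on the training portion) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the stratification by class, the random seed, and the example labels are all assumptions, and the group regularisation step itself is not implemented here.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio similar in both parts.

    Stratifying by class is an assumption; the abstract only states 80%/20%.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k=10, seed=0):
    """Partition the given indices into k near-equal, disjoint validation folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# Hypothetical labels (1 = likely-LS, 0 = sporadic); the counts are invented
# for illustration and are not taken from the study.
labels = [1] * 50 + [0] * 450
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx)

# Each fold serves once as the validation set; the other nine are used for fitting.
for fold in folds:
    fit_idx = [i for i in train_idx if i not in set(fold)]
```

Feature selection would be run inside each of the ten iterations, on `fit_idx` only, so that the held-out 20% never influences which features are kept.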
Out of 4800 CRC patients from the TCGA datasets with clinicopathological and somatic genomics data, we ascertained 524 patients with complete data. The scoring model using both clinicopathological and genetic characteristics for likely-LS showed a sensitivity and specificity of 100%, and both attained the maximum value of 1 for accuracy, area under the curve (AUC), and area under the precision-recall curve (AUCPR). In a similar analysis, the training and testing models that relied only on clinical or pathological characteristics had a sensitivity of 0.88 and 0.50, specificity of 0.55 and 0.51, accuracy of 0.58 and 0.51, AUC of 0.74 and 0.61, and AUCPR of 0.21 and 0.19, respectively.
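For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, accuracy, and AUC are computed from binary labels and model scores. It is an illustrative stdlib-only example: the function names, the 0.5 decision threshold, and the toy scores are assumptions and do not come from the study.

```python
def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration (1 = likely-LS, 0 = sporadic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.75, 0.2, 0.1, 0.4, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = screening_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds. Both functions assume that each class is present at least once; otherwise they divide by zero.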
Simultaneous scoring of LS clinicopathological and somatic genomics data can improve the prediction and ascertainment of likely-LS cases among all CRC cases. This approach can increase accuracy while reducing the reliance on expensive direct germline testing for all CRC patients, making LS screening more accessible and cost-effective, especially in resource-limited settings.