Improved detection of clinically relevant fusion transcripts in cancer by machine learning classification.

Affiliations

Faculty of Medicine, Department of Clinical Sciences Lund, Oncology, Lund University Cancer Centre, Lund, Sweden.

Department of Physics, Chemistry and Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Linköping University, Linköping, Sweden.

Publication information

BMC Genomics. 2023 Dec 18;24(1):783. doi: 10.1186/s12864-023-09889-y.

DOI:10.1186/s12864-023-09889-y

PMID:38110872

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10726539/

Abstract

BACKGROUND

Genomic rearrangements in cancer cells can create fusion genes that encode chimeric proteins or alter the expression of coding and non-coding RNAs. In some cancer types, fusions involving specific kinases are used as targets for therapy. Fusion genes can be detected by whole genome sequencing (WGS) and targeted fusion panels, but RNA sequencing (RNA-Seq) has the advantageous capability of broadly detecting expressed fusion transcripts.

RESULTS

We developed a pipeline for validation of fusion transcripts identified in RNA-Seq data using matched WGS data from The Cancer Genome Atlas (TCGA) and applied it to 910 tumors from 11 different cancer types. This resulted in 4237 validated gene fusions, 3049 of them with at least one identified genomic breakpoint. Utilizing validated fusions as true positive events, we trained a machine learning classifier to predict true and false positive fusion transcripts from RNA-Seq data. The final precision and recall metrics of the classifier were 0.74 and 0.71, respectively, in an independent dataset of 249 breast tumors. Application of this classifier to all samples with RNA-Seq data from these cancer types vastly extended the number of likely true positive fusion transcripts and identified many potentially targetable kinase fusions. Further analysis of the validated gene fusions suggested that many are created by intrachromosomal amplification events with microhomology-mediated non-homologous end-joining.
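To make the reported precision and recall concrete, the following minimal sketch shows how such metrics could be computed when WGS-validated fusions serve as the true positive labels. It is an illustration only, not the authors' pipeline: the `FusionCall` record, its field names, and the toy gene names are all hypothetical. The F1 value at the end is derived here from the two reported figures and does not appear in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionCall:
    """One candidate fusion transcript (hypothetical record layout)."""
    gene5: str            # 5' partner gene (placeholder name)
    gene3: str            # 3' partner gene (placeholder name)
    predicted_true: bool  # classifier says "true fusion" (assumed output)
    validated: bool       # supported by matched WGS data (assumed label)


def precision_recall(calls):
    """Return (precision, recall), treating validated fusions as ground truth."""
    tp = sum(c.predicted_true and c.validated for c in calls)
    fp = sum(c.predicted_true and not c.validated for c in calls)
    fn = sum((not c.predicted_true) and c.validated for c in calls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data, invented purely for illustration.
calls = [
    FusionCall("GENE_A", "GENE_B", predicted_true=True, validated=True),
    FusionCall("GENE_C", "GENE_D", predicted_true=True, validated=False),
    FusionCall("GENE_E", "GENE_F", predicted_true=False, validated=True),
    FusionCall("GENE_G", "GENE_H", predicted_true=True, validated=True),
    FusionCall("GENE_I", "GENE_J", predicted_true=False, validated=False),
]

p, r = precision_recall(calls)
print(f"toy precision={p:.2f}, recall={r:.2f}")

# F1 implied by the precision (0.74) and recall (0.71) reported in the abstract.
print(f"implied F1={f1_score(0.74, 0.71):.3f}")
```

In these terms, a precision of 0.74 means that roughly three out of four fusions the classifier accepts would be expected to be true, and a recall of 0.71 means that it recovers about 71% of the true fusions.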

CONCLUSIONS

A classifier trained on validated fusion events increased the accuracy of fusion transcript identification in samples without WGS data. This allowed the analysis to be extended to all samples with RNA-Seq data, facilitating studies of tumor biology and increasing the number of detected kinase fusions. Machine learning could thus be used in identification of clinically relevant fusion events for targeted therapy. The large dataset of validated gene fusions generated here presents a useful resource for development and evaluation of fusion transcript detection algorithms.
