Institute for Integrative Biology of the Cell, UMR 9198, CEA, CNRS, Université Paris-Saclay, Gif-Sur-Yvette, France.
Institute of Biology, Université Paris Est Creteil, Creteil, France.
BMC Cancer. 2021 Apr 12;21(1):394. doi: 10.1186/s12885-021-08021-1.
RNA-seq data are increasingly used to derive prognostic signatures for cancer outcome prediction. A limitation of current predictors is their reliance on reference gene annotations, which amounts to ignoring large numbers of non-canonical RNAs produced in disease tissues. A recently introduced kind of transcriptome classifier operates entirely in a reference-free manner, relying on k-mers extracted from patient RNA-seq data.
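To illustrate what "relying on k-mers" means in practice, here is a minimal sketch of counting k-mers in a set of reads. It is not the tool used in the study: the function name `count_kmers`, the choice to skip k-mers containing `N`, and the absence of strand canonicalization are all assumptions made for illustration, and the k-mer length is left to the caller because the abstract does not state it.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across an iterable of read sequences.

    Hypothetical helper for illustration only; k-mers containing an
    ambiguous base (N) are skipped, and strands are not canonicalized.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of width k along the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts
```

Running such a counter on each patient's reads and aligning the results by k-mer gives one column per patient, that is, a k-mer count matrix that requires no reference annotation.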
In this paper, we set out to compare conventional and reference-free signatures in risk and relapse prediction of prostate cancer. To compare the two approaches as fairly as possible, we set up a common procedure that takes as input either a k-mer count matrix or a gene expression matrix, extracts a signature and evaluates this signature in an independent dataset.
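The idea of a single procedure that accepts either kind of matrix can be sketched as follows. This is a toy stand-in, not the authors' pipeline: the feature ranking (absolute difference of class means), the nearest-centroid classifier, the accuracy metric, and all function names are assumptions chosen to keep the example short. The only point it makes is that the same code path handles gene-level and k-mer-level features, and that the signature is scored on a dataset it was not derived from.

```python
from statistics import mean


def extract_signature(matrix, labels, n_features):
    """Rank features by absolute difference of class means (illustrative only).

    matrix: dict mapping a feature name (gene or k-mer) to a list of
    values, one per sample. labels: list of 0/1 class labels.
    """
    scores = {}
    for name, values in matrix.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(mean(pos) - mean(neg))
    return sorted(scores, key=scores.get, reverse=True)[:n_features]


def fit_centroids(matrix, labels, signature):
    """Compute the mean signature profile of each class."""
    return {
        cls: [mean(v for v, y in zip(matrix[f], labels) if y == cls)
              for f in signature]
        for cls in (0, 1)
    }


def predict(matrix, signature, centroids):
    """Assign each sample to the class with the nearest centroid."""
    n_samples = len(matrix[signature[0]])
    predictions = []
    for i in range(n_samples):
        sample = [matrix[f][i] for f in signature]
        dist = {cls: sum((a - b) ** 2 for a, b in zip(sample, c))
                for cls, c in centroids.items()}
        predictions.append(min(dist, key=dist.get))
    return predictions


def evaluate(train, train_labels, test, test_labels, n_features):
    """Extract a signature on one dataset and score it on an independent one.

    Assumes every signature feature is also present in the test matrix.
    """
    signature = extract_signature(train, train_labels, n_features)
    centroids = fit_centroids(train, train_labels, signature)
    predictions = predict(test, signature, centroids)
    accuracy = mean(int(p == y) for p, y in zip(predictions, test_labels))
    return signature, accuracy
```

Because `evaluate` only sees feature names and values, passing a gene expression matrix or a k-mer count matrix changes nothing in the procedure, which is what makes the comparison between the two signature types like for like.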
We find that gene-based and k-mer-based classifiers achieved similarly high performance for risk prediction and markedly lower performance for relapse prediction. Interestingly, the reference-free signatures included a set of sequences mapping to novel lncRNAs or variable regions of cancer driver genes that were not part of the gene-based signatures.
Reference-free classifiers are thus a promising strategy for the identification of novel prognostic RNA biomarkers.