Lysenkova Wiklander Mariya, Zachariah Dave, Krali Olga, Nordlund Jessica
Department of Medical Sciences, Uppsala University, Uppsala, Sweden.
SciLifeLab, Uppsala University, Uppsala, Sweden.
JCO Clin Cancer Inform. 2025 May;9:e2400324. doi: 10.1200/CCI-24-00324. Epub 2025 May 28.
Recent advances in machine learning have led to the development of classifiers that predict molecular subtypes of acute lymphoblastic leukemia (ALL) using RNA-sequencing (RNA-seq) data. Although these models have shown promising results, they often lack robust performance guarantees. The aim of this study was threefold: to quantify the uncertainty of these classifiers, to provide prediction sets that control the false-negative rate (FNR), and to reduce errors implicitly by converting incorrect predictions into uncertain ones.
Conformal prediction (CP) is a distribution-agnostic framework for generating statistically calibrated prediction sets whose size reflects model uncertainty. In this study, we applied an extension called conformal risk control to three RNA-seq ALL subtype classifiers. Leveraging RNA-seq data from 1,227 patient samples taken at diagnosis, we developed a multiclass conformal predictor ALLCoP, which generates statistically guaranteed FNR-controlled prediction sets.
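As a rough illustration of how FNR-controlled prediction sets can be calibrated, the sketch below applies the general conformal risk control recipe to a single-label multiclass problem, where the false-negative loss of a prediction set is 1 if the true subtype is missing and 0 otherwise. It is not the ALLCoP implementation: the function names, the threshold grid, and the synthetic probabilities are all assumptions made for the example, and a real calibration set would hold classifier outputs for samples with known subtypes.

```python
import random


def calibrate_threshold(cal_probs, cal_labels, alpha, grid_size=1000):
    """Return the largest threshold lam in [0, 1] such that the conformal
    risk control bound (misses + 1) / (n + 1) stays at or below alpha.

    A class k enters the prediction set when its probability is >= lam, so
    a calibration sample is a "miss" when its true-class probability < lam.
    """
    n = len(cal_labels)
    true_probs = [p[y] for p, y in zip(cal_probs, cal_labels)]
    best = 0.0  # lam = 0 keeps every class, so the miss count is zero
    for i in range(grid_size + 1):
        lam = i / grid_size
        misses = sum(tp < lam for tp in true_probs)
        if (misses + 1) / (n + 1) <= alpha:
            best = lam
        else:
            break  # misses only grow with lam, so no larger lam can pass
    return best


def predict_set(probs, lam):
    """Indices of all classes whose probability reaches the threshold."""
    return [k for k, p in enumerate(probs) if p >= lam]


def synthetic_probs(rng, n, n_classes=5, boost=3.0):
    """Hypothetical stand-in for classifier output: random probability
    vectors in which the true class tends to receive extra weight."""
    probs, labels = [], []
    for _ in range(n):
        y = rng.randrange(n_classes)
        weights = [rng.random() for _ in range(n_classes)]
        weights[y] += boost * rng.random()
        total = sum(weights)
        probs.append([w / total for w in weights])
        labels.append(y)
    return probs, labels


rng = random.Random(0)
cal_probs, cal_labels = synthetic_probs(rng, 500)
lam = calibrate_threshold(cal_probs, cal_labels, alpha=0.10)

test_probs, test_labels = synthetic_probs(rng, 2000)
sets = [predict_set(p, lam) for p in test_probs]
fnr = sum(y not in s for s, y in zip(sets, test_labels)) / len(sets)
print(f"threshold={lam:.3f}, test FNR={fnr:.3f}")
```

Under these assumptions, a lower tolerance `alpha` yields a lower threshold and therefore larger prediction sets, which is the sense in which set size reflects model uncertainty. The guarantee holds in expectation over exchangeable calibration and test samples, so the FNR observed on any one test cohort can fall somewhat above or below `alpha`.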
ALLCoP produced prediction sets at specified FNR tolerances ranging from 7.5% to 30%. In a validation cohort, ALLCoP reduced the FNR of the ALLIUM RNA-seq ALL subtype classifier from 8.95% to 3.5%. For patients whose subtype was not previously known, ALLCoP reduced the proportion of empty predictions from 37% to 17%. Notably, up to 34% of the multiple-class prediction sets included the alt subtype, suggesting that larger prediction sets may reflect secondary aberrations and biological complexity that contribute to classifier uncertainty. Finally, ALLCoP was validated on two additional RNA-seq ALL subtype classifiers, ALLSorts and ALLCatchR.
Our results highlight the potential of CP both to enhance the use of oncologic RNA-seq subtyping classifiers and to uncover additional molecular aberrations of potential clinical importance.