Error Reduction in Leukemia Machine Learning Classification With Conformal Prediction

Author information

Lysenkova Wiklander Mariya, Zachariah Dave, Krali Olga, Nordlund Jessica

Affiliations

Department of Medical Sciences, Uppsala University, Uppsala, Sweden.

SciLifeLab, Uppsala University, Uppsala, Sweden.

Publication information

JCO Clin Cancer Inform. 2025 May;9:e2400324. doi: 10.1200/CCI-24-00324. Epub 2025 May 28.

DOI:10.1200/CCI-24-00324

PMID:40435436

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12133051/

Abstract

PURPOSE

Recent advances in machine learning have led to the development of classifiers that predict molecular subtypes of acute lymphoblastic leukemia (ALL) using RNA-sequencing (RNA-seq) data. Although these models have shown promising results, they often lack robust performance guarantees. The aim of this study was three-fold: to quantify the uncertainty of these classifiers, to provide prediction sets that control the false-negative rate (FNR), and to perform implicit error reduction by transforming incorrect predictions into uncertain predictions.

METHODS

Conformal prediction (CP) is a distribution-agnostic framework for generating statistically calibrated prediction sets whose size reflects model uncertainty. In this study, we applied an extension called conformal risk control to three RNA-seq ALL subtype classifiers. Leveraging RNA-seq data from 1,227 patient samples taken at diagnosis, we developed a multiclass conformal predictor ALLCoP, which generates statistically guaranteed FNR-controlled prediction sets.
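To make the calibration step concrete, the following is a minimal sketch of conformal risk control for FNR-controlled prediction sets. It is not the ALLCoP implementation: it assumes a single true subtype per sample, classifier outputs in the form of per-class probabilities, and a held-out calibration set. The function names `calibrate_threshold` and `prediction_sets`, the threshold parameter `lam`, and the grid resolution are all hypothetical choices made for illustration.

```python
import numpy as np


def calibrate_threshold(cal_probs, cal_labels, alpha, grid=None):
    """Return the smallest threshold parameter lam on the grid whose
    adjusted calibration risk is at most alpha.

    Assumption: one true label per sample, so the per-sample FNR loss is
    1 when the true class is left out of the set and 0 otherwise
    (loss upper bound B = 1).
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)  # hypothetical resolution

    # Probability the classifier assigned to each sample's true class.
    true_class_prob = cal_probs[np.arange(n), cal_labels]

    for lam in grid:
        # A class enters the set when its probability is >= 1 - lam,
        # so the true class is missed when its probability is below that.
        missed = true_class_prob < 1.0 - lam
        empirical_risk = missed.mean()
        # Finite-sample adjustment used in conformal risk control.
        if (n / (n + 1)) * empirical_risk + 1.0 / (n + 1) <= alpha:
            return float(lam)

    # No grid value qualified: fall back to sets containing every class.
    return 1.0


def prediction_sets(probs, lam):
    """Boolean matrix (samples x classes); True marks a class in the set."""
    return np.asarray(probs, dtype=float) >= 1.0 - lam
```

In use, `calibrate_threshold` would be run once on calibration samples with known subtypes, and `prediction_sets` would then be applied to new samples with the returned threshold. A tighter FNR tolerance `alpha` yields a larger `lam` and therefore larger prediction sets, which is how set size comes to reflect model uncertainty.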

RESULTS

ALLCoP was able to create prediction sets with specified FNR tolerances ranging from 7.5% to 30%. In a validation cohort, ALLCoP successfully reduced the FNR of the ALLIUM RNA-seq ALL subtype classifier from 8.95% to 3.5%. For patients whose subtype was not previously known, the use of ALLCoP was able to reduce the occurrence of empty predictions from 37% to 17%. Notably, up to 34% of the multiple-class prediction sets included the alt subtype, suggesting that increased prediction set size may reflect secondary aberrations and biological complexity, contributing to classifier uncertainty. Finally, ALLCoP was validated on two additional RNA-seq ALL subtype classifiers, ALLSorts and ALLCatchR.

CONCLUSION

Our results highlight the potential of CP in enhancing the use of oncologic RNA-seq subtyping classifiers and also in uncovering additional molecular aberrations of potential clinical importance.
