Cattelani Luca, Fortino Vittorio
Institute of Biomedicine, School of Medicine, University of Eastern Finland, Kuopio, Finland.
Bioinform Adv. 2022 Oct 13;2(1):vbac074. doi: 10.1093/bioadv/vbac074. eCollection 2022.
Gene expression-based classifiers are often developed from historical data by training a model on a small set of patients and a large set of features. Models trained in this way can afterwards be applied to predict the output for new, unseen patient data. However, the accuracy of these models very often starts to decrease as soon as new data are fed into the trained model. This problem, known as concept drift, complicates the task of learning efficient biomarkers from data and requires special approaches that differ from commonly used data mining techniques.
Here, we propose an online ensemble learning method to continually validate and adjust gene expression-based biomarker panels over an increasing volume of data. We also propose a computational solution to the problem of feature drift, where the gene expression signatures used to train the classifier become less relevant over time. A benchmark study was conducted to classify breast tumors into known subtypes using a large-scale transcriptomic dataset (∼3500 patients), obtained by combining two datasets: SCAN-B and TCGA-BRCA. Remarkably, the proposed strategy improves the classification performance of gold-standard biomarker panels (e.g. PAM50, OncotypeDX and Endopredict) by adding features that are clinically relevant. Moreover, test results show that newly discovered biomarker models can retain a high classification accuracy when the source generating the gene expression profiles changes.
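To make the idea of online ensemble learning under feature drift concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic weighted-majority scheme with a weight floor, nearest-centroid base learners that each see one feature subset, and a synthetic two-class stream in which the informative feature changes halfway through. All class names, function names, parameters (`beta`, `floor`) and the data itself are hypothetical.

```python
import random
from collections import defaultdict


class CentroidLearner:
    """Online nearest-centroid classifier restricted to a subset of features."""

    def __init__(self, features):
        self.features = features
        self.sums = defaultdict(lambda: [0.0] * len(features))
        self.counts = defaultdict(int)

    def partial_fit(self, x, y):
        # Accumulate per-class feature sums so centroids can be updated online.
        for i, f in enumerate(self.features):
            self.sums[y][i] += x[f]
        self.counts[y] += 1

    def predict(self, x):
        best, best_dist = None, float("inf")
        for label, n in self.counts.items():
            dist = sum((x[f] - s / n) ** 2
                       for f, s in zip(self.features, self.sums[label]))
            if dist < best_dist:
                best, best_dist = label, dist
        return best  # None until at least one sample has been seen


class OnlineEnsemble:
    """Weighted-majority ensemble; a member's weight shrinks when it errs."""

    def __init__(self, feature_subsets, beta=0.8, floor=0.01):
        self.members = [CentroidLearner(fs) for fs in feature_subsets]
        self.weights = [1.0] * len(self.members)
        self.beta, self.floor = beta, floor

    def predict(self, x):
        votes = defaultdict(float)
        for member, w in zip(self.members, self.weights):
            p = member.predict(x)
            if p is not None:
                votes[p] += w
        return max(votes, key=votes.get) if votes else None

    def update(self, x, y):
        for i, member in enumerate(self.members):
            p = member.predict(x)
            if p is not None and p != y:
                self.weights[i] *= self.beta
            member.partial_fit(x, y)
        total = sum(self.weights)
        # The floor lets a member recover if its features become relevant later.
        self.weights = [max(w / total, self.floor) for w in self.weights]


def make_stream(n_per_phase=300, seed=0):
    """Synthetic stream: feature 0 is informative first, then feature 1."""
    rng = random.Random(seed)
    stream = []
    for informative in (0, 1):
        for _ in range(n_per_phase):
            y = rng.choice(["A", "B"])
            x = [rng.gauss(0.0, 1.0) for _ in range(3)]
            x[informative] += 2.0 if y == "A" else -2.0
            stream.append((x, y))
    return stream


def prequential_accuracy(model, stream):
    """Test-then-train: each sample is predicted before it is learned from."""
    correct = total = 0
    for x, y in stream:
        pred = model.predict(x)
        if pred is not None:
            correct += pred == y
            total += 1
        model.update(x, y)
    return correct / total if total else 0.0


stream = make_stream()
model = OnlineEnsemble(feature_subsets=[[0], [1], [2]])
acc = prequential_accuracy(model, stream)
print(f"prequential accuracy: {acc:.3f}")
print("final member weights:", [round(w, 3) for w in model.weights])
```

In this toy setting the member built on feature 1 should end with the largest weight, because it is the only one that remains accurate after the drift. A real biomarker panel would replace the single-feature subsets with gene sets and the synthetic stream with incoming expression profiles.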
Source code: github.com/UEFBiomedicalInformaticsLab/OnlineLearningBD.
Supplementary data are available at Bioinformatics Advances online.