Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden.
Urology - urothelial cancer, Department of Translational Medicine, Lund University, Skåne University Hospital, Malmö, Sweden.
Bioinformatics. 2022 Jan 27;38(4):1022-1029. doi: 10.1093/bioinformatics/btab763.
Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSP). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here, we evaluate the behavior of several multiclass SSPs based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms and provide an informative prediction output score.
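As a minimal sketch of why gene-pair rules can be evaluated on one sample in isolation, the Python snippet below computes binary rules of the form "gene a is expressed higher than gene b" and checks that they are unchanged by monotonic within-sample transformations. The gene names, expression values and the `pair_rules` helper are hypothetical illustrations, not taken from the paper or from any package.

```python
import math

def pair_rules(sample, pairs):
    """Evaluate binary gene-pair rules within a single sample.

    Each rule is 1 if the first gene is expressed higher than the
    second gene in this sample, otherwise 0. No other samples are needed.
    """
    return [1 if sample[a] > sample[b] else 0 for a, b in pairs]

# Hypothetical expression values for one sample (illustrative only).
sample_raw = {"GENE_A": 120.0, "GENE_B": 35.0, "GENE_C": 8.0, "GENE_D": 64.0}

# Hypothetical gene pairs; a real predictor would select these during training.
pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_D", "GENE_B")]

rules_raw = pair_rules(sample_raw, pairs)

# Monotonic within-sample transformations preserve the ordering of genes,
# so the rule values do not change.
sample_log = {g: math.log2(v + 1.0) for g, v in sample_raw.items()}
sample_scaled = {g: 0.01 * v for g, v in sample_raw.items()}

rules_log = pair_rules(sample_log, pairs)
rules_scaled = pair_rules(sample_scaled, pairs)

print(rules_raw, rules_log, rules_scaled)
```

Because only the within-sample ordering of each gene pair matters, the rule values do not depend on which other samples are in the cohort. This does not address platform-specific biases that change the relative ordering of two genes, which is the cross-platform concern the abstract raises.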
We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approach both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification.
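The differences in prediction score output and tie handling can be illustrated with a simplified voting scheme, sketched below in Python. This is an assumption-laden illustration, not the scoring used by k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping or the Random Forest method: each class is assumed to own a set of rules, its score is the fraction of those rules that are true, and a tie for the top score is reported rather than broken. The class labels, rule assignments and function names are hypothetical.

```python
def class_scores(rule_values, rules_by_class):
    """Score each class as the fraction of its rules that evaluate to 1."""
    scores = {}
    for cls, idxs in rules_by_class.items():
        scores[cls] = sum(rule_values[i] for i in idxs) / len(idxs)
    return scores

def predict(scores):
    """Return (label, winners); label is None when the top score is tied."""
    best = max(scores.values())
    winners = sorted(c for c, s in scores.items() if s == best)
    label = winners[0] if len(winners) == 1 else None
    return label, winners

# Hypothetical assignment of six rule indices to three classes.
rules_by_class = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}

# A sample with a clear winner.
clear_scores = class_scores([1, 1, 0, 1, 0, 0], rules_by_class)
clear_label, clear_winners = predict(clear_scores)

# A sample where two classes receive the same score.
tied_scores = class_scores([1, 0, 0, 1, 0, 0], rules_by_class)
tied_label, tied_winners = predict(tied_scores)

print(clear_scores, clear_label)
print(tied_scores, tied_label, tied_winners)
```

Returning the full score dictionary alongside the label keeps the output informative: a narrow margin or an explicit tie can be flagged for review instead of being silently resolved.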
Our R package 'multiclassPairs' (https://cran.r-project.org/package=multiclassPairs; https://doi.org/10.1093/bioinformatics/btab088) is freely available. It enables easy training, prediction and visualization using the gene-pair rule-based Random Forest SSP method, and it provides additional multiclass functionalities to the switchBox k-Top Scoring Pairs package.
Supplementary data are available at Bioinformatics online.