Towards the genome-scale discovery of bivariate monotonic classifiers.

Author information

Fourquet Océane, Krejca Martin S, Doerr Carola, Schwikowski Benno

Affiliations

Computational Systems Biomedicine Lab, Institut Pasteur, Université Paris Cité, 25-28 Rue du Dr Roux, 75015, Paris, France.

LIP6, CNRS, Sorbonne Université, 4 Place Jussieu, 75005, Paris, France.

Publication information

BMC Bioinformatics. 2025 Sep 2;26(1):228. doi: 10.1186/s12859-025-06253-7.

DOI:10.1186/s12859-025-06253-7

PMID:40898061

Abstract

BACKGROUND

Bivariate monotonic classifiers (BMCs) are based on pairs of input features. Like many other models used for machine learning, they can capture nonlinear patterns in high-dimensional data. At the same time, they are simple and easy to interpret. Until now, the use of BMCs on a genome scale was hampered by the high computational complexity of the search for pairs of features with a high leave-one-out performance estimate.
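To make the search problem concrete, the following is a minimal Python sketch of an exhaustive pair search scored by leave-one-out error. It is not the authors' implementation: the classifier is a deliberately simplified monotone rule (a sample is labeled positive if it is at least as large, in both features, as some positive training sample), chosen only because it is easy to read. The names `predict`, `loo_error`, and `best_pair`, as well as the toy data, are hypothetical.

```python
from itertools import combinations


def predict(train, x):
    """Toy monotone rule (an assumption, not the paper's BMC model):
    predict 1 iff x dominates at least one positive training point."""
    return int(any(px <= x[0] and py <= x[1]
                   for (px, py), label in train if label == 1))


def loo_error(data):
    """Leave-one-out misclassification count for one feature pair.
    `data` is a list of ((value_a, value_b), label) tuples."""
    errors = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out sample i
        errors += int(predict(train, x) != y)
    return errors


def best_pair(X, y):
    """Score every pair of feature columns and keep the lowest LOO error."""
    best = None
    for a, b in combinations(range(len(X[0])), 2):
        data = [((row[a], row[b]), label) for row, label in zip(X, y)]
        err = loo_error(data)
        if best is None or err < best[0]:
            best = (err, (a, b))
    return best


# Hypothetical toy data: 9 samples, 3 features; feature 2 is noise.
X = [
    [1, 1, 4], [3, 1, 3], [1, 3, 5], [2, 2, 2],             # label 0
    [2, 3, 1], [2, 3, 3], [3, 2, 2], [3, 2, 4], [3, 3, 0],  # label 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

print(best_pair(X, y))  # (0, (0, 1)): features 0 and 1 separate the classes
```

Every pair requires a full leave-one-out pass, and the number of pairs grows quadratically with the number of features. This is the cost that limits the exhaustive approach at genome scale.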

RESULTS

We introduce the fastBMC algorithm, which drastically speeds up the identification of BMCs. The algorithm relies on a mathematical bound on the BMC performance estimate and preserves optimality. We show empirically that, compared to the traditional approach, fastBMC speeds up the computation by a factor of at least 15, even for a small number of features. For two of the three smaller biomedical datasets that we consider here, the resulting ability to consider much larger sets of features translates into significantly improved classification performance. As an example of the high degree of interpretability of BMCs, we discuss a straightforward interpretation of a BMC glioblastoma survival predictor, an immediate novel biomedical hypothesis, options for biomedical validation, and treatment implications. In addition, we study the performance of fastBMC on a larger, well-known breast cancer dataset, validating the benefits of BMCs for biomarker identification and biomedical hypothesis generation.
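The general idea of pruning a pair search with a bound, without losing the optimal result, can be sketched as follows. This is an illustration under stated assumptions, not the fastBMC algorithm itself. It uses a simplified monotone rule (a sample is labeled positive if it dominates some positive training sample) and takes that rule's training error as the lower bound on the leave-one-out error. The bound is valid for this toy rule because leaving a sample out never fixes an error the rule makes on the full data. The bound actually used by fastBMC is defined in the paper and may differ. All function names and the data are hypothetical.

```python
from itertools import combinations


def predict(train, x):
    """Toy monotone rule (an assumption, not the paper's BMC model):
    predict 1 iff x dominates at least one positive training point."""
    return int(any(px <= x[0] and py <= x[1]
                   for (px, py), label in train if label == 1))


def training_error(data):
    """Errors when the rule is built from, and evaluated on, all samples.
    For this toy rule it never exceeds the leave-one-out error."""
    return sum(int(predict(data, x) != y) for x, y in data)


def loo_error(data):
    """Leave-one-out misclassification count for one feature pair."""
    return sum(int(predict(data[:i] + data[i + 1:], x) != y)
               for i, (x, y) in enumerate(data))


def pruned_search(X, y):
    """Pair search that skips the LOO pass whenever the lower bound
    shows the pair cannot beat the best LOO error found so far."""
    best_err, best_pair, loo_calls = None, None, 0
    for a, b in combinations(range(len(X[0])), 2):
        data = [((row[a], row[b]), label) for row, label in zip(X, y)]
        if best_err is not None and training_error(data) >= best_err:
            continue  # bound >= incumbent: LOO error cannot be lower
        loo_calls += 1
        err = loo_error(data)
        if best_err is None or err < best_err:
            best_err, best_pair = err, (a, b)
    return best_err, best_pair, loo_calls


# Hypothetical toy data: 9 samples, 3 features; feature 2 is noise.
X = [
    [1, 1, 4], [3, 1, 3], [1, 3, 5], [2, 2, 2],             # label 0
    [2, 3, 1], [2, 3, 3], [3, 2, 2], [3, 2, 4], [3, 3, 0],  # label 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

print(pruned_search(X, y))  # (0, (0, 1), 1): one LOO pass, two pairs skipped
```

In this toy setting the bound costs about as much as the leave-one-out pass, so the sketch shows only the control flow. A real saving requires a bound that is much cheaper than refitting the model once per held-out sample.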

CONCLUSION

fastBMC enables the rapid construction of robust and interpretable ensemble models using BMCs, facilitating the discovery of gene pairs that are predictive of relevant phenotypes and of their interaction in that context.

AVAILABILITY

We provide the first open-source implementation for learning BMCs, including a Python implementation of fastBMC, together with Python code to reproduce the fastBMC results on real and simulated data in this paper, at https://github.com/oceanefrqt/fastBMC.
