Cordier Chiara, Jézéquel Pascal, Campone Mario, Panloup Fabien, Basseville Agnes
LAREMA, Univ Angers, CNRS, SFR MATHSTIC, Angers F-49000, France.
Institut de Cancérologie de l'Ouest, Angers F-49000, France.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf310.
Machine learning analyses of molecular omics datasets largely drive the development of precision medicine in oncology, but mathematical challenges still hamper their application in the clinic. In particular, omics-based learning relies on high-dimensional data with many degrees of freedom and multicollinearity, which calls for more tailored algorithms. Here, we have developed a prediction algorithm, HABiC, that relies on the 1-Wasserstein distance to better capture complex relationships between variables, and whose decision rule is built on the exact computation of the Kantorovich-Rubinstein optimizer to increase the algorithm's precision. We explored dimension reduction and aggregation methods to improve its robustness. The exact method was compared with a neural network-based approximate method, as well as with standard Euclidean distance-based classifiers.
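To make the idea of a decision rule built on the Kantorovich-Rubinstein optimizer concrete, here is a minimal sketch. It is not the HABiC implementation, which is available at the repository cited in this record. It solves the discrete Kantorovich-Rubinstein dual between two class samples as a linear program and extends the optimal potential to new points. The function names `kr_potential` and `make_rule`, the McShane extension, and the midpoint threshold are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def kr_potential(X, Y):
    """Solve the discrete Kantorovich-Rubinstein dual on the pooled sample.

    Returns the pooled points Z, the optimal potential values f at Z, and
    the optimal value, which equals the 1-Wasserstein distance between the
    two empirical distributions.
    """
    Z = np.vstack([X, Y])
    n, m = len(X), len(Y)
    N = n + m
    D = cdist(Z, Z)  # pairwise Euclidean distances
    # Objective weights: mean of f over X minus mean of f over Y.
    w = np.concatenate([np.full(n, 1.0 / n), np.full(m, -1.0 / m)])
    # 1-Lipschitz constraints f_k - f_l <= D[k, l] for every ordered pair k != l.
    K, L = np.where(~np.eye(N, dtype=bool))
    A = np.zeros((K.size, N))
    A[np.arange(K.size), K] = 1.0
    A[np.arange(K.size), L] = -1.0
    b = D[K, L]
    # Pin f_0 = 0: the objective does not change when a constant is added to f.
    bounds = [(0.0, 0.0)] + [(None, None)] * (N - 1)
    res = linprog(-w, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return Z, res.x, -res.fun


def make_rule(X, Y):
    """Build a classifier from the potential (illustrative assumption)."""
    Z, f, w1 = kr_potential(X, Y)
    n = len(X)
    # Assumed threshold: midpoint between the class means of the potential.
    threshold = 0.5 * (f[:n].mean() + f[n:].mean())

    def predict(points):
        # McShane extension f(x) = min_k (f_k + ||x - z_k||), which is
        # 1-Lipschitz and agrees with f on the pooled sample.
        ext = (f[None, :] + cdist(np.atleast_2d(points), Z)).min(axis=1)
        return (ext > threshold).astype(int)  # 1 = class of X, 0 = class of Y

    return predict, w1
```

The linear program has one constraint per ordered pair of pooled samples, so this sketch only suits small sample sizes. It is meant to show the structure of the exact dual problem, not to scale.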
Experimental results on synthetic datasets with multiple scenarios of redundant/informative variables revealed that exact and approximate methods based on Wasserstein distance outperformed state-of-the-art algorithms when class information was spread across a large number of variables. When predicting clinical or biological outcomes from transcriptomics datasets, HABiC achieved higher accuracy in most situations.
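As an illustration of what scenarios of redundant/informative variables can look like, the sketch below generates synthetic binary classification data with scikit-learn's `make_classification` and scores a k-nearest-neighbours classifier as one example of a Euclidean distance-based baseline. The generator, the scenario sizes, the baseline choice, and the helper name `knn_accuracy` are assumptions for illustration. The abstract does not specify how the synthetic datasets were built or which baselines were used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(n_informative, n_redundant, n_features=200, seed=0):
    """Test accuracy of a Euclidean k-NN baseline on one synthetic scenario."""
    X, y = make_classification(
        n_samples=300,
        n_features=n_features,
        n_informative=n_informative,  # variables that carry class signal
        n_redundant=n_redundant,      # linear combinations of informative ones
        n_classes=2,
        random_state=seed,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


# Hypothetical scenarios: (n_informative, n_redundant) out of 200 variables.
scenarios = {
    "few informative": (5, 0),
    "many informative": (100, 0),
    "few informative, many redundant": (5, 100),
}
for name, (n_inf, n_red) in scenarios.items():
    print(f"{name}: accuracy = {knn_accuracy(n_inf, n_red):.3f}")
```

Any other classifier with `fit` and `score` methods could be swapped in for the k-NN baseline to compare methods on the same scenarios.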
Python code for the HABiC classifier is available at https://github.com/chiaraco/HABiC.