Machicao Jeaneth, Craighero Francesco, Maspero Davide, Angaroni Fabrizio, Damiani Chiara, Graudenzi Alex, Antoniotti Marco, Bruno Odemir M
1 São Carlos Institute of Physics, University of São Paulo, São Carlos, Brazil; 2 School of Engineering, University of São Paulo, São Paulo, Brazil; 3 Department of Informatics, Systems and Communication, University of Milan-Bicocca, Milan, Italy; 4 Institute of Molecular Bioimaging and Physiology, Consiglio Nazionale delle Ricerche (IBFM-CNR), Segrate, Milan, Italy; 5 Department of Biotechnology and Biosciences, University of Milan-Bicocca, Milan, Italy; 6 Sysbio Centre for Systems Biology, Milan, Italy; 7 Bicocca Bioinformatics, Biostatistics and Bioimaging Center (B4), University of Milan-Bicocca, Milan, Italy.
Curr Genomics. 2021 Feb;22(2):88-97. doi: 10.2174/1389202922666210301084151.
The increasing availability of omics data collected from patients affected by severe pathologies, such as cancer, is fostering the development of data science methods for their analysis.
The combination of data integration and machine learning approaches can provide new powerful instruments to tackle the complexity of cancer development and deliver effective diagnostic and prognostic strategies.
We explore the possibility of exploiting the topological properties of sample-specific metabolic networks as features in a supervised classification task. Such networks are obtained by projecting transcriptomic data from RNA-seq experiments on genome-wide metabolic models to define weighted networks modeling the overall metabolic activity of a given sample.
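The projection step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy model (`REACTION_GENES`, `LINKS`), the rule that scores a reaction by the mean expression of its genes, the rule that weights an edge by the lower of its two reaction scores, and the choice of degree and strength as topological features.

```python
from statistics import mean

# Hypothetical toy model (not from the paper): each reaction lists the genes
# associated with it, and LINKS connects reactions assumed to share a metabolite.
REACTION_GENES = {"R1": ["gA", "gB"], "R2": ["gB"], "R3": ["gC"]}
LINKS = [("R1", "R2"), ("R2", "R3"), ("R1", "R3")]


def reaction_scores(expression):
    """Assumed rule: a reaction's activity is the mean expression of its genes."""
    return {
        rxn: mean(expression.get(gene, 0.0) for gene in genes)
        for rxn, genes in REACTION_GENES.items()
    }


def sample_network(expression):
    """Assumed rule: an edge is weighted by the less active of its two reactions."""
    scores = reaction_scores(expression)
    return {(u, v): min(scores[u], scores[v]) for u, v in LINKS}


def topological_features(edges):
    """Summarise a weighted edge dict as a small dict of network features."""
    nodes = sorted({node for edge in edges for node in edge})
    degree = dict.fromkeys(nodes, 0)
    strength = dict.fromkeys(nodes, 0.0)
    for (u, v), weight in edges.items():
        for node in (u, v):
            degree[node] += 1
            strength[node] += weight
    return {
        "n_edges": len(edges),
        "mean_degree": mean(degree.values()),
        "mean_strength": mean(strength.values()),
    }


# One sample's (made-up) expression profile becomes one feature vector.
sample = {"gA": 4.0, "gB": 2.0, "gC": 0.0}
network = sample_network(sample)
features = topological_features(network)
```

Each sample yields its own weighted network over the same model, so the resulting feature dicts are directly comparable across samples and can be stacked into a feature matrix for a classifier.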
We show the classification results on a labeled breast cancer dataset from the TCGA database, including 210 samples (cancer vs. normal). In particular, we investigate how the performance is affected by a threshold-based pruning of the networks by comparing Artificial Neural Networks, Support Vector Machines (SVMs) and Random Forests. Interestingly, the best classification performance is achieved within a small threshold range for all methods, suggesting that it might represent an effective choice to recover useful information while filtering out noise from data. Overall, the best accuracy is achieved with SVMs, whose performance is similar to that obtained when gene expression profiles are used as features.
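Threshold-based pruning can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' procedure: the edge weights are made up, the strict `>` comparison is an arbitrary choice (the abstract does not say how the threshold is applied), and node degree stands in for whatever topological features are actually used.

```python
def prune(edges, threshold):
    """Keep only edges whose weight is strictly above the threshold (assumed rule)."""
    return {edge: w for edge, w in edges.items() if w > threshold}


def degree_vector(edges, nodes):
    """Degree of each node, in the fixed order given by `nodes`."""
    degree = dict.fromkeys(nodes, 0)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return [degree[n] for n in nodes]


def sweep(edges, nodes, thresholds):
    """Recompute the feature vector of one sample's network at each threshold."""
    return {t: degree_vector(prune(edges, t), nodes) for t in thresholds}


# Made-up weighted network for a single sample.
nodes = ["R1", "R2", "R3"]
edges = {("R1", "R2"): 2.0, ("R2", "R3"): 0.5, ("R1", "R3"): 0.1}
by_threshold = sweep(edges, nodes, [0.0, 0.3, 1.0])
```

Repeating the sweep for every sample gives one feature matrix per threshold. Each matrix could then be passed to the classifiers being compared, for example under cross-validation, to see at which thresholds accuracy peaks.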
These findings demonstrate that the topological properties of sample-specific metabolic networks are effective in classifying cancer and normal samples, suggesting that useful information can be extracted from a relatively limited number of features.