School of Systems Biomedical Science, Soongsil University, Seoul, Korea.
Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab163.
Bulk tumor samples used for high-throughput molecular profiling are often an admixture of cancer cells and non-cancerous cells, including immune and stromal cells. This mixed composition can confound the analysis and affect the biological interpretation of the results, so accurate estimation of tumor purity is critical. Although several methods have been proposed to predict tumor purity from high-throughput molecular data, there has been no comprehensive study of machine learning-based methods for tumor purity estimation.
We applied various machine learning models to estimate tumor purity. Overall, the models predicted tumor purity accurately, and their estimates correlated highly with those of well-established gold-standard methods. In addition, we identified a small group of genes and demonstrated that they alone could predict tumor purity well. Finally, we confirmed that these genes were mainly involved in the immune system.
The machine learning models constructed for this study are available at https://github.com/BonilKoo/ML_purity.