PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning.

Authors

Mahin Kazi Ferdous, Robiuddin Md, Islam Mujahidul, Ashraf Shayed, Yeasmin Farjana, Shatabda Swakkhar

Affiliations

Department of Computer Science and Engineering, United International University, Plot-2, United City, Madani Avenue, Satarkul, Badda, Dhaka 1212, Bangladesh.

Publication information

Genomics. 2022 Mar;114(2):110264. doi: 10.1016/j.ygeno.2022.01.001. Epub 2022 Jan 6.

Abstract

Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum owing to the availability of high-throughput sequencing data. RNA-seq is driving rapid progress in cancer research, bringing new insights into cancer and its treatment to light. In this paper, we propose PanClassif, a method that needs only a small number of effective genes to detect cancer from RNA-seq data and that improves the performance of a wide range of machine learning classifiers. We used samples of 22 cancer types from The Cancer Genome Atlas (TCGA), comprising 8287 cancer samples and 680 normal samples. First, PanClassif applies k-Nearest Neighbour (k-NN) smoothing to the samples to handle noise in the data. Effective genes are then selected with an ANOVA-based test. To balance the training data, PanClassif applies an oversampling method, SMOTE. We performed comprehensive experiments on the datasets using several classification algorithms. Experimental results show that PanClassif outperforms existing state-of-the-art methods and performs consistently on two single-cell RNA-seq datasets taken from the Gene Expression Omnibus (GEO). PanClassif improves the performance of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a Python package (https://pypi.org/project/panclassif/). All source code and materials for PanClassif are available at https://github.com/Zwei-inc/panclassif.
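
The three preprocessing steps named in the abstract (smoothing, ANOVA-based gene selection, oversampling) can be illustrated with a minimal NumPy sketch. This is not the panclassif package API: the function names `knn_smooth`, `anova_f`, and `smote_like` are hypothetical, the data are synthetic, and the values of `k` and the number of selected genes are arbitrary. The smoothing shown is a plain neighbour average and the oversampler is a simplified SMOTE-style interpolation, so both may differ from the exact procedures used in the paper.

```python
import numpy as np

def knn_smooth(X, k=3):
    """Replace each sample by the mean of its k+1 closest samples (itself included)."""
    # Pairwise squared Euclidean distances between samples (rows).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # The k+1 smallest distances per row; the sample itself is at distance 0.
    idx = np.argsort(d, axis=1)[:, : k + 1]
    return X[idx].mean(axis=1)

def anova_f(X, y):
    """One-way ANOVA F statistic for every gene (column) across the classes in y."""
    classes = np.unique(y)
    n, g = X.shape[0], len(classes)
    grand = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (ss_between / (g - 1)) / (ss_within / (n - g))

def smote_like(X, y, minority, n_new, k=3, seed=None):
    """Add n_new synthetic minority samples by interpolating towards a neighbour."""
    rng = np.random.default_rng(seed)
    Xm = X[y == minority]
    d = ((Xm[:, None, :] - Xm[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbours
    base = rng.integers(0, len(Xm), size=n_new)
    pick = nbrs[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # interpolation weight in [0, 1)
    synth = Xm[base] + gap * (Xm[pick] - Xm[base])
    return np.vstack([X, synth]), np.concatenate([y, np.full(n_new, minority)])

# Synthetic stand-in for an expression matrix (samples x genes); not TCGA data.
rng = np.random.default_rng(0)
n_cancer, n_normal, n_genes, n_top = 40, 8, 30, 5
X = rng.normal(size=(n_cancer + n_normal, n_genes))
y = np.array([1] * n_cancer + [0] * n_normal)   # 1 = cancer, 0 = normal
X[y == 1, :n_top] += 4.0                        # make the first five genes informative

X_s = knn_smooth(X, k=3)                        # step 1: smooth noisy samples
F = anova_f(X_s, y)                             # step 2: score each gene
top = np.argsort(F)[::-1][:n_top]               #         keep the highest-scoring genes
X_bal, y_bal = smote_like(                      # step 3: balance the classes
    X_s[:, top], y, minority=0, n_new=n_cancer - n_normal, k=3, seed=0
)
print(X_bal.shape, np.bincount(y_bal))
```

In practice the oversampling step would be applied to the training split only, as the abstract specifies, so that synthetic samples never leak into evaluation data; the balanced matrix can then be passed to any classifier.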

