Dag Osman, Kasikci Merve, Ilk Ozlem, Yesiltepe Metin
Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Sihhiye, Ankara, Turkey.
Department of Statistics, Faculty of Arts and Science, Middle East Technical University, 06800, Cankaya, Ankara, Turkey.
Med Biol Eng Comput. 2023 Jan;61(1):229-241. doi: 10.1007/s11517-022-02695-w. Epub 2022 Nov 10.
Selecting differentially expressed genes (DEGs) is a vital step in discovering the causes of diseases. Modelling genomics data in a way that accounts for relations among genes has been shown to improve predictive performance compared with univariate analysis. However, studies analyzing the same dataset often report substantially different results because of the methods they use. There is therefore a strong need for an easily accessible, user-friendly, interactive tool that applies several machine learning algorithms to RNA-seq data simultaneously, so that DEGs are not missed. We developed an open-source, freely available web-based tool for gene selection via machine learning algorithms that can handle high-performance computation. The tool includes six machine learning algorithms with different characteristics. It also covers the classical pre-processing steps: filtering, normalization, transformation, and univariate analysis. It offers several graphical displays: network plot, heatmap, Venn diagram, and box-and-whisker plot. Gene ontology analysis is provided for both mRNA and miRNA DEGs. The tool is applied to Alzheimer's disease RNA-seq data to demonstrate its use. Eleven genes are suggested by at least two of the six methods. One of these, hsa-miR-148a-3p, might be considered a new biomarker for the diagnosis of Alzheimer's disease. A Kidney Chromophobe dataset is also analyzed to demonstrate the validity of the GeneSelectML web tool on a different dataset. GeneSelectML is distinguished by its simultaneous use of different machine learning algorithms for gene selection and by its ability to perform pre-processing, graphical representation, and gene ontology analysis within the same tool. The tool is freely available at www.softmed.hacettepe.edu.tr/GeneSelectML .
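The consensus step described in the abstract, keeping genes suggested by at least two of the six methods, can be illustrated with a minimal Python sketch. This is not the GeneSelectML implementation: the function name `consensus_genes`, the `min_votes` parameter, and the method and gene labels in `example` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus_genes(selections, min_votes=2):
    """Return (gene, votes) pairs for genes chosen by at least `min_votes` methods.

    `selections` maps a method name to the genes that method selected.
    Results are sorted by descending vote count, then by gene name.
    """
    votes = Counter()
    for genes in selections.values():
        # Use a set so each method contributes at most one vote per gene.
        votes.update(set(genes))
    kept = [(gene, n) for gene, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical selections from four unnamed methods (illustrative labels only).
example = {
    "method_A": ["gene1", "gene2", "gene3"],
    "method_B": ["gene2", "gene4"],
    "method_C": ["gene2", "gene3", "gene3"],
    "method_D": ["gene5"],
}

if __name__ == "__main__":
    for gene, n in consensus_genes(example):
        print(f"{gene}: selected by {n} methods")
```

Raising `min_votes` makes the consensus stricter. With the default threshold of two, the sketch mirrors the "at least two out of six methods" rule reported for the Alzheimer's disease data.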