Mahendran Nivedhitha, Durai Raj Vincent P M, Srinivasan Kathiravan, Chang Chuan-Yu
School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India.
Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan.
Front Genet. 2020 Dec 10;11:603808. doi: 10.3389/fgene.2020.603808. eCollection 2020.
Gene expression is the process by which the physical characteristics of living beings are determined through the generation of the necessary proteins. It takes place in two steps, transcription and translation: information flows from DNA to RNA with the help of enzymes, and the end products are proteins and other biochemical molecules. Many technologies can capture gene expression from DNA or RNA. One such technique is the DNA microarray. Besides being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a very small sample size. A learning model trained on such a dataset tends to overfit. This problem should be addressed by substantially reducing the dimensionality of the data. In recent years, machine learning has gained popularity in genomic studies, and many machine learning-based gene selection approaches have been proposed in the literature to improve the precision of dimensionality reduction. This paper presents an extensive review of recent work on machine learning-based gene selection, along with an analysis of its performance. The study categorizes feature selection algorithms under supervised, unsupervised, and semi-supervised learning. Recent work on reducing the number of features for tumor diagnosis is discussed in detail, and the performance of several of the methods discussed in the literature is analyzed. The study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
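As an illustration of the supervised, filter-style gene selection the abstract refers to, the following is a minimal sketch and not a method taken from the paper. It assumes a two-class problem and ranks genes by a signal-to-noise score, |mean_1 - mean_0| / (sd_1 + sd_0), keeping only the top k. The function names (snr_scores, select_top_genes, make_toy_data), the toy data, and the choice of score are all hypothetical and chosen only for this example.

```python
import random
import statistics


def snr_scores(X, y):
    """Score each gene by |mean_1 - mean_0| / (sd_1 + sd_0) over two classes."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        class0 = [row[g] for row, label in zip(X, y) if label == 0]
        class1 = [row[g] for row, label in zip(X, y) if label == 1]
        spread = statistics.stdev(class0) + statistics.stdev(class1)
        diff = abs(statistics.mean(class1) - statistics.mean(class0))
        scores.append(diff / spread if spread > 0 else 0.0)
    return scores


def select_top_genes(X, y, k):
    """Return the (sorted) column indices of the k highest-scoring genes."""
    scores = snr_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n_samples=20, n_genes=500, informative=(3, 42, 117), seed=0):
    """Hypothetical data: many noise genes, few samples, three shifted genes."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 10.0 * label  # strong, artificial class effect
        X.append(row)
    return X, y


# 20 samples x 500 genes: far more features than samples.
X, y = make_toy_data()
selected = select_top_genes(X, y, k=3)

# Keep only the selected columns before training any classifier.
reduced = [[row[g] for g in selected] for row in X]
print(selected)
```

In practice the scores should be computed on training folds only; selecting genes on the full dataset before cross-validation leaks label information and makes the overfitting problem look smaller than it is.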