Bhandari Nikita, Walambe Rahee, Kotecha Ketan, Khare Satyajeet P
Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India.
Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India.
Front Mol Biosci. 2022 Nov 7;9:907150. doi: 10.3389/fmolb.2022.907150. eCollection 2022.
Computational analysis methods, including machine learning, have a significant impact on the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analyses, such as classification of sample observations or discovery of feature genes, require sophisticated computational approaches. In this review, we compile various statistical and computational tools used in the analysis of expression microarray data. Although the methods are discussed in the context of expression microarrays, they can also be applied to the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values and the methods usually employed to impute them. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery, along with their evaluation parameters, are described in detail. We believe that this detailed review will help users select appropriate methods for preprocessing and analyzing their data based on the expected outcome.