Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski", 125 Tsarigradsko Shosse Blvd., bl. 2, 1113 Sofia, Bulgaria.
Genes (Basel). 2023 Feb 4;14(2):412. doi: 10.3390/genes14020412.
For several decades, understanding gene activity and its role in the lives of organisms has been a research focus for scientists in different areas. Part of this work is the analysis of gene expression data to select differentially expressed genes. Methods that identify the genes of interest have been proposed based on statistical data analysis. The problem is that these methods do not agree well with one another: distinct methods produce different results. By taking advantage of unsupervised data analysis, an iterative clustering procedure that finds differentially expressed genes shows promising results. In the present paper, a comparative study of the clustering methods applied to gene expression analysis is presented to explain the choice of the clustering algorithm implemented in the method. Different distance measures are investigated to reveal those that increase the efficiency of the method in finding the real data structure. Further, the method is improved by incorporating an additional aggregation measure based on the standard deviation of the expression levels. Its use improves gene distinction, as additional differentially expressed genes are found. The method is summarized in a detailed procedure. The significance of the method is demonstrated by an analysis of two mouse strain data sets. The differentially expressed genes identified by the proposed method are compared with those selected by well-known statistical methods applied to the same data set.
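To illustrate why an aggregation measure based on the standard deviation can help distinguish genes, the following is a minimal sketch and not the procedure from the paper. The gene names, expression values, and helper functions (`aggregate`, `euclidean`, `manhattan`) are hypothetical, and the abstract does not state which distance measures or clustering algorithm the authors use. The sketch only shows that two genes with nearly equal mean expression can lie far apart once the standard deviation is added as a second feature, under two common distance measures.

```python
import math
import statistics


def aggregate(replicates):
    """Summarize one gene's replicate expression levels as (mean, sample std dev)."""
    return statistics.mean(replicates), statistics.stdev(replicates)


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy data (not from the paper): gene -> replicate expression levels.
expression = {
    "geneA": [5.0, 5.2, 4.8],  # stable expression
    "geneB": [5.1, 9.0, 1.2],  # similar mean, highly variable
    "geneC": [9.0, 9.1, 8.9],  # stable, higher expression
}

# Each gene becomes a two-feature point: (mean, standard deviation).
features = {gene: aggregate(levels) for gene, levels in expression.items()}

# Using the mean alone, geneA and geneB are almost indistinguishable.
mean_only_gap = abs(features["geneA"][0] - features["geneB"][0])

# With the standard deviation as a second feature, they separate clearly.
gap_euclidean = euclidean(features["geneA"], features["geneB"])
gap_manhattan = manhattan(features["geneA"], features["geneB"])

print("mean-only gap:", mean_only_gap)
print("Euclidean gap with std dev:", gap_euclidean)
print("Manhattan gap with std dev:", gap_manhattan)
```

In a clustering step, points such as these would be grouped by whichever distance measure is chosen, so the choice of measure and of aggregated features both affect which genes end up separated from the bulk.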