School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China.
BMC Bioinformatics. 2024 May 9;25(1):183. doi: 10.1186/s12859-024-05800-y.
In recent years, gene clustering analysis has become a widely used tool for studying gene function because it efficiently groups genes with similar expression patterns, which helps identify what those genes do. Caenorhabditis elegans is commonly used in embryonic research because its cell lineage from fertilized egg to adult is invariant. Biologists use 4D confocal imaging to observe gene expression dynamics at the single-cell level. However, the resulting tree-shaped time-series datasets pose two challenges. First, data points are not paired across individuals, so the series from different individuals cannot be compared point by point. Second, cell type heterogeneity should be taken into account during clustering in order to obtain more biologically meaningful results.
A biclustering model is proposed for tree-shaped single-cell gene expression data of Caenorhabditis elegans. Specifically, a tree-shaped piecewise polynomial function is first fitted to the non-pairwise gene expression time-series data. The objective function then combines four factors: Pearson correlation coefficients, which capture correlations between genes; p-values from the Kolmogorov-Smirnov test, which measure the similarity between cells; gene expression size; and bicluster overlap size. Finally, a genetic algorithm is used to optimize the objective function.
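As a rough illustration of how the four factors could be combined, the following minimal sketch scores one candidate bicluster. It is not the authors' objective function: the weighted-sum form, the default weights, the function names (`pearson`, `ks_pvalue`, `bicluster_score`), and the use of a plain gene-by-cell matrix of summarized expression values in place of fitted tree-shaped curves are all assumptions made for this example. The Kolmogorov-Smirnov p-value uses the standard asymptotic series with a small-sample correction, so it is approximate.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def ks_pvalue(a, b):
    """Approximate two-sample Kolmogorov-Smirnov p-value (asymptotic series)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d, i, j = 0.0, 0, 0
    while i < n and j < m:
        v = min(a[i], b[j])
        while i < n and a[i] <= v:
            i += 1
        while j < m and b[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))  # gap between the two empirical CDFs at v
    if d == 0.0:
        return 1.0
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    total, sign, prev = 0.0, 1.0, 0.0
    for k in range(1, 101):
        term = sign * 2.0 * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) <= 1e-8 * abs(total) or abs(term) <= 1e-3 * prev:
            return min(1.0, max(0.0, total))
        sign, prev = -sign, abs(term)
    return 1.0  # series did not converge: the samples are indistinguishable


def bicluster_score(expr, genes, cells, covered=frozenset(), weights=(1.0, 1.0, 0.1, 1.0)):
    """Hypothetical weighted score of one bicluster; expr[g][c] is an expression value."""
    w_corr, w_ks, w_size, w_overlap = weights
    gene_pairs = list(combinations(genes, 2))
    cell_pairs = list(combinations(cells, 2))
    # Factor 1: mean pairwise Pearson correlation between gene profiles.
    corr = sum(
        pearson([expr[g1][c] for c in cells], [expr[g2][c] for c in cells])
        for g1, g2 in gene_pairs
    ) / len(gene_pairs) if gene_pairs else 0.0
    # Factor 2: mean KS p-value between cell profiles (higher means more similar).
    ks = sum(
        ks_pvalue([expr[g][c1] for g in genes], [expr[g][c2] for g in genes])
        for c1, c2 in cell_pairs
    ) / len(cell_pairs) if cell_pairs else 0.0
    # Factor 3: mean expression level inside the bicluster.
    size = sum(expr[g][c] for g in genes for c in cells) / (len(genes) * len(cells))
    # Factor 4: fraction of entries already covered by earlier biclusters (penalized).
    overlap = sum((g, c) in covered for g in genes for c in cells) / (len(genes) * len(cells))
    return w_corr * corr + w_ks * ks + w_size * size - w_overlap * overlap
```

In a genetic algorithm, a score of this kind would serve as the fitness of an individual that encodes which genes and cells belong to the bicluster. The `covered` set discourages repeated discovery of the same region.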
Results on a small-scale dataset validate the feasibility and effectiveness of the model and show that it outperforms existing classical biclustering models. In addition, gene enrichment analysis of the results on the complete real dataset confirms that the discovered biclusters have significant biological relevance.