Kenarangi Taiebe, Bakhshi Enayatolah, InanlooRahatloo Kolsoum, Biglarian Akbar
Department of Biostatistics and Epidemiology, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran.
Department of Cell and Molecular Biology, School of Biology, College of Science, University of Tehran, Tehran, Iran.
Gastroenterol Hepatol Bed Bench. 2022;15(4):387-394. doi: 10.22037/ghfbb.v15i4.2488.
This study aimed to detect gene signatures in RNA-sequencing (RNA-seq) data using Pareto-optimal cluster size identification.
In recent years, RNA-seq has emerged as an important technology for transcriptome profiling. Gene expression signatures involving tens of genes have been shown to predict disease type and patient response to treatment.
A liver cancer RNA-seq dataset comprising 35 paired hepatocellular carcinoma (HCC) and non-tumor tissue samples was used in this study. Differentially expressed genes (DEGs) were identified after pre-filtering and normalization. A multi-objective optimization technique, multi-objective optimization for collecting cluster alternatives (MOCCA), was then used to find the Pareto-optimal cluster size for these DEGs, and k-means clustering was applied to the RNA-seq data. The best cluster, taken as the disease signature, was selected by averaging the pairwise Spearman correlation coefficients of all genes in each module. All analyses were performed in R 4.1.1 in a virtual environment with 100 GB of RAM.
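The module-scoring step can be illustrated with a minimal Python sketch. It is not the authors' R code: the gene names, expression values, and cluster labels are invented toy data, the function names are hypothetical, and the labels are assumed to come from a prior k-means run. Each cluster is scored by the mean Spearman correlation over all gene pairs it contains, and the highest-scoring cluster is kept.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def score_clusters(expression, labels):
    """Map each cluster label to its average pairwise Spearman correlation."""
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(expression[gene])
    return {
        label: mean(spearman(x, y) for x, y in combinations(profiles, 2))
        for label, profiles in clusters.items()
        if len(profiles) >= 2  # a single gene has no pairs to average
    }


# Toy data (invented): rows are genes, columns are samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 9.0],
    "g3": [1.5, 2.5, 3.5, 5.0],
    "g4": [4.0, 1.0, 3.0, 2.0],
    "g5": [1.0, 4.0, 2.0, 3.0],
}
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}  # assumed k-means output

scores = score_clusters(expression, labels)
best = max(scores, key=scores.get)  # cluster with the most coherent genes
```

In this toy input the three genes of cluster 0 rise together across samples, so that cluster receives the higher score and would be kept as the candidate signature.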
Using MOCCA, eight Pareto-optimal clusters were obtained. Ultimately, the two clusters with the highest average Spearman's correlation coefficients were chosen as gene signatures. Eleven prognostic genes involved in the abnormal metabolism of HCC were identified. In addition, three differentially expressed pathways were identified between tumor and non-tumor tissues.
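To make the notion of a Pareto-optimal cluster size concrete, the following Python sketch filters candidate cluster numbers by Pareto dominance. It is a generic illustration, not the MOCCA implementation: the two validation scores per candidate are invented values, higher is assumed to be better for both, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidate keys whose score tuples no other candidate dominates."""
    return sorted(
        k
        for k, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_k, other in candidates.items()
            if other_k != k
        )
    )


# Hypothetical validation scores per cluster number k: (index_a, index_b).
candidates = {
    2: (0.60, 0.40),
    3: (0.55, 0.35),  # dominated by k = 2
    4: (0.50, 0.70),
    5: (0.45, 0.65),  # dominated by k = 4
    8: (0.30, 0.90),
}

front = pareto_front(candidates)
```

A candidate stays on the front when no other cluster number is at least as good on every objective and strictly better on at least one, so several cluster sizes can be Pareto-optimal at once.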
The identified metabolic prognostic genes can provide more powerful prognostic information and improve survival prediction for HCC patients. In addition, Pareto-optimal cluster size identification is suggested for gene signature detection in other RNA-seq datasets.