Huang Tinghua, Niu Siqi, Zhang Fanghong, Wang Binyu, Wang Jianwu, Liu Guoping, Yao Min
College of Animal Science and Technology, Yangtze University, Jingzhou, China.
College of Agriculture, Yangtze University, Jingzhou, China.
Front Genet. 2024 Nov 29;15:1511456. doi: 10.3389/fgene.2024.1511456. eCollection 2024.
Identifying key transcription factors from transcriptome data by correlating gene expression levels with transcription factor binding sites is an important step in transcriptome analysis. In a typical workflow, a threshold is set to keep only the top-ranked differentially expressed genes and the top-ranked transcription factor binding sites. However, correlation analysis of such filtered data can often produce spurious correlations. In this study, we tested four methods for creating the gene expression input (a ranked gene list) for the correlation analysis: star coordinate map transformation (START), expression differential score (ED), preferential expression measure (PEM), and the specificity measure (SPM). Kendall's tau correlation algorithms implementing the standard (STD), LINEAR, MIX-LINEAR, DENSITY-CURVE, and MIXED-DENSITY-CURVE weighting methods were then used to identify key transcription factors. ED was identified as the optimal method for creating a ranked gene list from filtered expression data, and it addresses the "unable to detect negative correlation" problem that affects the other methods. MIXED-DENSITY-CURVE weighting was the most sensitive for identifying transcription factors from gene sets and lists in which only the top proportion was correlated. Ultimately, 644 candidate transcription factors were identified from the transcriptome data of 1,206 cell lines, six of which were validated by wet-lab experiments. The Jinzer and Flaver software implementing these methods can be obtained from http://www.thua45/cn/flaver under a free academic license.
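The idea of a rank-weighted Kendall's tau, as named in the abstract, can be illustrated with a minimal sketch. The abstract does not define the LINEAR or density-curve weights, so everything here is an assumption for illustration: the function names `linear_weights` and `weighted_kendall_tau`, the linearly decaying weight formula, and the choice to weight each gene pair by the product of the two genes' weights. It is not the Flaver implementation.

```python
from itertools import combinations


def linear_weights(n):
    """Hypothetical linear weights: the top-ranked item gets 1, the last gets 1/n."""
    return [(n - i) / n for i in range(n)]


def weighted_kendall_tau(x, y, weights=None):
    """Weighted tau-a between two score vectors of equal length.

    x       -- gene expression scores (defines the ranked gene list)
    y       -- binding-site scores for the same genes
    weights -- one weight per rank position in x (descending); uniform if None
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 2:
        raise ValueError("need at least two items")
    if weights is None:
        weights = [1.0] * n  # uniform weights give the standard tau-a

    # Assign each item the weight of its position in the list ranked by x.
    order = sorted(range(n), key=lambda i: -x[i])
    w = [0.0] * n
    for pos, i in enumerate(order):
        w[i] = weights[pos]

    num = 0.0
    den = 0.0
    for i, j in combinations(range(n), 2):
        pair_w = w[i] * w[j]  # assumed pair weight: product of item weights
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        sign_x = (dx > 0) - (dx < 0)
        sign_y = (dy > 0) - (dy < 0)
        num += pair_w * sign_x * sign_y  # +w concordant, -w discordant, 0 tied
        den += pair_w
    return num / den


# Toy data: the two rankings agree at the top and disagree only at the bottom.
expression = [4, 3, 2, 1]
binding = [4, 3, 1, 2]
tau_std = weighted_kendall_tau(expression, binding)
tau_lin = weighted_kendall_tau(expression, binding, linear_weights(4))
print(tau_std, tau_lin)
```

In this toy case the single disagreement sits at the bottom of the list, so down-weighting low ranks raises the coefficient relative to the unweighted value, which is the behaviour one would want when only the top proportion of a list is expected to be correlated. A fully reversed list still yields -1, so negative correlations remain detectable under this weighting.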