Michielsen Lieke, Reinders Marcel J T, Mahfouz Ahmed
Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands.
Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands.
Front Bioinform. 2024 Mar 4;4:1347276. doi: 10.3389/fbinf.2024.1347276. eCollection 2024.
Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profile of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.