Rahman Raza Ur, Ahmad Iftikhar, Li Zixiu, Sparks Robert P, Ben Saad Amel, Mullen Alan C
Division of Gastroenterology, University of Massachusetts Chan Medical School, Worcester, MA, 01605, USA.
Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA.
Sci Rep. 2025 Aug 12;15(1):29542. doi: 10.1038/s41598-025-13528-9.
Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of gene expression in individual cell types, but scRNA-seq studies have focused primarily on the expression of protein-coding genes. Long noncoding RNAs (lncRNAs) are more diverse than protein-coding genes, yet they remain underexplored, in part because they are underrepresented in the reference annotations applied to scRNA-seq. Simply merging protein-coding and lncRNA annotations is not sufficient, because adding lncRNA genes that overlap protein-coding genes in the sense or antisense orientation changes how reads are counted for both gene classes. Here, we introduce Singletrome, a Singularity image that integrates protein-coding and lncRNA gene transfer format (GTF) annotations to generate enhanced annotations that account for the sense and antisense overlap of annotated genes, maps scRNA-seq data, and produces files for downstream analysis and visualization. With Singletrome, we detected thousands of lncRNAs not included in GENCODE, clustered cell types based solely on lncRNA expression, and demonstrated that machine learning can predict cell type and disease from lncRNA expression alone. This comprehensive annotation will allow lncRNA expression to be mapped across cell types of the human body, facilitating the development of an atlas of human lncRNAs in health and disease that can integrate new lncRNA annotations as they become available.
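To make the overlap problem described in the abstract concrete, the following minimal Python sketch flags lncRNA genes whose coordinates overlap a protein-coding gene and labels each overlap as sense (same strand) or antisense (opposite strand). It is an illustration only and is not Singletrome's implementation: the `Gene` class, the helper names, the toy gene IDs, and the reliance on a `gene_type` attribute are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record parsed from a GTF 'gene' line."""
    gene_id: str
    chrom: str
    start: int   # 1-based, inclusive (GTF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    biotype: str


def parse_gtf_genes(lines):
    """Parse 'gene' features from GTF lines into Gene records.

    Assumes the biotype is stored in a 'gene_type' attribute (falling
    back to 'gene_biotype'); real annotations may differ.
    """
    genes = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip('"')
        biotype = attrs.get("gene_type", attrs.get("gene_biotype", ""))
        genes.append(Gene(attrs.get("gene_id", ""), fields[0],
                          int(fields[3]), int(fields[4]), fields[6], biotype))
    return genes


def classify_overlaps(genes):
    """Return (lncRNA_id, coding_id, 'sense' | 'antisense') for each overlap."""
    coding = [g for g in genes if g.biotype == "protein_coding"]
    lncrnas = [g for g in genes if g.biotype == "lncRNA"]
    overlaps = []
    # Quadratic scan: fine for a toy example, too slow for a full annotation.
    for lnc, pc in product(lncrnas, coding):
        if lnc.chrom != pc.chrom:
            continue
        if lnc.start <= pc.end and pc.start <= lnc.end:
            kind = "sense" if lnc.strand == pc.strand else "antisense"
            overlaps.append((lnc.gene_id, pc.gene_id, kind))
    return overlaps


def toy_gtf_line(chrom, start, end, strand, gene_id, biotype):
    """Build one tab-separated GTF 'gene' line for the toy example."""
    attrs = f'gene_id "{gene_id}"; gene_type "{biotype}";'
    return "\t".join([chrom, "toy", "gene", str(start), str(end),
                      ".", strand, ".", attrs])


# Toy annotation with made-up gene IDs and coordinates.
EXAMPLE_GTF = [
    toy_gtf_line("chr1", 100, 500, "+", "PC1", "protein_coding"),
    toy_gtf_line("chr1", 400, 700, "+", "LNC_SENSE", "lncRNA"),
    toy_gtf_line("chr1", 50, 150, "-", "LNC_ANTI", "lncRNA"),
    toy_gtf_line("chr1", 900, 1000, "+", "LNC_FAR", "lncRNA"),
    toy_gtf_line("chr2", 100, 500, "-", "LNC_OTHER", "lncRNA"),
]

if __name__ == "__main__":
    for row in classify_overlaps(parse_gtf_genes(EXAMPLE_GTF)):
        print(row)
```

In this toy annotation only `LNC_SENSE` and `LNC_ANTI` overlap `PC1`. A real annotation contains tens of thousands of genes, so an interval index would be preferable to the pairwise scan used here. How each overlap class should then be handled during read counting is the question the enhanced annotation addresses, and this sketch does not attempt it.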