Michielsen Lieke, Lotfollahi Mohammad, Strobl Daniel, Sikkema Lisa, Reinders Marcel J T, Theis Fabian J, Mahfouz Ahmed
Department of Human Genetics, Leiden University Medical Center, 2333ZC Leiden, The Netherlands.
Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands.
NAR Genom Bioinform. 2023 Jul 26;5(3):lqad070. doi: 10.1093/nargab/lqad070. eCollection 2023 Sep.
Single-cell genomics is producing an ever-increasing number of datasets that, when integrated, could provide large-scale reference atlases of tissue in health and disease. Such atlases increase the scale and generalizability of analyses and make it possible to combine knowledge generated by individual studies. Individual studies often differ in cell annotation terminology and depth, because different groups specialize in different cell type compartments and use distinct terminology. Understanding how these distinct sets of annotations relate to and complement each other would mark a major step towards a consensus-based cell-type annotation reflecting the latest knowledge in the field. Recent computational techniques, referred to as 'reference mapping' methods, facilitate the use and expansion of existing reference atlases by mapping new datasets (i.e. queries) onto an atlas, but a systematic approach to harmonizing dataset-specific cell-type terminology and annotation depth is still lacking. Here, we present 'treeArches', a framework to automatically build and extend reference atlases while enriching them with an updatable hierarchy of cell-type annotations across different datasets. We demonstrate various use cases for treeArches, from automatically resolving relations between reference and query cell types to identifying unseen cell types absent from the reference, such as disease-associated cell states. We envision treeArches enabling data-driven construction of consensus atlas-level cell-type hierarchies and facilitating efficient use of reference atlases.