Department of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland.
Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac355.
A knowledge-based grouping of genes into pathways or functional units is essential for describing and understanding cellular complexity. However, for a given application, it is not always clear a priori how, and at what level of specificity, functionally interconnected genes should be partitioned into pathways. Here, we assess and compare nine existing and two conceptually novel functional classification systems with respect to their discovery power and generality in gene set enrichment testing. We base our assessment on a collection of nearly 2000 functional genomics datasets provided by users of the STRING database. With these diverse, real-life queries, we assess which systems typically provide the most specific and complete enrichment results. We find many structural and performance differences between classification systems. Overall, the well-established, hierarchically organized pathway annotation systems yield the best enrichment performance, despite covering substantial parts of the human genome in general terms only. On the other hand, the more recent unsupervised annotation systems perform best in understudied areas and organisms, and in detecting more specific pathways, albeit with less informative labels.
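For readers unfamiliar with gene set enrichment testing, the following is a minimal sketch of one common form of it, an over-representation test based on the hypergeometric distribution. The abstract does not state which statistical test the study used, so this is an illustrative assumption rather than the authors' method. The function name `enrichment_p_value`, its parameters, and the toy gene identifiers are all hypothetical.

```python
from math import comb


def enrichment_p_value(query, gene_set, background_size):
    """One-sided hypergeometric p-value for over-representation.

    Assumption (not from the source): the query and the gene set are both
    subsets of a background of `background_size` genes, and enrichment is
    scored as P(X >= k), where k is the observed overlap.
    """
    query, gene_set = set(query), set(gene_set)
    k = len(query & gene_set)   # observed overlap
    n = len(query)              # number of genes drawn (query size)
    K = len(gene_set)           # annotated genes in the background
    N = background_size         # total genes in the background

    # Sum the probabilities of seeing k or more annotated genes in the query.
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


# Toy example with made-up gene identifiers.
p = enrichment_p_value({"A", "B", "C"}, {"A", "B", "D", "E"}, background_size=10)
print(p)
```

In practice, one such p-value would be computed per pathway or term in a classification system and then corrected for multiple testing. How the terms are defined, and how many there are, is where the choice of classification system affects the result.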