Ayllon-Benitez Aaron, Bourqui Romain, Thébault Patricia, Mougin Fleur
University of Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux 33000, France.
University of Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux 33400, France.
NAR Genom Bioinform. 2020 Mar 14;2(2):lqaa017. doi: 10.1093/nargab/lqaa017. eCollection 2020 Jun.
The revolution in sequencing technologies is leading to new insights into the relations between genotype and phenotype. To interpret and analyze data that are grouped according to a phenotype of interest, methods based on statistical enrichment have become standard in biology. However, these methods summarize biological information by selecting over-represented terms and tend to focus on the most-studied genes, which may cover only a limited fraction of the annotated genes within a gene set. Semantic similarity measures have performed well in pairwise gene comparison by taking advantage of the underlying structure of the Gene Ontology. We developed GSAn, a novel gene set annotation method that uses semantic similarity measures to synthesize Gene Ontology annotation terms. The originality of our approach is to identify the best compromise between the number of retained annotation terms, which has to be drastically reduced, and the number of related genes, which has to be as large as possible. Moreover, GSAn offers interactive visualization facilities dedicated to the multi-scale analysis of gene set annotations. Compared to enrichment analysis tools, GSAn has shown excellent results in terms of maximizing gene coverage while minimizing the number of terms.
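The trade-off described above, few retained terms versus many covered genes, can be illustrated with a small sketch. This is not GSAn's algorithm, which relies on semantic similarity measures over the Gene Ontology; it is a plain greedy set-cover heuristic used here only to make the compromise concrete. The function name `greedy_term_cover`, the `min_coverage` parameter, and the toy term and gene identifiers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_term_cover(
    term_to_genes: Dict[str, Set[str]],
    gene_set: Set[str],
    min_coverage: float = 1.0,
) -> Tuple[List[str], float]:
    """Pick a small list of terms that covers at least `min_coverage` of `gene_set`.

    Greedy set-cover heuristic (an illustrative stand-in, not GSAn itself):
    at each step, keep the term that annotates the most still-uncovered genes.
    """
    if not gene_set:
        return [], 0.0

    # Only genes that belong to the gene set count toward coverage.
    candidates = {term: genes & gene_set for term, genes in term_to_genes.items()}
    covered: Set[str] = set()
    chosen: List[str] = []

    while candidates and len(covered) / len(gene_set) < min_coverage:
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(candidates), key=lambda t: len(candidates[t] - covered))
        if not candidates[best] - covered:
            break  # no remaining term adds a new gene
        chosen.append(best)
        covered |= candidates.pop(best)

    return chosen, len(covered) / len(gene_set)


# Toy data: hypothetical terms and genes, not real Gene Ontology annotations.
annotations = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g3", "g4"},
    "term_C": {"g4", "g5"},
    "term_D": {"g1"},
}
genes = {"g1", "g2", "g3", "g4", "g5"}

terms, coverage = greedy_term_cover(annotations, genes)
print(terms, coverage)
```

Lowering `min_coverage` shortens the list of retained terms at the cost of leaving some genes unannotated, which is the compromise the abstract refers to.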