

The use of semantic similarity measures for optimally integrating heterogeneous Gene Ontology data from large scale annotation pipelines.

Affiliation

Computational Biology Group, Department of Clinical Laboratory Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa.

Publication information

Front Genet. 2014 Aug 6;5:264. doi: 10.3389/fgene.2014.00264. eCollection 2014.

Abstract

Advances in high-throughput sequencing technologies have increased the number of genome sequencing projects worldwide, yielding complete genome sequences for humans, animals, and plants. Several labs have subsequently focused on genome annotation, which consists of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, annotations have become increasingly heterogeneous across genomes, both because different pipelines use different approaches to infer them and because of the nature of the GO structure itself. This makes a curator's task difficult, even when they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster; to assess the quality of GO annotations derived from InterPro2GO mappings compared to manually annotated GO annotations for the Drosophila melanogaster proteome from a FlyBase dataset and for human; and to filter GO annotation data for these proteomes. The results indicate that efficient integration of GO annotations eliminates up to 27.08% and 22.32% of the redundancy in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified lacking and missing annotations for some orthologs, as well as annotation mismatches between the InterPro2GO and manual pipelines in these two proteomes, which therefore require further curation. This approach simplifies and facilitates the task of curators in assessing protein annotations, reduces redundancy, and eliminates inconsistencies in large annotation datasets, easing comparative functional genomics.
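To make the idea of redundancy in GO annotation sets concrete, the following is a minimal sketch, not the authors' method, which the abstract does not specify in detail. It assumes that an annotation is redundant when it is an ancestor of another, more specific term already assigned to the same protein, and it uses a simple Jaccard overlap of ancestor closures as a stand-in for a semantic similarity measure. The ontology, term identifiers, and function names are all hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy ontology (NOT real GO terms): term -> direct parents.
TOY_PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:metabolism": {"T:root"},
    "T:transport": {"T:root"},
    "T:lipid_metabolism": {"T:metabolism"},
    "T:lipid_transport": {"T:transport"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all proper ancestors of `term` by walking parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def remove_redundant(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Drop terms already implied by a more specific term in the same set."""
    term_set = set(terms)
    implied: Set[str] = set()
    for term in term_set:
        implied |= ancestors(term, parents)
    return term_set - implied


def closure(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the terms together with all of their ancestors."""
    result = set(terms)
    for term in list(result):
        result |= ancestors(term, parents)
    return result


def jaccard_similarity(a: Iterable[str], b: Iterable[str],
                       parents: Dict[str, Set[str]]) -> float:
    """Assumed stand-in measure: Jaccard overlap of the two ancestor closures."""
    closure_a, closure_b = closure(a, parents), closure(b, parents)
    union = closure_a | closure_b
    return len(closure_a & closure_b) / len(union) if union else 0.0


if __name__ == "__main__":
    pipeline_a = {"T:root", "T:metabolism", "T:lipid_metabolism"}
    pipeline_b = {"T:metabolism"}
    print(remove_redundant(pipeline_a, TOY_PARENTS))
    print(jaccard_similarity(pipeline_a, pipeline_b, TOY_PARENTS))
```

In this toy setting, two pipelines that annotate the same protein at different depths of the hierarchy still score as partially similar, and the merged set can be reduced to its most specific terms. Published semantic similarity measures are typically more elaborate, for example weighting terms by information content.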


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9e87/4123725/59488c7782cb/fgene-05-00264-g0001.jpg
