Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks.

Author information

Daraselia Nikolai, Yuryev Anton, Egorov Sergei, Mazo Ilya, Ispolatov Iaroslav

Affiliation

Ariadne Genomics, Inc, Rockville, MD 20850, USA.

Publication information

BMC Bioinformatics. 2007 Jul 10;8:243. doi: 10.1186/1471-2105-8-243.

Abstract

BACKGROUND

Uncovering the cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. Protein functions, often referred to as annotations, are believed to manifest themselves through the topology of the networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets.

RESULTS

We developed a method for the automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates and compared the performance of the automatic extraction technology to that of the manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we report a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly more densely linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease in the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We found that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller.
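The two-way comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network, the function names, the null model (uniform random protein sets of the same size), and the use of a hypergeometric test for cluster/group overlap are all assumptions made for illustration.

```python
import random
from math import comb

# Hypothetical toy network: proteins A..H, undirected links stored as frozensets.
NODES = list("ABCDEFGH")
EDGES = {frozenset(pair) for pair in ["AB", "BC", "AC", "CD", "EF", "GH"]}


def link_density(edges, group):
    """Fraction of all possible protein pairs inside `group` that are linked."""
    members = set(group)
    n = len(members)
    if n < 2:
        return 0.0
    linked = sum(1 for e in edges if len(e) == 2 and e <= members)
    return linked / comb(n, 2)


def density_p_value(edges, nodes, group, trials=1000, seed=0):
    """Empirical p-value that a random set of the same size is at least as dense.

    Assumption: the null model samples proteins uniformly at random; a
    degree-preserving null model may be more appropriate for real networks.
    """
    rng = random.Random(seed)
    observed = link_density(edges, group)
    size = len(set(group))
    hits = sum(
        1
        for _ in range(trials)
        if link_density(edges, rng.sample(nodes, size)) >= observed
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (trials + 1)


def overlap_p_value(total, group_size, cluster_size, shared):
    """Hypergeometric probability of an overlap of at least `shared` proteins
    between an annotation group and a network cluster drawn from `total`."""
    upper = min(group_size, cluster_size)
    tail = sum(
        comb(group_size, k) * comb(total - group_size, cluster_size - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(total, cluster_size)


if __name__ == "__main__":
    go_group = ["A", "B", "C"]  # hypothetical annotation group
    density, p_dense = density_p_value(EDGES, NODES, go_group)
    print("density:", density, "empirical p:", p_dense)
    print("overlap p:", overlap_p_value(len(NODES), 3, 3, 3))
```

The first function measures how densely an annotation group is linked, the second compares that density with chance, and the third addresses the converse direction, namely whether a network cluster overlaps an annotation group more than expected at random.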

CONCLUSION

Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting an essential role for the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f0b8/1940026/30862a2a1319/1471-2105-8-243-1.jpg
