Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences.

Author information

Biotechnology and Bioengineering Graduate Program, Izmir Institute of Technology, Izmir, Turkey; Institute of Health Sciences, Dokuz Eylul University, Izmir, Turkey.

Publication information

PLoS One. 2013 Sep 12;8(9):e75458. doi: 10.1371/journal.pone.0075458. eCollection 2013.

DOI:10.1371/journal.pone.0075458

PMID:24069417

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3771926/

Abstract

Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
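
The abstract names three ingredients (sequence alignment, residue conservation scoring, and a graph-theoretical step) without giving their details. As a rough illustration only, the sketch below assumes a pre-computed toy alignment, scores each column with a normalized Shannon entropy, merges consecutive high-scoring columns into regions, and builds a sequence-to-region association table. The scoring formula, the thresholds, and all function names are assumptions made for this example, not the authors' method, and the graph-based step is not shown.

```python
import math
from collections import Counter

def column_conservation(column, min_seqs=2):
    """Score one alignment column as 1 - normalized Shannon entropy.

    Gaps ('-') are ignored, so a region shared by only a subset of the
    sequences can still score highly. Columns with fewer than min_seqs
    residues score 0. (Illustrative scoring, not the paper's.)
    """
    residues = [c for c in column if c != "-"]
    if len(residues) < min_seqs:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 standard amino acids

def conserved_regions(alignment, threshold=0.9, min_len=3):
    """Return half-open (start, end) column ranges of conserved runs."""
    n_cols = len(alignment[0])
    scores = [column_conservation([row[i] for row in alignment])
              for i in range(n_cols)]
    regions, start = [], None
    # A trailing 0.0 sentinel closes a run that reaches the last column.
    for i, s in enumerate(scores + [0.0]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

def association_table(names, alignment, regions, min_coverage=0.8):
    """Map each sequence name to the indices of the regions it covers."""
    table = {}
    for name, row in zip(names, alignment):
        hits = []
        for idx, (s, e) in enumerate(regions):
            covered = sum(1 for c in row[s:e] if c != "-") / (e - s)
            if covered >= min_coverage:
                hits.append(idx)
        table[name] = hits
    return table

# Hypothetical toy alignment: p1 lacks the second region, p4 lacks the first.
names = ["p1", "p2", "p3", "p4"]
alignment = [
    "MKVLAD----",
    "MKVLSEGHWC",
    "MKVLTRGHWC",
    "----QNGHWC",
]

regions = conserved_regions(alignment)
table = association_table(names, alignment, regions)
print(regions)  # [(0, 4), (6, 10)]
print(table)    # {'p1': [0], 'p2': [0, 1], 'p3': [0, 1], 'p4': [1]}
```

In this toy case each region is present in only three of the four sequences, which mirrors the abstract's point that a conserved segment need not occur in every member of the collection. The resulting table is the kind of sequence-to-region association that a downstream clustering or graph step could consume.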

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2e78/3771926/dabc000ca1e5/pone.0075458.g001.jpg

Similar articles

Sequence-based enzyme catalytic domain prediction using clustering and aggregated mutual information content.

J Bioinform Comput Biol. 2011 Oct;9(5):597-611. doi: 10.1142/s0219720011005677.

Joint evolutionary trees: a large-scale method to predict protein interfaces based on sequence sampling.

PLoS Comput Biol. 2009 Jan;5(1):e1000267. doi: 10.1371/journal.pcbi.1000267. Epub 2009 Jan 23.

DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment.

BMC Bioinformatics. 2005 Mar 22;6:66. doi: 10.1186/1471-2105-6-66.

Predicting ligand binding residues and functional sites using multipositional correlations with graph theoretic clustering and kernel CCA.

IEEE/ACM Trans Comput Biol Bioinform. 2012 Jul-Aug;9(4):992-1001. doi: 10.1109/TCBB.2011.136.

Functional region prediction with a set of appropriate homologous sequences--an index for sequence selection by integrating structure and sequence information with spatial statistics.

BMC Struct Biol. 2012 May 29;12:11. doi: 10.1186/1472-6807-12-11.

Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.

BMC Bioinformatics. 2013;14 Suppl 11(Suppl 11):S2. doi: 10.1186/1471-2105-14-S11-S2. Epub 2013 Sep 13.

Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics.

Protein Sci. 2000 Dec;9(12):2313-21. doi: 10.1110/ps.9.12.2313.

HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.

PLoS One. 2011 Mar 10;6(3):e17568. doi: 10.1371/journal.pone.0017568.

FASSM: enhanced function association in whole genome analysis using sequence and structural motifs.

In Silico Biol. 2005;5(5-6):425-38.

Cited by

Rational Design of Profile HMMs for Sensitive and Specific Sequence Detection with Case Studies Applied to Viruses, Bacteriophages, and Casposons.

Viruses. 2023 Feb 13;15(2):519. doi: 10.3390/v15020519.

Protein domain-based prediction of drug/compound-target interactions and experimental validation on LIM kinases.

PLoS Comput Biol. 2021 Nov 29;17(11):e1009171. doi: 10.1371/journal.pcbi.1009171. eCollection 2021 Nov.

Evolutionary Conservation and Expression Patterns of Neutral/Alkaline Invertases in .

Biomolecules. 2019 Nov 21;9(12):763. doi: 10.3390/biom9120763.

UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB.

Bioinformatics. 2016 Aug 1;32(15):2264-71. doi: 10.1093/bioinformatics/btw114. Epub 2016 Mar 7.

References

The Pfam protein families database.

Nucleic Acids Res. 2012 Jan;40(Database issue):D290-301. doi: 10.1093/nar/gkr1065. Epub 2011 Nov 29.

InterPro in 2011: new developments in the family and domain prediction database.

Nucleic Acids Res. 2012 Jan;40(Database issue):D306-12. doi: 10.1093/nar/gkr948. Epub 2011 Nov 16.

Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75.

HMMER web server: interactive sequence similarity searching.

Nucleic Acids Res. 2011 Jul;39(Web Server issue):W29-37. doi: 10.1093/nar/gkr367. Epub 2011 May 18.

Improving the quality of protein similarity network clustering algorithms using the network edge weight distribution.

Bioinformatics. 2011 Feb 1;27(3):326-33. doi: 10.1093/bioinformatics/btq655. Epub 2010 Nov 29.

CDD: a Conserved Domain Database for the functional annotation of proteins.

Nucleic Acids Res. 2011 Jan;39(Database issue):D225-9. doi: 10.1093/nar/gkq1189. Epub 2010 Nov 24.

Extending CATH: increasing coverage of the protein structure universe and linking structure with function.

Nucleic Acids Res. 2011 Jan;39(Database issue):D420-6. doi: 10.1093/nar/gkq1001. Epub 2010 Nov 19.

SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale.

BMC Bioinformatics. 2010 Mar 9;11:120. doi: 10.1186/1471-2105-11-120.

Family classification without domain chaining.

Bioinformatics. 2009 Jun 15;25(12):i45-53. doi: 10.1093/bioinformatics/btp207.

Infrastructure for the life sciences: design and implementation of the UniProt website.

BMC Bioinformatics. 2009 May 8;10:136. doi: 10.1186/1471-2105-10-136.
