Biotechnology and Bioengineering Graduate Program, Izmir Institute of Technology, Izmir, Turkey; Institute of Health Sciences, Dokuz Eylul University, Izmir, Turkey.
PLoS One. 2013 Sep 12;8(9):e75458. doi: 10.1371/journal.pone.0075458. eCollection 2013.
Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from sequence datasets to suit the purpose at hand. Existing statistical methods are strained, however, when the collection contains remote sequences that align poorly with the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments in a diverse collection of protein sequences, including segments present in only a small fraction of the sequences, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members of existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold-standard datasets and comparing the clustering performance with that of previous methods from the literature. We then applied the proposed method to a genome-wide dataset of 17793 human proteins and generated a global map associating the proteins with each of the 4753 identified conserved regions. Investigation of the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
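The idea of scoring residue conservation and tabulating which sequences carry which conserved region can be illustrated with a toy sketch. This is not the method of the paper: it assumes a precomputed gapped multiple alignment, uses a simple entropy-based conservation score, and omits the graph-theoretical step entirely. The function names, the thresholds (`threshold`, `min_len`, `min_support`) and the four example sequences are all hypothetical.

```python
import math
from collections import Counter


def column_conservation(column, min_support=2):
    """Score one alignment column as 1 - normalized Shannon entropy.

    Gaps ('-') are ignored. Columns with fewer than `min_support`
    residues score 0.0 (illustrative choice, not from the paper).
    """
    residues = [c for c in column if c != "-"]
    n = len(residues)
    if n < min_support:
        return 0.0
    counts = Counter(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 standard amino acids


def conserved_segments(rows, threshold=0.9, min_len=3, min_support=2):
    """Return half-open (start, end) column ranges of conserved runs."""
    scores = [column_conservation(col, min_support) for col in zip(*rows)]
    segments, start = [], None
    # The trailing 0.0 is a sentinel that closes a run ending at the last column.
    for i, score in enumerate(scores + [0.0]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments


def association_table(alignment, segments):
    """Map each segment to the sequences that cover it without gaps."""
    return {
        (start, end): sorted(
            name for name, row in alignment.items() if "-" not in row[start:end]
        )
        for start, end in segments
    }


# Hypothetical toy alignment: p3 carries both regions, p4 only the second.
alignment = {
    "p1": "MKVLA----",
    "p2": "MKVLC----",
    "p3": "MKVLDGHWW",
    "p4": "----EGHWW",
}
segments = conserved_segments(list(alignment.values()))
table = association_table(alignment, segments)
print(table)
```

In this toy case the second region is found even though only two of the four sequences carry it, and the multi-domain sequence `p3` appears in both rows of the table. A real pipeline would need a far more careful scoring and alignment strategy than this sketch.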