Homology-based method for identification of protein repeats using statistical significance estimates.

Authors

Andrade M A, Ponting C P, Gibson T J, Bork P

Affiliation

European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, 69012, Germany.

Publication

J Mol Biol. 2000 May 5;298(3):521-37. doi: 10.1006/jmbi.2000.3684.

Abstract

Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
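The significance estimate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distribution is fitted to the scores, so the Gumbel (extreme-value) model, the method-of-moments fit, the function names and the `alpha` threshold below are all assumptions made for illustration. The idea shown is only that a candidate repeat with a modest profile score can still be called significant if that score is improbable under a background distribution of scores from unrelated sequences.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(background_scores):
    """Method-of-moments fit of a Gumbel distribution (assumed background model)."""
    beta = stdev(background_scores) * math.sqrt(6) / math.pi  # scale
    mu = mean(background_scores) - EULER_GAMMA * beta         # location
    return mu, beta


def p_value(score, mu, beta):
    """Probability that a background score is at least `score`."""
    z = (score - mu) / beta
    if z < -700:  # avoid overflow in exp(); the tail probability is 1 here
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small P-values
    return -math.expm1(-math.exp(-z))


def significant_repeats(hit_scores, background_scores, alpha=0.01):
    """Return (score, P-value) pairs for hits whose P-value is at most alpha.

    `hit_scores` stands for optimal and sub-optimal profile scores found in
    one sequence; `background_scores` stands for scores against unrelated
    sequences. Both are hypothetical inputs for this sketch.
    """
    mu, beta = fit_gumbel(background_scores)
    selected = []
    for score in hit_scores:
        p = p_value(score, mu, beta)
        if p <= alpha:
            selected.append((score, p))
    return selected
```

In an iterative scheme of the kind the abstract mentions, the hits accepted in one round would presumably be used to rebuild the profile before scoring again; that loop is omitted here because the abstract gives no details of it.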
