Depledge Daniel P, Lower Ryan P J, Smith Deborah F
Immunology and Infection Unit, Department of Biology, University of York, Heslington, York, UK.
BMC Bioinformatics. 2007 Apr 11;8:122. doi: 10.1186/1471-2105-8-122.
Amino acid repeat-containing proteins have a broad range of functions, and their identification is of relevance to many experimental biologists. In human-infective protozoan parasites (such as the kinetoplastid and Plasmodium species), they are implicated in immune evasion and have been shown to influence virulence and pathogenicity. RepSeq (http://repseq.gugbe.com) is a new database of amino acid repeat-containing proteins found in lower eukaryotic pathogens. The RepSeq database is accessed via a web-based application, which also provides links to related online tools and databases for further analyses.
The RepSeq algorithm typically identifies more than 98% of repeat-containing proteins and is capable of identifying both perfect and mismatch repeats. The proportion of proteins that contain repeat elements varies greatly between different families and even species (3-35% of the total protein content). The most common motif type is the Sequence Repeat Region (SRR), a repeated motif containing multiple different amino acid types. Proteins containing Single Amino Acid Repeats (SAARs) and Di-Peptide Repeats (DPRs) typically account for 0.5-1.0% of the total protein number. Notable exceptions are P. falciparum and D. discoideum, in which 33.67% and 34.28%, respectively, of the predicted proteomes consist of repeat-containing proteins. These high proportions are due to large insertions of low-complexity single- and multi-codon repeat regions.
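The three repeat classes named above can be illustrated with a minimal Python sketch. This is not the RepSeq algorithm itself, whose details are not given here. It detects perfect repeats only (mismatch repeats are not handled), and every threshold, regular expression, and function name (`find_repeats`, `repeat_fraction`) is an assumption chosen for illustration.

```python
import re

# Assumed thresholds for illustration; not taken from RepSeq.
MIN_SAAR_LEN = 5         # minimum run length of a single residue
MIN_DPR_UNITS = 4        # minimum copies of a two-residue unit
MIN_SRR_UNITS = 3        # minimum copies of a longer unit
SRR_UNIT_RANGE = (3, 10) # allowed unit lengths for an SRR

# Single Amino Acid Repeat: one residue repeated back to back.
SAAR_RE = re.compile(r"(.)\1{%d,}" % (MIN_SAAR_LEN - 1))
# Di-Peptide Repeat: a unit of two different residues repeated in tandem.
DPR_RE = re.compile(r"((.)(?!\2).)\1{%d,}" % (MIN_DPR_UNITS - 1))
# Sequence Repeat Region: a longer unit repeated in tandem.
SRR_RE = re.compile(
    r"(.{%d,%d}?)\1{%d,}" % (SRR_UNIT_RANGE[0], SRR_UNIT_RANGE[1], MIN_SRR_UNITS - 1)
)


def find_repeats(seq):
    """Return (repeat_type, start, end, unit) tuples for perfect tandem repeats."""
    hits = []
    for m in SAAR_RE.finditer(seq):
        hits.append(("SAAR", m.start(), m.end(), m.group(1)))
    for m in DPR_RE.finditer(seq):
        hits.append(("DPR", m.start(), m.end(), m.group(1)))
    for m in SRR_RE.finditer(seq):
        unit = m.group(1)
        # Skip units that are themselves repeats of a shorter unit
        # (e.g. "AAA" or "ABAB"); those belong to the SAAR/DPR classes.
        if unit not in (unit + unit)[1:-1]:
            hits.append(("SRR", m.start(), m.end(), unit))
    return hits


def repeat_fraction(proteome):
    """Percentage of sequences in `proteome` with at least one detected repeat."""
    seqs = list(proteome)
    if not seqs:
        return 0.0
    with_repeats = sum(1 for s in seqs if find_repeats(s))
    return 100.0 * with_repeats / len(seqs)


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    toy = ["MKQQQQQQLV", "MSTAVLK", "MSGSGSGSGK", "MEEKPEEKPEEKPT"]
    for s in toy:
        print(s, find_repeats(s))
    print("repeat-containing: %.1f%%" % repeat_fraction(toy))
```

Applying `repeat_fraction` to each predicted proteome in turn would give the kind of per-species proportion quoted above, although the values would depend entirely on the assumed thresholds.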
The RepSeq database provides a repository for repeat-containing proteins found in parasitic protozoa. The database allows for both individual and cross-species proteome analyses and also allows users to upload sequences of interest for analysis by the RepSeq algorithm. Identification of repeat-containing proteins provides researchers with a defined subset of proteins which can be analysed by expression profiling and functional characterisation, thereby facilitating study of pathogenicity and virulence factors in the parasitic protozoa. While primarily designed for kinetoplastid work, the RepSeq algorithm and database retain full functionality when used to analyse other species.