Vries John K, Liu Xiong
Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.
BMC Bioinformatics. 2008 Jan 30;9:72. doi: 10.1186/1471-2105-9-72.
A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem in which the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank-ordered by similarity to the query.
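To make the NP{n,m} definition and the signal-to-noise framing concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that every placement of the m wildcards inside a window is enumerated and that each pattern is counted once per target sequence; the abstract specifies neither. The singular value decomposition step is not reproduced: a plain signal/noise ratio per query position is swapped in for illustration. The names `ngram_patterns`, `pattern_counts`, and `ratio_profile` are hypothetical.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "."  # assumed wildcard symbol; must not occur in the sequences


def ngram_patterns(seq, n, m):
    """Yield (start, pattern) for each NP{n,m} pattern in seq.

    Assumption: all placements of the m wildcards in a window of
    size n + m are enumerated.
    """
    width = n + m
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for keep in combinations(range(width), n):
            kept = set(keep)
            pattern = "".join(
                c if i in kept else WILDCARD for i, c in enumerate(window)
            )
            yield start, pattern


def pattern_counts(seqs, n, m):
    """Count, for each pattern, the number of sequences containing it."""
    counts = Counter()
    for s in seqs:
        counts.update({p for _, p in ngram_patterns(s, n, m)})
    return counts


def ratio_profile(query, similar, all_targets, n, m):
    """Per-position signal/noise ratio (a stand-in for the SVD step)."""
    signal = pattern_counts(similar, n, m)      # targets similar to the query
    noise = pattern_counts(all_targets, n, m)   # all targets
    sig = [0] * len(query)
    noi = [0] * len(query)
    for start, pattern in ngram_patterns(query, n, m):
        for offset, c in enumerate(pattern):
            if c != WILDCARD:  # credit only the residue positions
                sig[start + offset] += signal[pattern]
                noi[start + offset] += noise[pattern]
    return [s / t if t else 0.0 for s, t in zip(sig, noi)]
```

For example, `ratio_profile("ACDE", ["ACDE"], ["ACDE", "ACDF"], 2, 0)` scores the last position highest, because the pattern covering it occurs only in the similar target.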
The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two kinds of profile were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily-specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to that of the multiple alignment approach.
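For comparison, a consensus-style conservation profile can be sketched from a multiple alignment as follows. The abstract does not define the consensus approach, so this sketch assumes one common reading: the conservation of a column is the fraction of aligned sequences that carry its most frequent residue, with gaps counted against conservation. The function name `consensus_profile` and the gap symbol are hypothetical.

```python
from collections import Counter


def consensus_profile(alignment, gap="-"):
    """Return one conservation value in [0, 1] per alignment column.

    Assumption: conservation is the share of rows that carry the
    most frequent non-gap residue of the column.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        if not residues:
            profile.append(0.0)  # all-gap column carries no signal
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        profile.append(top_count / len(column))
    return profile
```

A column in which only one of three sequences has a residue therefore scores 1/3, which illustrates why a poorly represented subfamily yields a weak profile under this approach.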
Subfamily-specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.