Brown N P, Whittaker A J, Newell W R, Rawlings C J, Beck S
Biomedical Informatics Unit, Imperial Cancer Research Fund, London, UK.
J Mol Biol. 1995 Jun 2;249(2):342-59. doi: 10.1006/jmbi.1995.0301.
Gene families are often recognised by sequence homology, using similarity searching to find relationships; however, genomic sequence data provide gene architectural information not used by conventional search methods. In particular, intron positions and phases are expected to be relatively conserved features, because mis-splicing and reading frame shifts should be selected against. A fast search technique capable of detecting possible weak sequence homologies apparent at the intron/exon level of gene organization is presented for comparing spliceosomal genes and gene fragments. FINEX compares strings of exons delimited by intron/exon boundary positions and intron phases (exon fingerprints) using a global dynamic programming algorithm with a combined intron phase identity and exon size dissimilarity score. Exon fingerprints are typically two orders of magnitude smaller than their nucleic acid sequence counterparts, giving rise to fast search times: a ranked search against a library of 6755 fingerprints for a typical three-exon fingerprint completes in under 30 seconds on an ordinary workstation, while the worst case, the largest fingerprint of 52 exons, completes in just over one minute. The short "sequence" length of exon fingerprints in comparisons is compensated for by the large exon alphabet, composed of intron phase types and a wide range of exon sizes, the latter contributing the most information to alignments. FINEX performs better in some searches than conventional methods, finding matches with similar exon organization but low sequence homology. A search using human serum albumin finds all members of the multigene family in the FINEX database at the top of the search ranking, despite very low amino acid percentage identities between family members. The method should complement conventional sequence searching and alignment techniques, offering a means of identifying otherwise hard to detect homologies where genomic data are available.
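The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. It is not the FINEX implementation: the abstract does not give the scoring function, so the `Exon` record, the relative size term, the `phase_weight` and `gap` values, and the function names are all assumptions chosen for illustration. Each exon is represented by its size and the phase of the intron that follows it, two fingerprints are aligned globally (Needleman-Wunsch style) by minimising a combined size and phase dissimilarity, and a library is ranked by the resulting distance.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Exon:
    """One fingerprint element (hypothetical representation)."""
    size: int   # exon length in nucleotides, assumed > 0
    phase: int  # phase (0, 1 or 2) of the intron following the exon


def exon_cost(a: Exon, b: Exon, phase_weight: float = 1.0) -> float:
    """Assumed dissimilarity: relative size difference plus a phase mismatch penalty."""
    size_term = abs(a.size - b.size) / max(a.size, b.size)
    phase_term = 0.0 if a.phase == b.phase else phase_weight
    return size_term + phase_term


def global_distance(x: Sequence[Exon], y: Sequence[Exon], gap: float = 1.0) -> float:
    """Global dynamic programming alignment; lower values mean more similar fingerprints."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap  # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        d[0][j] = j * gap  # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + exon_cost(x[i - 1], y[j - 1]),  # align two exons
                d[i - 1][j] + gap,                                # exon of x unmatched
                d[i][j - 1] + gap,                                # exon of y unmatched
            )
    return d[n][m]


def ranked_search(
    query: Sequence[Exon], library: Dict[str, Sequence[Exon]]
) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs sorted from most to least similar."""
    hits = [(name, global_distance(query, fp)) for name, fp in library.items()]
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Invented toy fingerprints, not data from the paper.
    query = [Exon(120, 0), Exon(215, 1), Exon(98, 2)]
    library = {
        "similar": [Exon(118, 0), Exon(212, 1), Exon(101, 2)],
        "unrelated": [Exon(900, 2), Exon(45, 0)],
    }
    for name, dist in ranked_search(query, library):
        print(f"{name}\t{dist:.3f}")
```

Because a fingerprint holds only a handful of elements per gene, the quadratic alignment table stays tiny, which is consistent with the fast search times the abstract reports. The relative size term is one of several plausible choices; an absolute or logarithmic size difference would fit the same framework.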