Hou Minmei, Berman Piotr, Hsu Chih-Hao, Harris Robert S
Department of Computer Science & Engineering, Penn State University, PA, USA.
Bioinformatics. 2007 Apr 15;23(8):917-25. doi: 10.1093/bioinformatics/btm048. Epub 2007 Feb 18.
Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a major evolutionary mechanism for acquiring new functions. Several tools are available for de novo identification of repeat sequences, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach that identifies and clusters homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or the part of a gene that encodes a protein domain.
We developed a program called HomologMiner that identifies homologous groups in genome sequences in which low-complexity repeats and annotated interspersed repeats have been properly marked. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). The groups obtained include gene families (e.g. the olfactory receptor gene family and zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion for removing moderately frequent tandem repeats, and new algorithmic techniques. We also provide a preliminary analysis of the output on the three genomes mentioned above, and show several applications, including identifying the boundaries of tandem gene clusters and finding novel interspersed repeat families.
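To make the idea of a homologous group concrete, the following is a minimal sketch of one generic way to group genomic segments from pairwise alignments, using union-find. It is not HomologMiner's algorithm: the abstract does not give the program's definition of consistent duplicate units or its boundary-determination step, and neither is reproduced here. The function names, the (chrom, start, end) tuple layout and the `min_frac` overlap threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def overlaps(a, b, min_frac=0.5):
    """True if segments a and b, each (chrom, start, end), lie on the same
    chromosome and share at least min_frac of the shorter segment's length.
    The threshold is an illustrative assumption, not a value from the paper."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    shorter = min(a[2] - a[1], b[2] - b[1])
    return shared > 0 and shared >= min_frac * shorter

def cluster_segments(alignments, min_frac=0.5):
    """Group segments by transitive closure over pairwise alignments.

    alignments: iterable of (segment, segment) pairs, one per local alignment.
    Returns a list of groups, each a sorted list of segments.
    """
    segments, parent = [], []

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # The two sides of each alignment belong to the same group.
    for left, right in alignments:
        first = len(segments)
        for seg in (left, right):
            segments.append(seg)
            parent.append(len(parent))
        union(first, first + 1)

    # Segments that overlap substantially are treated as the same unit.
    # Quadratic scan for clarity; a whole genome would need an interval index.
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if overlaps(segments[i], segments[j], min_frac):
                union(i, j)

    groups = defaultdict(set)
    for i, seg in enumerate(segments):
        groups[find(i)].add(seg)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

Because this sketch merges transitively, a single spurious overlap can chain unrelated segments into one group, and it leaves member boundaries exactly as the input alignments report them.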
All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.