Department of Statistics, Harvard University, Cambridge, MA, USA.
BMC Genomics. 2010 Feb 10;11 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2164-11-S1-S10.
Transposons are "jumping genes" that account for large quantities of repetitive content in genomes. They are known to affect transcriptional regulation in several different ways, and are implicated in many human diseases. Transposons are related to microRNAs and viruses, and many genes, pseudogenes, and gene promoters are derived from transposons or have origins in transposon-induced duplication. Modeling transposon-derived genomic content is difficult because these sequences are poorly conserved. Profile hidden Markov models (profile HMMs), widely used for protein sequence family modeling, are rarely used for modeling DNA sequence families. Baum-Welch, the algorithm commonly used to estimate the parameters of profile HMMs, is prone to converging prematurely to local optima. DNA is especially problematic for the Baum-Welch algorithm because its alphabet has only four letters, as opposed to the twenty residues of the amino acid alphabet.
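To make the local-optimum issue concrete, the following is a minimal sketch of standard Baum-Welch with random restarts. It assumes a small, fully connected discrete HMM over the DNA alphabet rather than a profile HMM, and it does not implement Conditional Baum-Welch or Dynamic Model Surgery. All function names and the toy data generator are hypothetical and chosen only for illustration.

```python
import math
import random

ALPHABET = "ACGT"  # the four-letter DNA alphabet

def random_dist(n, rng):
    """Random strictly positive probability vector of length n."""
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def init_hmm(n_states, rng):
    """Random start, transition, and emission parameters (illustrative only)."""
    start = random_dist(n_states, rng)
    trans = [random_dist(n_states, rng) for _ in range(n_states)]
    emit = [random_dist(len(ALPHABET), rng) for _ in range(n_states)]
    return start, trans, emit

def forward_backward(obs, start, trans, emit):
    """Scaled forward-backward pass; returns alpha, beta, scale, log-likelihood."""
    n, T = len(start), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [start[i] * emit[i][obs[0]] for i in range(n)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        for j in range(n):
            reach = sum(alpha[t - 1][i] * trans[i][j] for i in range(n))
            alpha[t][j] = reach * emit[j][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta[T - 1] = [1.0] * n
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def baum_welch_step(seqs, start, trans, emit):
    """One EM update; also returns the log-likelihood of the input parameters."""
    n, m = len(start), len(ALPHABET)
    s_acc = [0.0] * n
    t_acc = [[0.0] * n for _ in range(n)]
    e_acc = [[0.0] * m for _ in range(n)]
    total_ll = 0.0
    for obs in seqs:
        alpha, beta, scale, ll = forward_backward(obs, start, trans, emit)
        total_ll += ll
        for t in range(len(obs)):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i]  # posterior of state i at time t
                if t == 0:
                    s_acc[i] += gamma
                e_acc[i][obs[t]] += gamma
                if t + 1 < len(obs):
                    for j in range(n):
                        t_acc[i][j] += (alpha[t][i] * trans[i][j]
                                        * emit[j][obs[t + 1]] * beta[t + 1][j]
                                        / scale[t + 1])
    def normalize(row):
        total = sum(row)
        return [x / total for x in row]
    return (normalize(s_acc), [normalize(r) for r in t_acc],
            [normalize(r) for r in e_acc], total_ll)

def fit(seqs, n_states, seed, n_iter=30):
    """Run Baum-Welch from one random start; history holds per-iteration log-likelihoods."""
    params = init_hmm(n_states, random.Random(seed))
    history = []
    for _ in range(n_iter):
        *params, ll = baum_welch_step(seqs, *params)
        history.append(ll)
    return params, history

def toy_sequences(rng, n_seqs=8, length=40):
    """Hypothetical toy data: an AT-rich half followed by a GC-rich half."""
    seqs = []
    for _ in range(n_seqs):
        half = length // 2
        s = ([rng.choice("AAATTTGC") for _ in range(half)]
             + [rng.choice("GGGCCCAT") for _ in range(length - half)])
        seqs.append([ALPHABET.index(c) for c in s])
    return seqs

if __name__ == "__main__":
    data = toy_sequences(random.Random(1))
    for seed in range(5):
        _, history = fit(data, n_states=2, seed=seed)
        print(f"restart {seed}: final log-likelihood {history[-1]:.3f}")
```

Each restart is guaranteed only to not decrease the likelihood from its own starting point, so different seeds may settle at different final values. Keeping the best of several restarts is a common generic workaround; it is distinct from the two methods evaluated in this paper.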
We demonstrate with a simulation study and with an application to modeling the MIR family of transposons that two recently introduced methods, Conditional Baum-Welch and Dynamic Model Surgery, achieve better estimates of the parameters of profile HMMs across a range of conditions.
We argue that these new algorithms expand the range of potential applications of profile HMMs to many important DNA sequence family modeling problems, including that of searching for and modeling the virus-like transposons that are found in all known genomes.