Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada K1N 6N5.
BMC Bioinformatics. 2011 Oct 5;12 Suppl 9(Suppl 9):S5. doi: 10.1186/1471-2105-12-S9-S5.
Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD), is a pervasive process. Whether this loss proceeds gene by gene or through deletion of multi-gene DNA segments is controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more vulnerable to gene deletion than the other.
As a null hypothesis, we first assume deletion events, on one homeolog only, excise a geometrically distributed number of genes with unknown mean µ, and these events combine to produce deleted runs of length l, distributed approximately as a negative binomial with unknown parameter r, itself a random variable with distribution π(·). A more realistic model requires deletion events on both homeologs distributed as a truncated geometric. We simulate the distribution of run lengths l in both models, as well as the underlying π(r), as a function of µ, and show how sampling l allows us to estimate µ. We apply this to data on a total of 15 genomes descended from 6 distinct WGD events and show how to correct the bias towards shorter runs caused by genome rearrangements. Because of the difficulty in deriving π(·) analytically, we develop a deterministic recurrence to calculate each π(r) as a function of µ and the proportion of unreduced paralog pairs.
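The null model described above can be illustrated with a minimal simulation sketch. It is not the authors' implementation. It assumes a linear chromosome of `n_genes` positions, deletion events that start at a uniformly random position, event lengths drawn from a geometric distribution on {1, 2, ...} with mean µ, and a simplified overlap rule in which positions already deleted are skipped without extending the event. The function names and the `target_deleted_fraction` stopping rule are hypothetical and chosen only for illustration.

```python
import random


def geometric(mean, rng):
    """Draw from a geometric distribution on {1, 2, ...} with the given mean (>= 1)."""
    p = 1.0 / mean
    k = 1
    while rng.random() > p:
        k += 1
    return k


def simulate_deleted_runs(n_genes, mu, target_deleted_fraction, seed=0):
    """Simulate deletions on a single homeolog and return the deleted run lengths.

    Assumption (illustrative only): an event that overlaps genes already
    deleted removes only the genes still present inside its span.
    """
    rng = random.Random(seed)
    deleted = [False] * n_genes
    n_deleted = 0
    target = int(target_deleted_fraction * n_genes)

    # Apply deletion events until the target proportion of genes is lost.
    while n_deleted < target:
        start = rng.randrange(n_genes)
        length = geometric(mu, rng)
        for i in range(start, min(start + length, n_genes)):
            if not deleted[i]:
                deleted[i] = True
                n_deleted += 1

    # Collect the lengths l of maximal runs of consecutive deleted genes.
    runs = []
    current = 0
    for is_deleted in deleted:
        if is_deleted:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

Calling `simulate_deleted_runs(1000, 2.0, 0.5)` returns a list of run lengths l. Repeating the call over a grid of µ values gives simulated run-length distributions that can be compared with observed runs, which is the general idea behind estimating µ from sampled l.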
The parameter µ can be estimated based on run lengths of single-copy regions. Estimates of µ in real data do not exclude the possibility that duplicate gene deletion is largely gene by gene, although it may sometimes involve longer segments.