Zheng Chunfang, Santos Muñoz Daniella, Albert Victor A, Sankoff David
BMC Genomics. 2015;16 Suppl 10(Suppl 10):S8. doi: 10.1186/1471-2164-16-S10-S8. Epub 2015 Oct 2.
Following whole genome duplication (WGD), there is a compact distribution of gene similarities within the genome reflecting duplicate pairs of all the genes in the genome. With time, the distribution broadens and loses volume due to variable decay of duplicate gene similarity and to the process of duplicate gene loss. If there are two WGD, the older one becomes so reduced and broad that it merges with the tail of the distributions resulting from more recent events, and it becomes difficult to distinguish them. The goal of this paper is to advance statistical methods of identifying, or at least counting, the WGD events in the lineage of a given genome.
For a set of 15 angiosperm genomes, we analyze all 15 × 14 = 210 ordered pairs of target genome versus reference genome, using SynMap to find syntenic blocks. We consider all sets of B ≥ 2 syntenic blocks in the target genome that overlap in the reference genome as evidence of WGD activity in the target, whether it be one event or several. We hypothesize that in fitting an exponential function to the tail of the empirical distribution f(B) of block multiplicities, the size of the exponent will reflect the amount of WGD in the history of the target genome.
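The tail fit described above can be illustrated with a minimal sketch. It assumes the input is a plain list of multiplicities B, one per set of overlapping target blocks, and it estimates the exponent by unweighted least squares on log counts. Both the input format and the estimator are assumptions made for illustration; the abstract does not specify the fitting procedure, and the function name `fit_tail_exponent` and its `cutoff` parameter are hypothetical.

```python
import math
from collections import Counter


def fit_tail_exponent(multiplicities, cutoff=2):
    """Fit f(B) ~ a * exp(k * B) to the tail B >= cutoff.

    Returns (k, log_a). Assumption: an unweighted least-squares line
    through the points (B, log f(B)); this is an illustrative choice,
    not necessarily the estimator used in the paper.
    """
    # Empirical distribution f(B): number of overlapping sets with B blocks.
    counts = Counter(b for b in multiplicities if b >= cutoff)
    xs = sorted(counts)
    if len(xs) < 2:
        raise ValueError("need at least two distinct multiplicities in the tail")
    ys = [math.log(counts[b]) for b in xs]

    # Ordinary least squares for y = log_a + k * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    log_a = mean_y - k * mean_x
    return k, log_a


if __name__ == "__main__":
    # Synthetic (not real) data: counts halve with each unit increase in B.
    demo = [2] * 80 + [3] * 40 + [4] * 20 + [5] * 10
    k, log_a = fit_tail_exponent(demo, cutoff=2)
    print(f"exponent k = {k:.4f}")
```

Raising `cutoff` restricts the fit to a shorter tail, which corresponds to the alternative cutoff points mentioned in the abstract. The fitted exponent is negative whenever counts decrease with B.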
By amalgamating the results from all reference genomes, a range of values of SynMap parameters, and alternative cutoff points for the tail, we find a clear pattern whereby multiple-WGD core eudicots have the smallest (negative) exponents, followed by core eudicots with only the single "γ" triplication in their history, followed by a non-core eudicot with a single WGD, followed by the monocots, with a basal angiosperm, the WGD-free Amborella, having the largest exponent.
The hypothesis that the exponent of the fit to the tail of the multiplicity distribution is a signature of the amount of WGD is verified, but there is also a clear complicating factor in the monocot clade, where a history of multiple WGD is not reflected in a small exponent.