Department of Mathematics and Statistics, University of Ottawa, 150 Louis Pasteur pvt, Ottawa, K1N 6N5, Canada.
BMC Bioinformatics. 2019 Dec 17;20(Suppl 20):635. doi: 10.1186/s12859-019-3202-x.
A basic tool for studying the polyploidization history of a genome, especially in plants, is the distribution of duplicate gene similarities in syntenically aligned regions of a genome. This distribution can usually be decomposed into two or more components identifiable by peaks, or local maxima, each representing a different polyploidization event. The distributions may be generated by means of a discrete-time branching process, followed by a sequence divergence model. Both the branching process and the inference of fractionation rates based on it require knowledge of the ploidy level of each event, which cannot be directly inferred from the distribution of pairwise similarities.
For a sequence of two events of unknown ploidy, each either tetraploid, giving rise to whole genome doubling (WGD), or hexaploid, giving rise to whole genome tripling (WGT), we base our analysis on triples of similar genes. We calculate the probability of each of the four triplet types, with origins in one or the other event, or both, and impose a mutational model so that the distribution resembles the original data. Using a maximum likelihood (ML) transition point in the similarities between the two events as a discriminator for the hypothesized origin of each similarity, we calculate the predicted number of triplets of each type for each model combining WGT and/or WGD. This yields a predicted profile of triplet types for each model. We compare the observed and predicted triplet profiles for each model to confirm the polyploidization history of durian, poplar and cabbage.
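The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the encoding of the four triplet types as the number of similarities (0 to 3) falling on the recent-event side of the transition point, and the use of total variation distance to compare profiles are all assumptions made for illustration.

```python
from collections import Counter


def triplet_type(sims, transition):
    """Return how many of the three pairwise similarities in a gene triple
    lie at or above the transition point (assumed here to mean they are
    attributed to the more recent event). The result, 0..3, is a
    hypothetical encoding of the four triplet types."""
    if len(sims) != 3:
        raise ValueError("a triple must have exactly three pairwise similarities")
    return sum(s >= transition for s in sims)


def triplet_profile(triples, transition):
    """Proportion of triples of each type (index 0..3)."""
    if not triples:
        raise ValueError("no triples supplied")
    counts = Counter(triplet_type(t, transition) for t in triples)
    return [counts.get(k, 0) / len(triples) for k in range(4)]


def profile_distance(observed, predicted):
    """Total variation distance between two profiles (an assumed choice of
    comparison measure, not taken from the source)."""
    return 0.5 * sum(abs(o - p) for o, p in zip(observed, predicted))
```

Under these assumptions, the observed profile would be computed once from the data and compared against the predicted profile of each candidate model (WGD then WGD, WGD then WGT, and so on); the model with the smallest distance would be the one best supported by the triplet data.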
We have developed a way of inferring the ploidy of up to three successive WGD and/or WGT events by estimating the time of origin of each of the similarities in triples of genes. This may be generalized to a larger number of events and to higher ploidies.