Institute of Experimental Botany, Academy of Sciences of the Czech Republic, Prague, Czech Republic; Department of Experimental Plant Biology, Faculty of Science, Charles University in Prague, Prague, Czech Republic.
Department of Experimental Plant Biology, Faculty of Science, Charles University in Prague, Prague, Czech Republic.
PLoS One. 2014 Apr 11;9(4):e94077. doi: 10.1371/journal.pone.0094077. eCollection 2014.
Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be explained by a hidden gene duplication. Re-sequencing and manual assembly of the Arabidopsis thaliana SEC10 (At5g12370) locus revealed that this locus, comprising a single gene in the reference genome assembly, indeed contains two paralogous genes in tandem, SEC10a and SEC10b, and that a sequence segment of 7 kb in length is missing from the reference genome sequence. Differences between the two paralogs are concentrated in non-coding regions, while the predicted protein sequences exhibit 99% identity, differing only by substitution of five amino acid residues and an indel of four residues. Both SEC10 genes are expressed, although varying transcript levels suggest differential regulation. Homozygous T-DNA insertion mutants in either paralog exhibit a wild-type phenotype, consistent with proposed extensive functional redundancy of the two genes. By these observations we demonstrate that recently duplicated genes may remain hidden even in well-characterized genomes, such as that of A. thaliana. Moreover, we show that the use of the existing A. thaliana reference genome sequence as a guide for sequence assembly of new Arabidopsis accessions or related species has at least in some cases led to error propagation.
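The reported protein similarity (99% identity, five substitutions, one four-residue indel) is the kind of figure that can be derived from a pairwise alignment. The following is a minimal sketch of that arithmetic, not the authors' actual procedure, which the abstract does not describe. The function names, the toy sequences, and the convention of counting gap columns as mismatches in the denominator are all assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over all alignment columns.

    Assumption: gap columns count as mismatches, so the denominator is the
    full alignment length. Other conventions (e.g. ignoring gap columns)
    give slightly different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


def count_differences(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (substitutions, gap_columns) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0
    gap_columns = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            gap_columns += 1      # column belongs to an indel
        elif x != y:
            substitutions += 1    # residues differ
    return substitutions, gap_columns


# Hypothetical toy alignment, NOT the real SEC10a/SEC10b sequences.
toy_a = "MKTLLVAGHE"
toy_b = "MKTLIVA--E"

identity = percent_identity(toy_a, toy_b)
subs, gaps = count_differences(toy_a, toy_b)
print(f"identity={identity:.1f}% substitutions={subs} gap_columns={gaps}")
```

Applied to a real alignment of the two predicted proteins, the same counts would correspond to the substitutions and indel residues reported in the abstract.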