Department of Plant Biology, Michigan State University, East Lansing, MI, 48824, USA.
DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI, 48824, USA.
BMC Genomics. 2021 Feb 2;22(1):99. doi: 10.1186/s12864-021-07397-5.
The availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies and show highly uneven read coverage, indicative of sequencing and assembly issues that could significantly affect downstream analyses of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read based assembly had significantly higher and lower coverage than background, respectively.
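As an illustration of how regions of a short-read assembly might be partitioned into high coverage, low coverage, and background classes, the following is a minimal sketch. It assumes per-window mean read depths are already available and uses simple fold-change cutoffs against the median depth. The function names and the cutoffs (`high_factor`, `low_factor`) are hypothetical choices for this example; the statistical criterion actually used in the study is not described in this abstract.

```python
from statistics import median


def classify_windows(depths, high_factor=2.0, low_factor=0.5):
    """Label each window 'high', 'low', or 'background'.

    Assumption (not from the paper): a window is 'high' if its depth exceeds
    high_factor times the median depth, and 'low' if it falls below
    low_factor times the median depth.
    """
    base = median(depths)
    labels = []
    for depth in depths:
        if depth > high_factor * base:
            labels.append("high")
        elif depth < low_factor * base:
            labels.append("low")
        else:
            labels.append("background")
    return labels


def fraction_by_label(labels):
    """Return the fraction of windows carrying each label."""
    total = len(labels)
    return {
        name: labels.count(name) / total
        for name in ("high", "low", "background")
    }


# Toy per-window mean depths (made-up numbers, for illustration only).
depths = [30, 31, 29, 95, 28, 4, 30, 32, 33, 30]
labels = classify_windows(depths)
fractions = fraction_by_label(labels)
```

With equal-sized windows, multiplying each fraction by the assembly length would give the kind of megabase totals reported above.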
To understand the causes of such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities than background regions. To determine whether the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled duplicate sequences, compared to 1.4% of background regions. In addition, using a predictive model that can distinguish correctly assembled from incorrectly assembled high coverage regions, we found that misassembled high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposable elements.
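Simple sequence repeat density is one of the features named above. A minimal sketch of how such a feature could be computed for a single region is shown here. It assumes a simple sequence repeat is a motif of 1 to 6 bp repeated at least four times in tandem; this definition, the regular expression, and the function name `ssr_density` are assumptions made for illustration and are not taken from the study.

```python
import re

# Assumed definition: a 1-6 bp motif followed by at least three more
# tandem copies of itself (four or more copies in total).
SSR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{3,}")


def ssr_density(seq):
    """Return the fraction of bases in seq covered by simple sequence repeats."""
    seq = seq.upper()
    if not seq:
        return 0.0
    covered = sum(m.end() - m.start() for m in SSR_PATTERN.finditer(seq))
    return covered / len(seq)


# Toy region: five tandem copies of "AT" followed by 10 bp of non-repetitive sequence.
example_density = ssr_density("ATATATATATGGCCTTAGCA")
```

A value computed this way for each region, alongside features such as tandem gene density, could then serve as input to a classifier of the kind described above.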
Our study provides insights into the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads. The generality of these causes and factors should be tested further in other species.