Pucker Boas, Holtgräwe Daniela, Weisshaar Bernd
Faculty of Biology & Center for Biotechnology, Bielefeld University, Bielefeld, Germany.
BMC Res Notes. 2017 Dec 4;10(1):667. doi: 10.1186/s13104-017-2985-y.
The Arabidopsis thaliana Niederzenz-1 (Nd-1) genome sequence was recently published with an ab initio gene prediction. In-depth analysis of the predicted gene set revealed errors in genes whose introns have non-canonical splice sites. Since non-canonical splice sites are difficult to predict ab initio, we examined options for improving the annotation by transferring information from Araport11, the recently released annotation of the Columbia-0 (Col-0) reference genome sequence.
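To illustrate what distinguishes a non-canonical splice site, the following minimal sketch classifies the introns of a plus-strand gene model by their terminal dinucleotides. It assumes that GT-AG is the only canonical donor/acceptor pair and that exon coordinates are 0-based and end-exclusive; the function names and the toy sequences are hypothetical and are not part of the authors' pipeline.

```python
def introns_from_exons(exons):
    """Return intron intervals lying between consecutive exons.

    exons: list of (start, end) tuples, 0-based, end-exclusive, plus strand.
    """
    ordered = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


def classify_intron(sequence, start, end):
    """Label an intron by its first and last two bases.

    Assumption: only GT...AG counts as canonical.
    """
    intron = sequence[start:end].upper()
    donor, acceptor = intron[:2], intron[-2:]
    return "canonical" if (donor, acceptor) == ("GT", "AG") else "non-canonical"


def classify_gene(sequence, exons):
    """Classify every intron implied by a list of exons."""
    return [classify_intron(sequence, s, e) for s, e in introns_from_exons(exons)]


if __name__ == "__main__":
    toy_exons = [(0, 3), (10, 13)]                    # hypothetical gene model
    print(classify_gene("AAAGTCCCAGTTT", toy_exons))  # intron GTCCCAG
    print(classify_gene("AAAGCCCCAGTTT", toy_exons))  # intron GCCCCAG
```

Applied to a whole annotation, a check of this kind would flag the gene models whose introns deviate from GT-AG; minus-strand genes would additionally require reverse-complementing the intron sequence.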
Incorporating hints generated from Araport11 enabled the precise prediction of non-canonical splice sites. Manual inspection of RNA-Seq read mappings and RT-PCR were used to validate the structural annotations of non-canonical splice sites. Predictions of untranslated regions were also updated using Araport11, which was generated from high-coverage RNA-Seq data. The improved gene set of the Nd-1 genome assembly (GeneSet_Nd-1_v1.1) was evaluated by comparison with the initial gene prediction (GeneSet_Nd-1_v1.0) and with Araport11 for the Col-0 reference genome sequence. GeneSet_Nd-1_v1.1 contains previously missed non-canonical splice sites in 1256 genes. Reciprocal best hits against GeneSet_Nd-1_v1.1 were found for 24,527 (89.4%) of all nuclear Col-0 genes, indicating a high gene prediction quality.
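The reciprocal best hit criterion used for this evaluation can be sketched as follows. This is a minimal illustration, assuming that similarity search results are already available as (query, subject, score) tuples in both directions; the function names, gene identifiers, and scores are hypothetical and do not reproduce the authors' implementation.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each gene is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    col0_vs_nd1 = [("Col_g1", "Nd_g1", 200.0),
                   ("Col_g1", "Nd_g2", 150.0),
                   ("Col_g2", "Nd_g2", 90.0)]
    nd1_vs_col0 = [("Nd_g1", "Col_g1", 198.0),
                   ("Nd_g2", "Col_g1", 140.0),
                   ("Nd_g2", "Col_g2", 80.0)]
    print(reciprocal_best_hits(col0_vs_nd1, nd1_vs_col0))
```

In this toy input only the pair (Col_g1, Nd_g1) qualifies, because the best hit of Nd_g2 is Col_g1 rather than Col_g2. The share of reference genes with such a pair serves as a rough proxy for the completeness of a predicted gene set.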