Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada.
Genome Res. 2012 Aug;22(8):1567-80. doi: 10.1101/gr.134601.111. Epub 2012 Jul 6.
Curation of a high-quality gene set is the critical first step in genome research, enabling subsequent analyses such as ortholog assignment, cis-regulatory element finding, and synteny detection. In this project, we have reannotated the genome of Caenorhabditis briggsae, the best-studied sister species of the model organism Caenorhabditis elegans. First, we applied the homology-based gene predictor genBlastG to annotate the C. briggsae genome. We then validated and further improved the annotation through RNA-seq analysis of the C. briggsae transcriptome, which resulted in the first validated C. briggsae gene set (23,159 genes), among which 7347 genes (33.9% of all genes with introns) have all of their introns confirmed. Most genes (14,812, or 68.3%) have at least one intron validated, compared with only 3.9% in the most recent WormBase release (WS228). Of all introns in the revised gene set (103,083), 61,503 (60.1%) have been confirmed. We have also identified numerous trans-splicing leaders (SL1 and SL2 variants) in C. briggsae, leading to the first genome-wide annotation of operons in C. briggsae (1105 operons). The majority of the annotated operons (564, or 51.0%) are perfectly conserved in C. elegans, and an additional 345 operons (31.2%) are somewhat divergent. RNA-seq analysis further revealed more than 10,000 small assembly errors in the current C. briggsae reference genome that can be readily corrected. The revised C. briggsae genome annotation represents a solid platform for comparative genomics analysis and evolutionary studies of Caenorhabditis species.
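As a quick arithmetic check on the operon conservation figures quoted above, the following minimal sketch recomputes the two percentages from the stated counts. The constant and function names are illustrative only and are not taken from the study's own pipeline; the "remaining" value is a plain subtraction, and the abstract does not say how those operons are classified.

```python
# Counts quoted in the abstract; the names here are illustrative, not from the paper.
TOTAL_OPERONS = 1105
PERFECTLY_CONSERVED = 564
SOMEWHAT_DIVERGENT = 345


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


conserved_pct = percent(PERFECTLY_CONSERVED, TOTAL_OPERONS)
divergent_pct = percent(SOMEWHAT_DIVERGENT, TOTAL_OPERONS)

# Operons in neither quoted category (simple subtraction; the abstract
# does not describe this group).
remaining = TOTAL_OPERONS - PERFECTLY_CONSERVED - SOMEWHAT_DIVERGENT

print(f"perfectly conserved: {conserved_pct}%")
print(f"somewhat divergent:  {divergent_pct}%")
print(f"remaining operons:   {remaining}")
```

Both recomputed percentages agree with the values reported in the abstract (51.0% and 31.2%).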