Deem Kevin D, Brisson Jennifer A
Department of Biology, University of Rochester, Rochester, NY, 14627.
bioRxiv. 2025 May 13:2025.05.08.652899. doi: 10.1101/2025.05.08.652899.
Reliable genome annotation is crucial for analyses of gene function, conservation, and evolution. Factors such as the sequencing technology used to create the assembly and the amount of duplicated sequence within the genome of interest can have a large impact on the quality of gene annotations. In particular, short read-based assemblies tend to mis-assemble duplicated genes as single loci, a problem that requires additional long read sequencing to resolve. Pea aphids exhibit a high level of gene duplication, resulting in mis-assembly and mis-annotation of genes in the short read reference genome. Here, we re-annotate the pea aphid reference genome, along with two long read pea aphid genomes, to facilitate future analyses of gene duplication and function in pea aphids. We use an integrated approach, consolidating both ab initio and RNAseq-based annotations into unified gene models. The new annotations contain genes that were missing, mis-annotated, or mis-assembled in the reference, and are generally consistent across assemblies, showing very good agreement between the long read assemblies. Our annotation method is sensitive enough to refine existing gene models, uncovering alternatively used promoters and isoforms, and aids in finding gene duplications. These data provide a useful supplement to the existing reference annotations and a new comparative framework for discovery and analysis of gene function and duplication in this important emerging model insect.