Department of Microbiology and Immunology, Western University, London, ON, Canada.
Department of Pathology and Laboratory Medicine, Western University, London, ON, Canada.
PLoS Pathog. 2022 Feb 24;18(2):e1010331. doi: 10.1371/journal.ppat.1010331. eCollection 2022 Feb.
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) in 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, OvRFs tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, which are predominantly found in dsDNA viruses. Antisense overlaps in which one ORF is encoded in the same frame on the opposite strand (-0) tend to be longer. Next, we developed a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free, k-mer-based approach to cluster protein-coding sequences by similarity. We connected these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs provide not only a natural visualization scheme but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.
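As a minimal sketch of how the length and frameshift of an overlap could be derived from ORF coordinates, the Python example below enumerates overlapping ORF pairs in a single genome. It is an illustration under stated assumptions, not the authors' pipeline: the `ORF` record, the helper names, and the toy coordinates are hypothetical, and the frameshift labelling (sign for same or opposite strand, digit for the codon-boundary offset modulo 3) is one plausible convention that may differ in detail from the one used in the study.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ORF:
    """Hypothetical ORF record; coordinates are on the forward strand."""
    name: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: int  # +1 (forward) or -1 (reverse)


def overlap_length(a: ORF, b: ORF) -> int:
    """Number of nucleotides shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def frame_anchor(orf: ORF) -> int:
    """Forward-strand coordinate of a codon boundary of this ORF.

    A forward ORF starts reading at `start`; a reverse ORF starts
    reading at `end` and proceeds toward lower coordinates.
    """
    return orf.start if orf.strand == 1 else orf.end


def frameshift(a: ORF, b: ORF) -> str:
    """Label an overlap as '+0', '+1', '+2', '-0', '-1' or '-2'.

    Assumed convention: the sign is '+' when both ORFs are on the same
    strand and '-' otherwise; the digit is the offset between codon
    boundaries modulo 3, measured from the upstream ORF.
    """
    first, second = sorted((a, b), key=lambda o: (o.start, o.end))
    shift = (frame_anchor(second) - frame_anchor(first)) % 3
    sign = "+" if first.strand == second.strand else "-"
    return f"{sign}{shift}"


def find_overlaps(orfs):
    """Return (name_a, name_b, overlap_length, frameshift) for each overlapping pair."""
    ordered = sorted(orfs, key=lambda o: (o.start, o.end))
    results = []
    for a, b in combinations(ordered, 2):
        n = overlap_length(a, b)
        if n > 0:
            results.append((a.name, b.name, n, frameshift(a, b)))
    return results


# Toy genome with made-up coordinates (not taken from any real record).
toy_orfs = [
    ORF("A", 0, 300, +1),
    ORF("B", 290, 590, +1),
    ORF("C", 600, 900, +1),
    ORF("D", 440, 590, -1),
]

for record in find_overlaps(toy_orfs):
    print(record)
# ('A', 'B', 10, '+2')
# ('B', 'D', 150, '-0')
```

In this toy example, A and B share 10 nucleotides on the same strand with codon boundaries offset by two positions (+2), while D lies on the opposite strand with codon boundaries aligned to those of B (-0). The pairwise loop is quadratic in the number of ORFs, which is adequate for a sketch; a sweep over sorted coordinates would scale better for large genomes.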