School of Science, Southwest University of Science and Technology, 59 Qinglong Road, Mianyang, Sichuan Province, 621010, People's Republic of China.
Physics and Biology Unit, Okinawa Institute of Science and Technology Graduate University, 1919-1 Tancha, Onna-son, Kunigami-gun, Okinawa, 904-0495, Japan.
BMC Bioinformatics. 2020 Feb 6;21(1):48. doi: 10.1186/s12859-020-3384-2.
The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes do not code for proteins, which creates a need to infer the evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-centric, as opposed to gene-centric, modes of inferring paths of sequence evolution are increasingly relevant. Customarily, homologous sequences derived from the same direct ancestor, whose ancestral position in two genomes is usually conserved, are termed "primary" (or "positional") orthologs. Methods based solely on similarity do not reliably distinguish primary orthologs from other homologs; for this, genomic context is often essential. Context-dependent identification of orthologs traditionally relies on genomic context over length scales characteristic of conserved gene order or whole-genome sequence alignment, and can be computationally intensive.
We demonstrate that short-range sequence context, as short as a single "maximal" match, distinguishes primary orthologs from other homologs across whole genomes. On mammalian whole genomes not preprocessed by RepeatMasker, potential orthologs are extracted by genome intersection as "non-nested maximal matches": maximal matches that are not nested within other maximal matches. On both nucleotide and gene scales, non-nested maximal matches recapitulate primary or positional orthologs with high precision and high recall, while the corresponding computation takes less than one thirtieth of the time required by commonly applied whole-genome alignment methods. In regions of genomes that would be masked by RepeatMasker, non-nested maximal matches recover orthologs that are inaccessible to LASTZ net alignment, for which repeat-masking is a prerequisite. mmRBHs, reciprocal best hits of genes containing non-nested maximal matches, yield novel putative orthologs, e.g. around 1000 pairs of genes for human-chimpanzee.
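To make the notion of a non-nested maximal match concrete, the following is a minimal Python sketch on toy strings. It is not the authors' implementation. It assumes that a maximal match is an exact match that cannot be extended at either end, and that a match counts as "nested" when its interval in either sequence lies inside the interval of a strictly longer maximal match; both readings, and the function names `maximal_matches` and `non_nested`, are assumptions made for illustration. The brute-force search is quadratic and ignores reverse-complement matches, so it is only suitable for short examples.

```python
from typing import List, Tuple

# (start in a, start in b, length); a hypothetical representation
Match = Tuple[int, int, int]


def maximal_matches(a: str, b: str) -> List[Match]:
    """Brute-force enumeration of exact matches that cannot be extended."""
    matches: List[Match] = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip positions that could be extended to the left:
            # they belong to a match that starts earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            matches.append((i, j, k))
    return matches


def _inside(start: int, length: int, o_start: int, o_length: int) -> bool:
    """True if [start, start+length) lies within a strictly longer interval."""
    return (
        length < o_length
        and o_start <= start
        and start + length <= o_start + o_length
    )


def non_nested(matches: List[Match]) -> List[Match]:
    """Keep matches whose interval is not nested in another match.

    Assumption: nesting is checked separately in each sequence, and
    nesting in either one is enough to discard the match.
    """
    kept: List[Match] = []
    for i, j, k in matches:
        nested = any(
            _inside(i, k, oi, ok) or _inside(j, k, oj, ok)
            for oi, oj, ok in matches
        )
        if not nested:
            kept.append((i, j, k))
    return kept


if __name__ == "__main__":
    a = "TTACGTAA"
    b = "GGACGTCC"
    print(non_nested(maximal_matches(a, b)))
```

On this toy pair, the shared core `ACGT` is the only match that survives the nesting filter; the single-base matches in the flanks are discarded because each overlaps the core in one of the two sequences. No length threshold is needed, which is in keeping with the parameter-free character described in the abstract.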
We describe an intersection-based method that requires neither repeat-masking nor alignment to infer the evolutionary history of sequences from short-range genomic sequence context. Ortholog identification based on non-nested maximal matches is parameter-free and less computationally intensive than many alignment-based methods. It is especially suitable for genome-wide identification of orthologs, and may be applicable to unassembled genomes. We are agnostic as to the reasons for its effectiveness, which may reflect local variation of mean mutation rate.
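The reciprocal-best-hit step behind mmRBHs can be illustrated with a short Python sketch. The abstract does not say how gene pairs are scored, so the score used here (for instance, the number of bases covered by non-nested maximal matches shared by two genes) is an assumption, as are the gene names and the function name `reciprocal_best_hits`. Ties are broken by first occurrence, which is also an arbitrary choice.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(
    scores: Dict[Tuple[str, str], float]
) -> List[Tuple[str, str]]:
    """Return gene pairs that are each other's highest-scoring partner.

    `scores` maps (gene in genome A, gene in genome B) to a similarity
    score; how that score is derived is left open (hypothetical here).
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (gene_a, gene_b), score in scores.items():
        if gene_a not in best_for_a or score > best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, score)
        if gene_b not in best_for_b or score > best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, score)
    # A pair is reciprocal when each gene's best partner is the other.
    return sorted(
        (gene_a, gene_b)
        for gene_a, (gene_b, _) in best_for_a.items()
        if best_for_b[gene_b][0] == gene_a
    )


if __name__ == "__main__":
    # Hypothetical genes and scores, for illustration only.
    toy_scores = {
        ("A1", "B1"): 120,
        ("A1", "B2"): 40,
        ("A2", "B2"): 75,
        ("A3", "B2"): 90,
    }
    print(reciprocal_best_hits(toy_scores))
```

In the toy input, `A2` prefers `B2`, but `B2` scores higher with `A3`, so only the pairs (`A1`, `B1`) and (`A3`, `B2`) are reported.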