Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA, 50010, USA.
Center for Metabolic Biology, Iowa State University, Ames, IA, 50011, USA.
BMC Bioinformatics. 2019 Aug 27;20(1):440. doi: 10.1186/s12859-019-3023-y.
With every new genome that is sequenced, thousands of species-specific genes (orphans) are found, some originating from ultra-rapid mutations of existing genes, many others originating de novo from non-genic regions of the genome. If some of these genes survive across speciations, then extant organisms will contain a patchwork of genes whose ancestors first appeared at different times. Standard phylostratigraphy, the technique of partitioning genes by their age, is based solely on protein similarity algorithms. However, this approach relies on negative evidence: a failure to detect a homolog of a query gene. An alternative approach is to limit the search for homologs to syntenic regions. Genes can then be positively identified as de novo orphans by tracing them to non-coding sequences in related species.
We have developed fagin, a synteny-based pipeline in the R framework. For each query gene in a focal species, fagin determines its genomic context relative to the homologous sequence in each target species. We tested the fagin pipeline on two focal species, Arabidopsis thaliana (plus four target species in Brassicaceae) and Saccharomyces cerevisiae (plus six target species in Saccharomyces). Using microsynteny maps, fagin classified the homology relationship of each query gene against each target genome into three main classes, with further subclasses: AAic (has a coding syntenic homolog), NTic (has a non-coding syntenic homolog), and Unknown (has no detected syntenic homolog). For over half of the "Unknown" A. thaliana query genes, and about 20% of those in S. cerevisiae, fagin inferred that the syntenic homolog was missing because of local indels or scrambled synteny.
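The three main classes can be illustrated with a minimal Python sketch. This is not fagin's implementation (fagin is an R pipeline); the `SyntenicEvidence` record, its two boolean fields, and the rule that a coding match takes precedence over a non-coding match are assumptions made here for illustration, and the subclasses are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntenicEvidence:
    """Hypothetical summary of one query gene's syntenic interval in one target genome."""
    protein_match: bool     # query protein matches a coding sequence in the interval
    nucleotide_match: bool  # query sequence matches DNA anywhere in the interval


def classify(evidence):
    """Assign one of the three main classes named in the abstract.

    Assumption: coding evidence is checked first, then non-coding evidence.
    """
    if evidence.protein_match:
        return "AAic"     # has a coding syntenic homolog
    if evidence.nucleotide_match:
        return "NTic"     # has a non-coding syntenic homolog
    return "Unknown"      # no detected syntenic homolog


def classify_gene(per_target):
    """Classify one query gene against every target genome it was compared to."""
    return {target: classify(ev) for target, ev in per_target.items()}


# Hypothetical query gene compared against three placeholder target genomes.
example = {
    "target_1": SyntenicEvidence(protein_match=True, nucleotide_match=True),
    "target_2": SyntenicEvidence(protein_match=False, nucleotide_match=True),
    "target_3": SyntenicEvidence(protein_match=False, nucleotide_match=False),
}
labels = classify_gene(example)
```

Keeping one label per target genome, rather than one label per gene, mirrors the abstract's description of classifying each query gene against each target genome.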
The fagin pipeline augments standard phylostratigraphy and extends synteny-based phylostratigraphy with an automated, customizable, and detailed contextual analysis. By comparing synteny-based phylostrata to standard phylostrata, fagin systematically identifies those orphans and lineage-specific genes that are well-supported to have originated de novo. Analyzing genomes within a species should distinguish orphan genes that may have originated through rapid divergence from those that originated de novo. The pipeline also delineates whether a gene lacks a syntenic homolog for technical or biological reasons. These analyses indicate that some orphans may be associated with regions of high genomic perturbation.
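The comparison of the two kinds of evidence can be sketched as follows. This is a simplified illustration under stated assumptions, not the decision procedure used by fagin: the function name, the output labels, and the rule itself (an orphan by protein similarity is a de novo candidate when at least one target genome gives an NTic call and none gives an AAic call) are hypothetical.

```python
def de_novo_support(is_orphan_by_similarity, synteny_classes):
    """Combine a standard phylostratigraphy call with per-target synteny classes.

    is_orphan_by_similarity: True if protein similarity searches found no
        homolog outside the focal species (the negative-evidence call).
    synteny_classes: dict mapping target genome name -> "AAic", "NTic", or "Unknown".
    Returns a coarse, hypothetical support label.
    """
    classes = set(synteny_classes.values())
    if not is_orphan_by_similarity:
        return "not an orphan"
    if "AAic" in classes:
        # Synteny found a coding homolog that the similarity search missed.
        return "conflict: coding syntenic homolog"
    if "NTic" in classes:
        # Positive evidence: the gene traces to non-coding sequence in a relative.
        return "de novo candidate"
    # Only Unknown calls: neither approach supplies positive evidence.
    return "unresolved"


# Hypothetical orphan with one non-coding syntenic match and one Unknown call.
call = de_novo_support(True, {"target_1": "NTic", "target_2": "Unknown"})
```

The "unresolved" outcome is where the technical versus biological distinction described above matters, since an Unknown call alone does not say why no syntenic homolog was detected.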