Ancient evolutionary signals of protein-coding sequences allow the discovery of new genes in the Drosophila melanogaster genome.

Affiliation

Centro Andaluz de Biologia del Desarrollo (CABD, UPO-CSIC-JA). Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, 41013, Sevilla, Spain.

Publication information

BMC Genomics. 2020 Mar 5;21(1):210. doi: 10.1186/s12864-020-6632-y.

DOI:10.1186/s12864-020-6632-y

PMID:32138644

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7059364/

Abstract

BACKGROUND

The current growth in DNA sequencing techniques makes genome annotation a crucial task in the genomic era. Traditional gene finders focus on protein-coding sequences, but they are far from exhaustive. The number of known protein-coding genes keeps increasing as new experimental data become available and improved bioinformatics algorithms are developed.

RESULTS

In this context, AnABlast represents a novel in silico strategy based on the accumulation of short evolutionary signals identified by low-scoring protein sequence alignments. This strategy potentially highlights protein-coding regions in genomic sequences regardless of traditional homology or translation signatures. Here, we analyze the evolutionary information contained in the accumulation of these short signals. Using the Drosophila melanogaster genome, we establish optimal parameters for accurate gene prediction with AnABlast and show that this new strategy contributes significantly to adding gene, exon and pseudogene regions that remain to be discovered in both already annotated and new genomes.
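The idea of accumulating short, low-scoring alignment signals along a genomic sequence can be illustrated with a minimal sketch. This is not the AnABlast implementation: the function names `coverage_profile` and `call_peaks`, the toy hit coordinates, and the `min_height` threshold are all hypothetical, and the abstract does not state the parameters actually used. The sketch only assumes that each alignment hit maps to a 0-based, end-exclusive interval on the sequence and that positions covered by many hits are candidate coding regions.

```python
from typing import List, Tuple


def coverage_profile(length: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Count how many hits overlap each position of a sequence.

    Hits are (start, end) intervals, 0-based and end-exclusive, and are
    assumed to satisfy 0 <= start < end <= length.
    """
    # Difference array: +1 where a hit begins, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in hits:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the per-position count.
    profile = []
    running = 0
    for delta in diff[:length]:
        running += delta
        profile.append(running)
    return profile


def call_peaks(profile: List[int], min_height: int) -> List[Tuple[int, int]]:
    """Return maximal (start, end) runs where the profile is >= min_height."""
    peaks = []
    start = None
    for pos, height in enumerate(profile):
        if height >= min_height and start is None:
            start = pos  # a peak opens here
        elif height < min_height and start is not None:
            peaks.append((start, pos))  # the peak closed just before pos
            start = None
    if start is not None:
        peaks.append((start, len(profile)))  # peak runs to the sequence end
    return peaks


# Toy data (hypothetical): three overlapping hits and one isolated hit.
hits = [(2, 8), (3, 9), (4, 10), (15, 18)]
profile = coverage_profile(20, hits)
peaks = call_peaks(profile, min_height=3)
print(peaks)
```

In this toy case only the region where all three overlapping hits pile up passes the threshold, while the isolated hit is ignored. The point is that the signal comes from the accumulation of many weak alignments rather than from any single strong one.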

CONCLUSIONS

AnABlast can be freely used to analyze genomic regions of whole genomes, where it helps to complete the previous annotation.
