Newcomb Garin, Sayood Khalid
Department of Electrical and Computer Engineering, University of Nebraska, Lincoln, NE 68588-0511, USA.
Entropy (Basel). 2021 Oct 11;23(10):1324. doi: 10.3390/e23101324.
An important step in genome annotation is identifying the regions of a genome that code for proteins. Most annotation approaches rely on signals extracted from a genomic region to decide whether that region is protein coding. Because these regions are information-bearing structures, we propose signals based on measures motivated by the average mutual information for use in this task. We show that these signals can be used to identify coding and noncoding sequences with high accuracy. We also show that these signals are robust across species, phyla, and kingdoms and can therefore be used in species-agnostic genome annotation algorithms for identifying protein-coding regions. These in turn could be used for gene identification.
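For readers unfamiliar with the quantity named in the abstract, the following is a minimal sketch of the textbook average mutual information between nucleotides separated by a lag k, I(k) = sum over a, b of p_ab(k) * log2(p_ab(k) / (p_a * p_b)). It is not the signal set proposed in the paper, which the abstract does not specify; the function names `average_mutual_information` and `ami_profile` are hypothetical and chosen only for illustration, and probabilities are estimated from simple pair counts.

```python
from collections import Counter
from math import log2


def average_mutual_information(seq: str, lag: int) -> float:
    """Estimate the average mutual information (in bits) between symbols
    that are `lag` positions apart in `seq`.

    Illustrative only: this is the standard definition of AMI, not
    necessarily the measure used by the authors.
    """
    seq = seq.upper()
    # All (symbol at i, symbol at i + lag) pairs in the sequence.
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)                  # counts for p_ab(lag)
    left = Counter(a for a, _ in pairs)     # counts for the first symbol
    right = Counter(b for _, b in pairs)    # counts for the second symbol
    ami = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (left[a] * right[b]).
        ami += p_ab * log2(count * n / (left[a] * right[b]))
    return ami


def ami_profile(seq: str, max_lag: int) -> list[float]:
    """AMI values for lags 1..max_lag (hypothetical helper)."""
    return [average_mutual_information(seq, k) for k in range(1, max_lag + 1)]


if __name__ == "__main__":
    example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # arbitrary toy sequence
    for k, value in enumerate(ami_profile(example, 6), start=1):
        print(f"lag {k}: {value:.4f} bits")
```

A profile of such values over a window of lags is one way a sequence can be turned into a numeric signal. Estimates from short sequences are biased upward, so in practice the window length matters.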