Jaron Kamil S, Moravec Jiří C, Martínková Natália
Institute of Biostatistics and Analyses, Masaryk University and Institute of Vertebrate Biology, Academy of Sciences of the Czech Republic, Brno, Czech Republic.
Bioinformatics. 2014 Apr 15;30(8):1081-1086. doi: 10.1093/bioinformatics/btt727. Epub 2013 Dec 25.
Genomic islands (GIs) are DNA fragments incorporated into a genome through horizontal gene transfer (also called lateral gene transfer), often with functions novel for a given organism. While methods for their detection are well researched in prokaryotes, the complexity of eukaryotic genomes makes direct utilization of these methods unreliable, and so labour-intensive phylogenetic searches are used instead.
We present a surrogate method that investigates nucleotide base composition of the DNA sequence in a eukaryotic genome and identifies putative GIs. We calculate a genomic signature as a vector of tetranucleotide (4-mer) frequencies using a sliding window approach. Extending the neighbourhood of the sliding window, we establish a local kernel density estimate of the 4-mer frequency. We score the number of 4-mer frequencies in the sliding window that deviate from the credibility interval of their local genomic density using a newly developed discrete interval accumulative score (DIAS). To further improve the effectiveness of DIAS, we select informative 4-mers in a range of organisms using the tetranucleotide quality score developed herein. We show that the SigHunt method is computationally efficient and able to detect GIs in eukaryotic genomes that represent non-ameliorated integration. Thus, it is suited to scanning for change in organisms with different DNA composition.
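The sliding-window signature and the counting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (SigHunt itself is written in C and R). The function names, the window, step and flank parameters, and the quantile levels are all assumptions made for illustration. In particular, the local kernel density estimate and its credibility interval are replaced here by plain empirical quantiles over neighbouring windows, and the tetranucleotide quality score is not modelled at all.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, with a lookup from 4-mer to index.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequencies(seq):
    """Relative frequencies of overlapping 4-mers in seq (a 256-element vector)."""
    counts = [0] * len(KMERS)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:  # skip 4-mers containing N or other ambiguity codes
            counts[j] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def sliding_signatures(seq, window, step):
    """One 4-mer frequency vector per sliding window (hypothetical helper)."""
    return [tetranucleotide_frequencies(seq[s:s + window])
            for s in range(0, len(seq) - window + 1, step)]


def quantile(sorted_vals, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def deviation_count_scores(signatures, flank, lower_q=0.025, upper_q=0.975):
    """For each window, count the 4-mers whose frequency lies outside an
    interval computed from up to `flank` windows on either side.

    Assumption: the interval is an empirical quantile range of the
    neighbouring windows, a simplified stand-in for the credibility
    interval of a local kernel density estimate.
    """
    n = len(signatures)
    scores = []
    for w in range(n):
        neighbours = [signatures[v]
                      for v in range(max(0, w - flank), min(n, w + flank + 1))
                      if v != w]
        if not neighbours:  # a single window has no local background
            scores.append(0)
            continue
        score = 0
        for k in range(len(KMERS)):
            vals = sorted(nb[k] for nb in neighbours)
            low, high = quantile(vals, lower_q), quantile(vals, upper_q)
            if signatures[w][k] < low or signatures[w][k] > high:
                score += 1
        scores.append(score)
    return scores
```

A call such as `deviation_count_scores(sliding_signatures(genome, 1000, 1000), flank=5)`, where `genome` is a DNA string, returns one integer per window. Windows whose composition differs from their neighbourhood receive higher counts and would be the candidates for closer inspection.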
Source code and scripts are freely available for download at http://www.iba.muni.cz/index-en.php?pg=research-data-analysis-tools-sighunt. They are implemented in C and R and are platform-independent.