Tong Hao, Schliekelman Paul, Mrázek Jan
Department of Statistics, University of Georgia, Athens, GA, 30602, USA.
Department of Microbiology and Institute of Bioinformatics, University of Georgia, Athens, GA, 30602, USA.
BMC Genomics. 2017 Jan 5;18(1):27. doi: 10.1186/s12864-016-3400-0.
DNA sequences contain repetitive motifs that serve various functions in the physiology of the organism. A number of methods have been developed for discovering such sequence motifs, with a primary focus on regulatory motifs and particularly transcription factor binding sites. Most motif-finding methods apply probabilistic models to detect motifs characterized by an unusually high number of copies in the analyzed sequences.
We present a novel method for detection of pairs of motifs separated by spacers of variable nucleotide sequence but conserved length. Unlike existing methods for motif discovery, the motifs themselves are not required to occur at unusually high frequency but only to exhibit a significant preference to occur at a specific distance from each other. In the present implementation of the method, motifs are represented by pentamers, and all pairs of pentamers are evaluated for a statistically significant preference for a specific distance. An important step of the algorithm eliminates motif pairs whose spacers exhibit a high degree of sequence similarity; such motif pairs likely arise from duplications of the whole segment, including the motifs and the spacer, rather than from selective constraints indicative of functional importance of the motif pair. The method was used to scan 569 complete prokaryotic genomes for novel sequence motifs. Some of the detected motifs were previously known, whereas others appear to be novel. Selected motif pairs were investigated further, and in some cases possible biological functions were proposed.
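The two steps described above (tallying spacer lengths for every pentamer pair, and discarding pairs whose spacers are too similar) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `max_spacer` window, the count-to-mean ratio used in place of a proper significance test, and the mean pairwise identity used as the similarity measure are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations

K = 5  # motif length: pentamers, as stated in the abstract


def spacer_histograms(seq, max_spacer=30):
    """Map (left_pentamer, right_pentamer) -> Counter of spacer lengths.

    max_spacer is a hypothetical window size, not a value from the paper.
    """
    hist = defaultdict(Counter)
    n = len(seq)
    for i in range(n - K + 1):
        left = seq[i:i + K]
        for spacer in range(max_spacer + 1):
            j = i + K + spacer  # start of the right-hand pentamer
            if j + K > n:
                break
            hist[(left, seq[j:j + K])][spacer] += 1
    return hist


def preferred_spacer(counts, max_spacer=30):
    """Return (spacer, count, ratio) for the most frequent spacer length.

    ratio = count / mean count over all spacer lengths in the window.
    This crude enrichment score stands in for the statistical test,
    which the abstract does not specify.
    """
    mean = sum(counts.values()) / (max_spacer + 1)
    spacer, count = counts.most_common(1)[0]
    return spacer, count, count / mean


def mean_spacer_identity(spacers):
    """Mean fraction of identical positions over all pairs of spacers.

    Assumes all spacers have the same length. A high value suggests the
    copies arose by duplication of the whole segment, so the pair would
    be filtered out.
    """
    pairs = list(combinations(spacers, 2))
    if not pairs or not spacers[0]:
        return 0.0
    total = sum(sum(x == y for x, y in zip(a, b)) / len(a) for a, b in pairs)
    return total / len(pairs)
```

In such a sketch, a pentamer pair would be kept only if its enrichment score at the preferred spacer length is high while the mean identity of the corresponding spacers stays below some chosen threshold; both cutoffs would have to be set by the user.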
We present a new motif-finding technique that is applicable to scanning complete genomes for sequence motifs. The results from analysis of 569 genomes suggest that the method detects previously known motifs that are expected to be found as well as new motifs that are unlikely to be discovered by traditional motif-finding methods. We conclude that our approach to detection of significant motif pairs can complement existing motif-finding techniques in discovery of novel functional sequence motifs in complete genomes.