Basu Analabha, Chaudhuri Probal, Majumder Partha P
Human Genetics Unit, Indian Statistical Institute, Kolkata, 700108 India.
Genome Res. 2005 Jan;15(1):67-77. doi: 10.1101/gr.2358005.
The problem of identifying motifs comprising nucleotides at a set of polymorphic DNA sites, not necessarily contiguous, arises in many problems in human genetics. However, when the sites are not contiguous, no efficient algorithm exists for polymorphic motif identification, and a search based on complete enumeration is computationally inefficient. We have developed probabilistic search algorithms to discover motifs of known or unknown lengths. We have also developed statistical tests of significance for assessing a motif discovery, and a statistical criterion for simultaneously estimating the length of a motif and discovering the motif itself. We have tested these algorithms on various synthetic data sets and have shown that they are very efficient, in the sense that the "true" motifs can be detected in the vast majority of replications and in a small number of iterations. Additionally, we have applied them to some real data sets and have shown that they are able to identify known motifs. In certain applications, it is pertinent to find motifs that contain contrasting nucleotides at the sites included in the motif (e.g., motifs identified in case-control association studies). For this, we have suggested appropriate modifications. Using simulations, we have found that the success rate of identification of the correct motif is high in case-control studies except when relative risks are small. Our analyses of evolutionary data sets resulted in the identification of some motifs that appear to have important implications for human evolutionary inference. These algorithms can easily be implemented to discover motifs from multilocus genotype data by simple numerical recoding of genotypes.
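To make the contrast between complete enumeration and probabilistic search concrete, the following is a minimal Python sketch and not the authors' algorithm. The abstract does not specify the scoring function or the proposal scheme, so everything here is an assumption made for illustration: haplotypes are coded as lists of 0/1 alleles, a candidate motif is scored by the count of its most frequent allele pattern, and the search is a plain stochastic hill climb that swaps one site at a time. The names `motif_score`, `stochastic_motif_search`, and `make_demo_data` are hypothetical. For scale, complete enumeration of motifs of length 5 over 100 sites would have to examine C(100, 5) = 75,287,520 site subsets.

```python
import random
from collections import Counter


def motif_score(haplotypes, sites):
    """Return (count, pattern) for the most frequent allele pattern at `sites`.

    Assumed scoring rule for this sketch only; the abstract does not
    state which criterion the authors optimise.
    """
    patterns = Counter(tuple(h[s] for s in sites) for h in haplotypes)
    pattern, count = patterns.most_common(1)[0]
    return count, pattern


def stochastic_motif_search(haplotypes, k, iterations=2000, seed=0):
    """Stochastic hill climb over sets of k (possibly non-contiguous) sites."""
    rng = random.Random(seed)
    n_sites = len(haplotypes[0])
    current = sorted(rng.sample(range(n_sites), k))
    score, pattern = motif_score(haplotypes, current)
    for _ in range(iterations):
        # Propose replacing one randomly chosen site with a site outside the set.
        drop = rng.randrange(k)
        outside = [s for s in range(n_sites) if s not in current]
        add = rng.choice(outside)
        proposal = sorted(current[:drop] + current[drop + 1:] + [add])
        new_score, new_pattern = motif_score(haplotypes, proposal)
        # Accept the proposal only if it scores at least as well.
        if new_score >= score:
            current, score, pattern = proposal, new_score, new_pattern
    return current, pattern, score


def make_demo_data(n_haplotypes=400, n_sites=12, seed=1):
    """Synthetic haplotypes with a motif planted at non-contiguous sites 2, 5, 9."""
    rng = random.Random(seed)
    planted_sites, planted_alleles = (2, 5, 9), (1, 0, 1)
    n_carriers = n_haplotypes * 4 // 5  # the first 80% carry the motif
    data = []
    for i in range(n_haplotypes):
        h = [rng.randint(0, 1) for _ in range(n_sites)]
        if i < n_carriers:
            for s, a in zip(planted_sites, planted_alleles):
                h[s] = a
        data.append(h)
    return data
```

Calling `stochastic_motif_search(make_demo_data(), k=3)` returns the selected sites, the allele pattern at those sites, and the number of haplotypes carrying it. On this synthetic data the planted motif is a strong signal, so the search should recover it after examining far fewer site sets than real data would require under exhaustive enumeration. Real data, significance testing, and unknown motif length are outside the scope of this sketch.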