Zhang Kun, Fan Wei, Deininger Prescott, Edwards Andrea, Xu Zujia, Zhu Dongxiao
Department of Computer Science, Xavier University of Louisiana, New Orleans, Louisiana 70125, USA.
Int J Comput Biol Drug Des. 2009;2(4):302-22. doi: 10.1504/IJCBDD.2009.030763. Epub 2009 Jan 4.
Insertion site characterisation of Alu elements is an important problem in primate-specific bioinformatics research. The problem poses three key challenges: the data do not come as pre-defined feature vectors suitable for predictive model construction; general patterns must be discovered, and biological insight drawn from them, without any prior knowledge; and compact yet discriminative patterns must be obtained from a search space of size 4^200. This paper provides an integrated algorithmic framework for these mining tasks. Compared with the benchmark biological study, our results provide a more refined analysis of the patterns involved in Alu insertion. In particular, we acquire a 200 nt predictive profile around the primary insertion site which not only contains the widely accepted consensus but also suggests a longer pattern (T(7)AA[G'A]AATAA). This pattern provides more insight into the favourable sequence variations allowed for preferred binding and cleavage by the L1 ORF2 endonuclease. The proposed method is general enough to be applied to other sequence detection problems, such as microRNA target prediction.
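As a rough illustration of what scanning for the longer pattern could look like, the sketch below turns it into a regular expression and reports where it occurs in a sequence. It is not the framework described in the paper. It assumes that T(7) denotes seven consecutive T's and that the bracketed position admits either G or A; the function names `find_long_motif` and `search_space_size` are hypothetical.

```python
import re

# Assumed reading of the abstract's notation (not confirmed by the source):
#   T(7)      -> seven consecutive T's
#   bracketed -> either G or A at that position
# The lookahead makes each match zero-width, so overlapping hits are reported.
LONG_MOTIF = re.compile(r"(?=T{7}AA[GA]AATAA)")


def find_long_motif(seq):
    """Return the 0-based start position of every occurrence of the motif."""
    return [m.start() for m in LONG_MOTIF.finditer(seq.upper())]


def search_space_size(length=200, alphabet_size=4):
    """Number of distinct nucleotide strings of the given length."""
    return alphabet_size ** length


if __name__ == "__main__":
    # Toy sequence made up for this example, not data from the study.
    print(find_long_motif("ccTTTTTTTAAGAATAAgg"))
    # Number of decimal digits in 4^200.
    print(len(str(search_space_size())))
```

The digit count printed at the end shows why exhaustively enumerating every candidate in a 200 nt window is not feasible, which is the motivation for seeking compact patterns instead.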