Evans Scott C, Kourtidis Antonis, Markham T Stephen, Miller Jonathan, Conklin Douglas S, Torres Andrew S
GE Global Research, One Research Circle, Niskayuna, NY 12309, USA.
EURASIP J Bioinform Syst Biol. 2007;2007(1):43670. doi: 10.1155/2007/43670.
We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. Because the MDL model quantifies the cost in bits of each phrase, it allows prediction of the regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.
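To make the idea of a two-part code and a per-phrase cost in bits concrete, the following is a minimal sketch under simplifying assumptions; it is not the OSCR or MDLcompress algorithm itself. It assumes a uniform fixed-length code over the nucleotide alphabet, a single candidate phrase, and no recursion. The function names `two_part_cost` and `savings_in_bits` are hypothetical and chosen only for illustration.

```python
import math


def two_part_cost(seq: str, phrase: str, alphabet_size: int = 4) -> tuple[float, float]:
    """Return (baseline_bits, two_part_bits) for replacing `phrase` with one new symbol.

    Assumption: every symbol is coded with a uniform fixed-length code,
    so a symbol costs log2(alphabet size) bits. A real MDL coder would
    use a more careful code for both the model and the data.
    """
    bits_per_symbol = math.log2(alphabet_size)
    baseline_bits = len(seq) * bits_per_symbol

    # str.count counts non-overlapping occurrences, scanning left to right.
    occurrences = seq.count(phrase)

    # Part 1 (model): spell the phrase out once in the original alphabet.
    model_bits = len(phrase) * bits_per_symbol

    # Part 2 (data): the sequence with each occurrence replaced by one new
    # symbol, coded over an alphabet that is one symbol larger.
    rewritten_length = len(seq) - occurrences * len(phrase) + occurrences
    data_bits = rewritten_length * math.log2(alphabet_size + 1)

    return baseline_bits, model_bits + data_bits


def savings_in_bits(seq: str, phrase: str, alphabet_size: int = 4) -> float:
    """Positive when adding the phrase to the model shortens the total description."""
    baseline_bits, two_part_bits = two_part_cost(seq, phrase, alphabet_size)
    return baseline_bits - two_part_bits


if __name__ == "__main__":
    sequence = "ACGTACGTACGTACGT"
    for candidate in ("ACGT", "GGGG"):
        print(candidate, round(savings_in_bits(sequence, candidate), 3))
```

In this toy setting a phrase that repeats often yields positive savings, while a phrase that never occurs only adds model cost. A change to a single base inside a high-savings phrase would remove one occurrence and lower the savings, which is a rough illustration of why bit cost could flag regions sensitive to SNPs.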