Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Bioinformatics. 2012 Jun 1;28(11):1446-54. doi: 10.1093/bioinformatics/bts155. Epub 2012 Apr 5.
Sequence analysis algorithms are often applied to sets of DNA, RNA or protein sequences to identify common or distinguishing features. Controlling for sequence length variation is critical to properly score sequence features and identify true biological signals rather than length-dependent artifacts.
Several cis-regulatory module discovery algorithms exhibit a substantial dependence between DNA sequence score and sequence length. Our newly developed LOESS method flexibly captures diverse score-length relationships and corrects DNA sequence scores for length-dependent artifacts more effectively than four other approaches. When applied to genes co-expressed during Drosophila melanogaster embryonic mesoderm development or neural development and scored by the Lever motif analysis algorithm, the method successfully recovered their biologically validated cis-regulatory codes. The LOESS length-correction method is broadly applicable, and may be useful not only for more accurate inference of cis-regulatory codes, but also for detection of other types of patterns in biological sequences.
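The general idea of a LOESS-based length correction can be illustrated with a minimal sketch. It assumes the correction amounts to fitting a local linear regression of score on sequence length (tricube weights over the nearest fraction of points) and taking the residual as the corrected score. The abstract does not spell out the published procedure, so this may differ from it in detail. The function names, the `frac` parameter, and the data are hypothetical and chosen only for illustration.

```python
import math


def loess_at(x0, xs, ys, frac=0.5):
    """Local linear fit evaluated at x0 (illustrative, not the published code).

    Uses tricube weights over the nearest `frac` share of the points.
    """
    n = len(xs)
    k = min(n, max(2, math.ceil(frac * n)))
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point

    weights = []
    for x in xs:
        u = abs(x - x0) / h
        weights.append((1.0 - u ** 3) ** 3 if u < 1.0 else 0.0)

    sw = sum(weights)
    if sw == 0.0:
        # No point falls inside the window; fall back to the global mean.
        return sum(ys) / n

    # Weighted least-squares line through the neighbourhood of x0.
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 1e-12 else 0.0
    return my + slope * (x0 - mx)


def length_corrected_scores(lengths, scores, frac=0.5):
    """Return each score minus the LOESS-predicted score at its length."""
    return [s - loess_at(l, lengths, scores, frac)
            for l, s in zip(lengths, scores)]


if __name__ == "__main__":
    # Made-up data: scores grow with length; the fifth sequence has extra signal.
    lengths = [100 * i for i in range(1, 11)]
    scores = [0.01 * l + 3.0 for l in lengths]
    scores[4] += 5.0
    for l, c in zip(lengths, length_corrected_scores(lengths, scores)):
        print(l, round(c, 3))
```

In this toy setting, a sequence whose raw score is high only because it is long ends up with a residual near zero, while a sequence scoring above the local length trend keeps a positive corrected score. The choice of `frac` controls how closely the fit follows the score-length curve and would need tuning on real data.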
Source code and compiled code are available from http://thebrain.bwh.harvard.edu/LM_LOESS/