Schulze Uta, Hepp Bettina, Ong Cheng Soon, Rätsch Gunnar
Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany.
Bioinformatics. 2007 Aug 1;23(15):1892-900. doi: 10.1093/bioinformatics/btm275. Epub 2007 May 30.
Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task.
We present a novel approach based on large margin learning that combines accurate splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm, called PALMA, tunes the parameters of the model such that true alignments score higher than other alignments. We study the accuracy of alignments of mRNAs containing artificially generated micro-exons to genomic DNA. In a carefully designed experiment, we show that our algorithm accurately identifies the intron boundaries as well as the boundaries of the optimal local alignment. It outperforms all other methods: for 5702 artificially shortened EST sequences from Caenorhabditis elegans and human, it correctly identifies the intron boundaries in all except two cases. The best other method, the recently proposed exalin, misaligns 37 of the sequences. Our method is also robust to mutations, insertions and deletions, retaining its accuracy even at high noise levels.
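The large margin idea described above (tune the parameters so that the true alignment outscores competing alignments) can be illustrated with a minimal sketch. This is not PALMA's actual formulation or code: the linear scoring function, the four hypothetical features (matches, mismatches, gaps, summed splice site score), the hand-written list of competing alignments and the plain subgradient solver are all assumptions made for illustration. The objective, an L2-regularized hinge loss, is convex, which is the property the abstract relies on.

```python
def dot(w, phi):
    """Linear alignment score: weighted sum of alignment features."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def train_large_margin(examples, reg=0.01, lr=0.1, epochs=200):
    """Subgradient descent on a regularized hinge loss (illustrative only).

    examples: list of (phi_true, rivals) pairs, where phi_true is the
    feature vector of the known-correct alignment and rivals is a list of
    feature vectors of competing (incorrect) alignments.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        # Gradient of the regularizer (reg / 2) * ||w||^2.
        grad = [reg * wi for wi in w]
        for phi_true, rivals in examples:
            # Highest-scoring rival, i.e. the most violated constraint.
            worst = max(rivals, key=lambda phi: dot(w, phi))
            # Hinge term is active when the margin is below 1.
            if dot(w, phi_true) - dot(w, worst) < 1.0:
                for k in range(dim):
                    grad[k] += (worst[k] - phi_true[k]) / len(examples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical features: [matches, mismatches, gaps, splice_site_score_sum]
examples = [
    (
        [20, 0, 0, 1.8],          # true alignment
        [
            [20, 0, 0, 0.3],      # same matches, weak splice sites
            [18, 2, 0, 1.8],      # two mismatches
            [19, 0, 1, 1.5],      # one gap, slightly weaker splice sites
        ],
    )
]

w = train_large_margin(examples)
phi_true, rivals = examples[0]
for r in rivals:
    print(round(dot(w, phi_true) - dot(w, r), 3))  # margin over each rival
```

In this toy setting the rivals are enumerated by hand. In a real aligner the highest-scoring competing alignment would presumably be found by the alignment routine itself rather than taken from a fixed list.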
Datasets for training, evaluation and testing, additional results, and a stand-alone alignment tool implemented in C++ and Python are available at http://www.fml.mpg.de/raetsch/projects/palma