Meinicke Peter, Tech Maike, Morgenstern Burkhard, Merkl Rainer
Abteilung Bioinformatik, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany.
BMC Bioinformatics. 2004 Oct 28;5:169. doi: 10.1186/1471-2105-5-169.
Kernel-based learning algorithms are among the most advanced machine learning methods and have been successfully applied to a variety of sequence classification tasks within the field of bioinformatics. Conventional kernels utilized so far do not provide an easy interpretation of the learnt representations in terms of positional and compositional variability of the underlying biological signals.
We propose a kernel-based approach to data mining on biological sequences. With our method it is possible to model and analyze positional variability of oligomers of any length in a natural way. On the one hand, this is achieved by mapping the sequences to an intuitive but high-dimensional feature space that is well suited to interpretation of the learnt models. On the other hand, by means of the kernel trick we can provide a general learning algorithm for that high-dimensional representation, because all required statistics can be computed without performing an explicit feature space mapping of the sequences. By introducing a kernel parameter that controls the degree of position-dependency, our feature space representation can be tailored to the characteristics of the biological problem at hand. A regularized learning scheme enables application even to biological problems for which only small sets of example sequences are available. Our approach includes a visualization method for transparent representation of characteristic sequence features. The importance of features can thereby be measured in terms of their discriminative strength with respect to classification of the underlying sequences. To demonstrate and validate our concept on a biochemically well-defined case, we analyze E. coli translation initiation sites in order to show that we can find biologically relevant signals. For that case, our results clearly show that the Shine-Dalgarno sequence is the most important signal upstream of the start codon. The variability in position and composition we found for that signal is in accordance with previous biological knowledge. We also find evidence for signals downstream of the start codon, previously introduced as transcriptional enhancers. These signals are mainly characterized by occurrences of adenine in a region of about 4 nucleotides next to the start codon.
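To make the idea of a position-dependency parameter concrete, the following is a minimal Python sketch of one way such a kernel could be evaluated without an explicit feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names `oligo_positions` and `oligo_kernel`, the Gaussian weighting of position differences, the factor 4 in the exponent, and the omission of any constant normalization factor are all assumptions made here. The parameter `sigma` plays the role of the kernel parameter described above: small values demand that shared oligomers occur at the same position, large values reduce the kernel to position-independent co-occurrence counting.

```python
import math
from collections import defaultdict


def oligo_positions(seq, k):
    """Map each oligomer of length k to the list of its start positions in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def oligo_kernel(s, t, k=3, sigma=1.0):
    """Sketch of a position-smoothed oligomer kernel (assumed form).

    For every oligomer shared by s and t, each pair of occurrence
    positions (p, q) contributes exp(-(p - q)^2 / (4 * sigma^2)).
    Constant normalization factors are deliberately left out.
    """
    pos_s = oligo_positions(s, k)
    pos_t = oligo_positions(t, k)
    total = 0.0
    for oligo, ps in pos_s.items():
        qs = pos_t.get(oligo)
        if not qs:
            continue  # oligomer does not occur in t
        for p in ps:
            for q in qs:
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total


if __name__ == "__main__":
    a = "TTAGGAGGTTATG"  # hypothetical toy sequences, not real data
    b = "TAGGAGGTTTATG"
    for sigma in (0.1, 1.0, 10.0):
        print(sigma, oligo_kernel(a, b, k=3, sigma=sigma))
```

In this sketch the cost is driven by the number of matching occurrence pairs rather than by the size of the feature space, which is the practical point of the kernel trick mentioned above. A kernel of this kind could then be passed as a precomputed Gram matrix to any kernel-based classifier.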
We showed that the oligo kernel can provide a valuable tool for the analysis of relevant signals in biological sequences. In the case of translation initiation sites we could clearly deduce the most discriminative motifs and their positional variation from example sequences. Attractive features of our approach are its flexibility with respect to oligomer length and position conservation. By means of these two parameters oligo kernels can easily be adapted to different biological problems.