Mersch Britta, Gepperth Alexander, Suhai Sándor, Hotz-Wagenblatt Agnes
Department of Molecular Biophysics, German Cancer Research Center DKFZ, Im Neuenheimer Feld 580, Heidelberg, Germany.
BMC Bioinformatics. 2008 Sep 10;9:369. doi: 10.1186/1471-2105-9-369.
Exonic splicing enhancers (ESEs) activate nearby splice sites and promote the inclusion, rather than exclusion, of the exons in which they reside; they also serve as binding sites for SR proteins. To study the impact of ESEs on alternative splicing, it would be useful to be able to detect them in exons. Identifying SR protein-binding sites in human DNA sequences by machine learning techniques is a formidable task, since exon sequences are also constrained by their functional role in coding for proteins.
Choosing the training examples needed for machine learning approaches is difficult, since the literature describes only a few exact locations of human ESEs that could serve as positive examples. Additionally, it is unclear which sequences are suitable as negative examples. Therefore, we developed a motif-oriented data-extraction method that extracts exon sequences around experimentally or theoretically determined ESE patterns. Positive examples are restricted by heuristics based on known properties of ESEs, e.g. location in the vicinity of a splice site, whereas negative examples are taken in the same way from the middle of long exons. We show that a suitably chosen support vector machine (SVM) using optimized sequence kernels (e.g., the combined oligo kernel) can extract meaningful properties from these training examples. Once the classifier is trained, every potential ESE sequence can be passed to the SVM for verification. Using SVMs with the combined oligo kernel yields a high accuracy of about 90 percent and readily interpretable parameters.
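The motif-oriented extraction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names and the parameters `flank`, `near`, and `long_exon` (with their default values) are hypothetical, and the distance to the nearest exon end stands in for "vicinity of a splice site".

```python
from typing import Iterator, List, Sequence, Tuple


def find_motif(seq: str, motif: str) -> Iterator[int]:
    """Yield every (possibly overlapping) start index of motif in seq."""
    start = seq.find(motif)
    while start != -1:
        yield start
        start = seq.find(motif, start + 1)


def extract_examples(
    exons: Sequence[str],
    motifs: Sequence[str],
    flank: int = 10,       # hypothetical: context kept on each side of the motif
    near: int = 30,        # hypothetical: maximum distance counted as "near"
    long_exon: int = 300,  # hypothetical: minimum length of a "long" exon
) -> Tuple[List[str], List[str]]:
    """Collect windows around motif hits as positive or negative examples."""
    positives: List[str] = []
    negatives: List[str] = []
    for exon in exons:
        n = len(exon)
        for motif in motifs:
            for i in find_motif(exon, motif):
                lo, hi = i - flank, i + len(motif) + flank
                if lo < 0 or hi > n:
                    continue  # the window would extend beyond the exon
                window = exon[lo:hi]
                # Distance from the motif to the nearest exon end (splice site).
                dist_to_end = min(i, n - (i + len(motif)))
                # Distance from the motif centre to the exon centre.
                dist_to_mid = abs((i + len(motif) / 2) - n / 2)
                if dist_to_end <= near:
                    positives.append(window)
                elif n >= long_exon and dist_to_mid <= near:
                    negatives.append(window)
    return positives, negatives
```

Hits that are neither close to an exon end nor close to the centre of a long exon are simply discarded in this sketch, which keeps the two classes well separated at the cost of fewer examples.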
The motif-oriented data-extraction method seems to produce consistent training and test data, leading to good classification rates, and thus allows verification of potential ESE motifs. The best results were obtained using an SVM with the combined oligo kernel, while oligo kernels restricted to oligomers of a single length could be used to extract relevant features.
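For readers unfamiliar with oligo kernels, the following is a minimal sketch of the idea: two sequences are similar when they contain the same oligomers at nearby positions, with a Gaussian of width `sigma` controlling how much positional shift is tolerated. The sketch assumes the common formulation in which each pair of matching oligomer occurrences at positions p and q contributes exp(-(p - q)^2 / (4 sigma^2)), omits the constant prefactor, and assumes that the combined kernel is a plain sum of single-length kernels. The function names and default values are illustrative and are not taken from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def oligo_positions(seq: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer occurring in seq to the list of its start positions."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def oligo_kernel(s: str, t: str, k: int, sigma: float = 1.0) -> float:
    """Oligo kernel for k-mers, without the constant prefactor."""
    pos_s, pos_t = oligo_positions(s, k), oligo_positions(t, k)
    total = 0.0
    for oligo, p_list in pos_s.items():
        q_list = pos_t.get(oligo, ())  # .get avoids creating empty entries
        for p in p_list:
            for q in q_list:
                # Matching oligomers count more the closer their positions are.
                total += math.exp(-((p - q) ** 2) / (4.0 * sigma ** 2))
    return total


def combined_oligo_kernel(
    s: str, t: str, ks: Sequence[int] = (1, 2, 3), sigma: float = 1.0
) -> float:
    """Assumed combination: sum of single-length oligo kernels over ks."""
    return sum(oligo_kernel(s, t, k, sigma) for k in ks)
```

A Gram matrix built from such a function could then be handed to any SVM implementation that accepts precomputed kernels. With a small `sigma` the kernel is strongly position-dependent, while a large `sigma` approaches a position-independent count of shared oligomers.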