Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.

Affiliation

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

Gene. 2019 Jul 15;705:113-126. doi: 10.1016/j.gene.2019.04.047. Epub 2019 Apr 19.

DOI:10.1016/j.gene.2019.04.047

Abstract

Identifying splice sites is essential for predicting gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identifying splice sites. However, nucleotide sequences must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of 1st and 2nd order Markov models (MM1 + MM2) and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and their performance was then validated in another 4 species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding achieved the highest accuracy, followed by either MM2 or FDDM. Likewise, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for splice site prediction. We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement existing nucleotide sequence encoding approaches. This study should be useful to computational biologists for predicting different functional elements in genomic DNA.
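
The encoding step mentioned in the abstract, turning nucleotide strings into numeric features before they reach a classifier, can be illustrated with a minimal Python sketch. It is not the EncDNA implementation and does not reproduce the paper's exact definitions. It shows a simplified reading of two of the ideas named above: sparse (one-hot) encoding, and a 1st order Markov encoding in which each position is represented by a conditional probability estimated from aligned true sites. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"


def sparse_encode(seq):
    """One-hot ("sparse") encoding: each base becomes 4 binary indicators."""
    return [1 if base == n else 0 for base in seq for n in NUCLEOTIDES]


def fit_markov1(true_sites, pseudocount=1.0):
    """Estimate position-specific P(x_i | x_{i-1}) from aligned true-site windows.

    Assumption: all windows have the same length and contain only A, C, G, T.
    The pseudocount (hypothetical choice) avoids zero probabilities.
    """
    length = len(true_sites[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for seq in true_sites:
        for i in range(1, length):
            counts[i - 1][(seq[i - 1], seq[i])] += 1.0

    probs = []
    for pos_counts in counts:
        table = {}
        for prev in NUCLEOTIDES:
            total = sum(pos_counts[(prev, n)] for n in NUCLEOTIDES) + 4 * pseudocount
            for n in NUCLEOTIDES:
                table[(prev, n)] = (pos_counts[(prev, n)] + pseudocount) / total
        probs.append(table)
    return probs


def markov1_encode(seq, probs):
    """Encode a window as one conditional probability per adjacent base pair."""
    return [probs[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]


# Hypothetical toy windows around a donor site (not data from the study).
true_sites = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA"]
candidate = "AGGTAAGT"

onehot = sparse_encode(candidate)           # 8 bases x 4 indicators
probs = fit_markov1(true_sites)             # one transition table per position
features = markov1_encode(candidate, probs) # 7 numeric features
```

Either feature vector could then be passed to a supervised learner such as an SVM or random forest. In a real evaluation the probability tables would be estimated on training data only, so that test sequences do not leak into the encoding.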
