Tran Van Du T, Tempel Sebastien, Zerath Benjamin, Zehraoui Farida, Tahi Fariza
IBISC - IBGBI, University of Evry, 91037 Evry CEDEX, France; Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
IBISC - IBGBI, University of Evry, 91037 Evry CEDEX, France; LCB, CNRS UMR 7283, 13009 Marseille, France.
RNA. 2015 May;21(5):775-85. doi: 10.1261/rna.043612.113. Epub 2015 Mar 20.
Identification of microRNAs (miRNAs) is an important step toward understanding post-transcriptional gene regulation and miRNA-related pathology. Difficulties in identifying miRNAs through experimental techniques, combined with the huge amount of data from new sequencing technologies, have made in silico discrimination of bona fide miRNA precursors from non-miRNA hairpin-like structures an important topic in bioinformatics. Among the various techniques developed for this classification problem, machine learning approaches have proved to be the most promising. However, these approaches require training data, and the imbalance between the number of miRNAs (positive data) and non-miRNAs (negative data) degrades their performance. To address this issue, we present an ensemble method that uses a boosting technique with support vector machine components to deal with imbalanced training data. Classification is performed after feature selection on 187 novel and existing features. The algorithm, miRBoost, outperformed state-of-the-art methods on imbalanced human and cross-species data. It also showed the highest ability among the tested methods to discover novel miRNA precursors. In addition, miRBoost was over 1400 times faster than the second most accurate tool tested and was significantly faster than most of the other tools. miRBoost thus provides a good compromise between prediction efficiency and execution time, making it highly suitable for genome-wide miRNA precursor prediction. The miRBoost software is available on our web server: http://EvryRNA.ibisc.univ-evry.fr.
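The general idea of boosting support vector machine components on imbalanced data can be illustrated with a minimal sketch. It is not the miRBoost implementation: the abstract does not specify the boosting variant, the kernel, or the features, so everything here is an assumption made for illustration. The sketch uses AdaBoost-style reweighting, a linear SVM fitted by plain subgradient descent, class-balanced starting weights, and a hypothetical one-dimensional toy data set (`TOY_X`, `TOY_Y`) with 2 positives and 10 negatives.

```python
import math


def train_weighted_linear_svm(X, y, w, epochs=200, lam=0.01, lr=0.1):
    """Fit a linear SVM by batch subgradient descent on a weighted hinge loss."""
    dim = len(X[0])
    coef, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad = [lam * c for c in coef]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi, wi in zip(X, y, w):
            margin = yi * (sum(c * v for c, v in zip(coef, xi)) + bias)
            if margin < 1:  # only margin violators contribute to the subgradient
                for j in range(dim):
                    grad[j] -= wi * yi * xi[j]
                grad_b -= wi * yi
        coef = [c - lr * g for c, g in zip(coef, grad)]
        bias -= lr * grad_b
    return coef, bias


def svm_predict(model, x):
    coef, bias = model
    return 1 if sum(c * v for c, v in zip(coef, x)) + bias >= 0 else -1


def boost(X, y, rounds=5):
    """AdaBoost-style loop whose weak learners are weighted linear SVMs."""
    n_pos = sum(1 for v in y if v == 1)
    n_neg = len(y) - n_pos
    # Assumption: start from class-balanced weights so that the minority
    # (positive) class carries half of the total weight.
    w = [0.5 / n_pos if v == 1 else 0.5 / n_neg for v in y]
    ensemble = []
    for _ in range(rounds):
        model = train_weighted_linear_svm(X, y, w)
        preds = [svm_predict(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        if all(p == yi for p, yi in zip(preds, y)):
            break  # nothing left to reweight
        # Raise the weight of misclassified examples, lower the rest.
        w = [wi * math.exp(-alpha * yi * p) for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all SVM components."""
    score = sum(alpha * svm_predict(model, x) for alpha, model in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: 2 positives ("miRNA") versus 10 negatives.
TOY_X = [[2.0], [3.0]] + [[-1.0], [-1.5], [-2.0], [-2.5], [-3.0]] * 2
TOY_Y = [1, 1] + [-1] * 10

ensemble = boost(TOY_X, TOY_Y)
recall = sum(predict(ensemble, x) == 1 for x, t in zip(TOY_X, TOY_Y) if t == 1) / 2
print("components:", len(ensemble), "recall on positives:", recall)
```

Because the toy data are linearly separable, the loop is expected to stop early. On overlapping data, later rounds would shift weight toward the examples that earlier components misclassified, which is how boosting keeps the minority class from being ignored.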