Wang Chunlin, Ding Chris, Meraz Richard F, Holbrook Stephen R
Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Bioinformatics. 2006 Nov 1;22(21):2590-6. doi: 10.1093/bioinformatics/btl441. Epub 2006 Aug 31.
Small non-coding RNA (ncRNA) genes play important regulatory roles in a variety of cellular processes. However, detecting ncRNA genes is a great challenge for both experimental and computational approaches. In this study, we describe a new approach called positive sample only learning (PSoL) to predict ncRNA genes in the Escherichia coli genome. Although PSoL is a machine learning method for classification, it requires no negative training data, which are generally hard to define properly and can dramatically affect the performance of machine learning. In addition, by using the support vector machine (SVM) as its core learning algorithm, PSoL can integrate many different kinds of information to improve prediction accuracy. Beyond ncRNA prediction, PSoL is applicable to many other bioinformatics problems.
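As a rough illustration of learning from positive samples only, the sketch below starts from known positives and an unlabeled pool, seeds a negative set with the unlabeled points farthest from the positives, and then grows that set iteratively. It is a generic positive-and-unlabeled heuristic, not the published PSoL procedure: a nearest-centroid scorer stands in for the SVM, and every name, parameter (`n_seed`, `margin`, `max_rounds`) and data point is a hypothetical choice made for the example.

```python
import math

def centroid(points):
    """Mean point of a list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def score(x, pos_c, neg_c):
    """Positive when x is closer to the positive centroid than to the negative one."""
    return math.dist(x, neg_c) - math.dist(x, pos_c)

def positive_only_learn(positives, unlabeled, n_seed=3, margin=1.0, max_rounds=10):
    """Toy positive-only learner (stand-in for an SVM-based method).

    Returns (candidates, negatives): unlabeled points that end up on the
    positive side, and the inferred negative set.
    """
    pos_c = centroid(positives)
    # Seed negatives: the unlabeled points farthest from the known positives.
    ranked = sorted(unlabeled, key=lambda x: math.dist(x, pos_c), reverse=True)
    negatives, remaining = ranked[:n_seed], ranked[n_seed:]
    for _ in range(max_rounds):
        neg_c = centroid(negatives)
        # Move confidently negative points from the pool into the negative set.
        confident = [x for x in remaining if score(x, pos_c, neg_c) < -margin]
        if not confident:
            break
        negatives.extend(confident)
        remaining = [x for x in remaining if score(x, pos_c, neg_c) >= -margin]
    neg_c = centroid(negatives)
    candidates = [x for x in remaining if score(x, pos_c, neg_c) > 0]
    return candidates, negatives

# Hypothetical 2-D feature vectors; real inputs would be sequence-derived features.
positives = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.2, -0.5)]
hidden = [(0.1, 0.1), (-0.2, -0.1), (0.4, 0.3)]            # unlabeled true positives
background = [(6 + 0.3 * i, 6 - 0.2 * i) for i in range(10)]
candidates, negatives = positive_only_learn(positives, hidden + background)
print(len(candidates), "candidates;", len(negatives), "inferred negatives")
```

In this toy setting the negative set is inferred from the unlabeled pool rather than supplied by hand, which is the property the abstract emphasizes. Replacing the centroid scorer with an SVM would change the decision boundary but not the overall loop.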
The PSoL method is assessed by 5-fold cross-validation experiments, which show that PSoL achieves about 80% accuracy in recovering known ncRNAs. We compared PSoL predictions with five previously published results. The PSoL method has the highest percentage of predictions overlapping with those from other methods.
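The evaluation described above can be sketched as a cross-validated recovery rate: each fold of known positives is hidden in the unlabeled pool, the model is fit on the remaining positives, and the fraction of hidden positives that are predicted is counted. The code below is a minimal sketch under stated assumptions: `toy_fit_predict` is a hypothetical nearest-neighbour stand-in rather than an SVM, the fold split by striding is an arbitrary choice, and the data are invented for the example.

```python
import math

def k_folds(items, k=5):
    """Split items into k near-equal folds by striding (deterministic)."""
    return [items[i::k] for i in range(k)]

def toy_fit_predict(train_pos, pool, radius=0.5):
    """Stand-in classifier (not an SVM): flag pool points that lie within
    `radius` of at least one training positive."""
    return [x for x in pool
            if min(math.dist(x, p) for p in train_pos) <= radius]

def cv_recovery(positives, background, fit_predict, k=5):
    """Fraction of known positives recovered when held out, over k folds."""
    recovered = 0
    for held_out in k_folds(positives, k):
        train = [p for p in positives if p not in held_out]
        # Held-out positives are mixed into the unlabeled pool before prediction.
        predicted = fit_predict(train, held_out + background)
        recovered += sum(1 for p in held_out if p in predicted)
    return recovered / len(positives)

# Hypothetical data: nine clustered positives plus one atypical positive.
positives = [(0.1 * i, 0.05 * i) for i in range(9)] + [(5.0, 5.0)]
background = [(6.0, 6.0), (7.0, 5.0), (8.0, 4.0)]
rate = cv_recovery(positives, background, toy_fit_predict, k=5)
print(f"recovery rate: {rate:.2f}")
```

A recovery rate computed this way measures only how many known positives are found again; it says nothing about false positives, which is one reason overlap with independently published predictions is a useful complementary check.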