Eppenhof Erik J J, Peña-Castillo Lourdes
Department of Artificial Intelligence, Radboud University Nijmegen, Nijmegen, Netherlands.
Department of Biology, Memorial University of Newfoundland, St. John's, Canada.
PeerJ. 2019 Jan 24;7:e6304. doi: 10.7717/peerj.6304. eCollection 2019.
Bacterial small RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features, including the free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distances to the closest open reading frames (ORFs). To calculate these features automatically, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The best-performing model obtained a median precision of 100% at 10% recall and 64% at 40% recall across all five bacterial species, and it outperformed previously published approaches on two benchmark datasets in terms of precision and recall. Our results indicate that even though sRNA sequence conservation across bacterial species is limited, there are intrinsic features in the genomic context of sRNAs that are conserved across taxa. We show that machine learning approaches can use these features to learn a species-independent model to prioritize bona fide bacterial sRNAs.
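The headline metric in the abstract, median precision at a fixed recall level across species, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `precision_at_recall`, the toy labels and scores, and the convention of taking the shortest top-ranked prefix that reaches the target recall are all assumptions made for illustration.

```python
from statistics import median


def precision_at_recall(labels, scores, target_recall):
    """Precision of the shortest top-ranked prefix whose recall reaches target_recall.

    labels: 1 for a bona fide sRNA, 0 for a random genomic sequence.
    scores: classifier scores, higher meaning more sRNA-like.
    Tied scores keep their input order (an assumption of this sketch).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    # Rank sequences from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / k
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical toy data, NOT taken from the paper: (labels, scores) per species.
toy = {
    "species_A": ([1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
                  [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]),
    "species_B": ([0, 1, 1, 0, 1, 0],
                  [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]),
    "species_C": ([1, 0, 0, 1],
                  [0.9, 0.7, 0.4, 0.2]),
}

# Precision at 40% recall for each species, then the median across species.
per_species = {
    name: precision_at_recall(labels, scores, 0.4)
    for name, (labels, scores) in toy.items()
}
median_precision = median(per_species.values())
```

Reporting the median rather than the mean keeps a single poorly ranked species from dominating the summary, which is one plausible reason to summarize per-species results this way.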