Jha Tony, Mendel Jovinna, Cho Hyuk, Choudhary Madhusudan
Department of Mathematics, University of California, Berkeley, Berkeley, CA, USA.
Department of Biological Sciences, Sam Houston State University, Huntsville, TX, USA.
Bioinform Biol Insights. 2022 Aug 18;16:11779322221118335. doi: 10.1177/11779322221118335. eCollection 2022.
Small ribonucleic acid (sRNA) sequences are 50-500 nucleotide long, noncoding RNA (ncRNA) sequences that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism's genome is essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which the minority positive class (sRNAs) must be distinguished from the majority negative class, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella Typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with a conformity test for Benford's law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than with either an individual feature or a single feature group in terms of Area Under the Precision-Recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without tuned hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings.
As future work, we plan to apply XGBoost to a larger set of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models.
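The role of AUPR under varying imbalance ratios can be illustrated with a minimal sketch. It is not the study's pipeline: the scores are synthetic (positives drawn from N(1, 1), negatives from N(0, 1)), AUPR is approximated by average precision, and the helper names `average_precision`, `roc_auc`, and `simulate` are hypothetical. Only the Python standard library is used.

```python
import random


def average_precision(labels, scores):
    """Approximate AUPR as average precision (mean precision at each positive hit)."""
    # Rank instances from highest to lowest score (ties keep input order).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive instance is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(n_pos, neg_per_pos, rng):
    """Synthetic classifier scores (an assumption, not real sRNA data)."""
    labels = [1] * n_pos + [0] * (n_pos * neg_per_pos)
    scores = [rng.gauss(1.0, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]
    return labels, scores


rng = random.Random(0)  # fixed seed for reproducibility
results = {}
for neg_per_pos in (1, 5, 10):
    labels, scores = simulate(200, neg_per_pos, rng)
    results[neg_per_pos] = (
        average_precision(labels, scores),
        roc_auc(labels, scores),
    )
    aupr, auc = results[neg_per_pos]
    print(f"1:{neg_per_pos}  AUPR={aupr:.3f}  ROC-AUC={auc:.3f}")
```

In this toy setup the scorer's quality is identical at every ratio, so ROC AUC stays roughly constant, while AUPR falls as negatives are added because each extra negative can only lower precision. This sensitivity to the imbalance ratio is the property that makes AUPR informative when positives are rare.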