An ensemble strategy for piRNA identification through hybrid moment-based feature modeling.

Author information

Rasheed Mansoor Ahmed, Alkhalifah Tamim, Alturise Fahad, Khan Yaser Daanial

Affiliations

School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.

Department of Computer Engineering, College of Computer, Buraydah, Saudi Arabia.

Publication information

Sci Rep. 2025 Aug 18;15(1):30157. doi: 10.1038/s41598-025-14194-7.

DOI:10.1038/s41598-025-14194-7

PMID:40820010

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12358548/

Abstract

This study aims to improve the accuracy of predicting transposon-derived piRNAs by developing a novel computational method named TranspoPred. TranspoPred uses positional, frequency, and moment-based features extracted from RNA sequences. By integrating multiple deep learning networks, the objective is to create a robust tool for predicting transposon-derived piRNAs, thereby contributing to a deeper understanding of their biological functions and regulatory mechanisms. Piwi-interacting RNAs (piRNAs) are currently considered the most diverse and abundant class of small non-coding RNA molecules. Accurate detection of transposon-associated piRNA tags can considerably advance the study of small ncRNAs and support the understanding of gametogenesis. First, a number of moments were used to convert the primary sequences into feature vectors. Bagging-, boosting-, and stacking-based ensemble classification approaches were then employed. The bagging approach used Random Forest (RF), Extra Trees (ET), and Decision Tree classifiers. The boosting approach used XGBoost (XGB), AdaBoost, and Gradient Boost. For stacking, the base learners were k-Nearest Neighbor (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree, with a Neural Network (NN) serving as the meta-learner. The models were evaluated through 2 × 5-fold cross-validation, 10-fold cross-validation, and independent testing on datasets from three species: human, mouse, and Drosophila. The evaluation metrics were Accuracy (ACC), Specificity (SP), Sensitivity (SN), Matthews Correlation Coefficient (MCC), and the F1 measure. The ensemble methods outperformed the others in almost all testing scenarios. Notably, stacking achieved perfect scores for accuracy, specificity, sensitivity, and MCC in independent set testing on the human and Drosophila datasets, and nearly perfect scores on the mouse dataset. Independent set testing across species assesses the generalizability and adaptability of the model to diverse data samples. The proposed method, TranspoPred, achieved excellent results on diverse datasets for human, mouse, and Drosophila, and outperformed other state-of-the-art techniques for predicting transposon-derived piRNAs. The proposed approaches show great potential for improving the accuracy of piRNA prediction and for supporting future in-silico identification of piRNAs by the research community. The source code and datasets used in this study are available at https://github.com/MansoorAhmadRasheed/piRNA-codes-and-result .
