Peng Lihong, Yuan Ruya, Shen Ling, Gao Pengfei, Zhou Liqian
School of Computer Science, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China.
College of Life Sciences and Chemistry, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China.
BioData Min. 2021 Dec 3;14(1):50. doi: 10.1186/s13040-021-00277-4.
Long noncoding RNAs (lncRNAs) are closely linked to various biological processes. Identifying interacting lncRNA-protein pairs helps to elucidate the functions and mechanisms of lncRNAs. Wet-lab experiments are costly and time-consuming. Most computational methods fail to account for the imbalanced nature of lncRNA-protein interaction (LPI) data. More importantly, they were evaluated on a single dataset, which introduces prediction bias.
In this study, we develop an Ensemble framework (LPI-EnEDT) with Extra Tree and Decision Tree classifiers to classify imbalanced LPI data. First, five LPI datasets are compiled. Second, lncRNAs and proteins are characterized separately using Pyfeat and BioTriangle, respectively, and the two feature sets are concatenated into a single vector that represents each lncRNA-protein pair. Finally, an ensemble framework with extra tree and decision tree classifiers is developed to classify unlabeled lncRNA-protein pairs. Comparative experiments demonstrate that LPI-EnEDT outperforms four classical LPI prediction methods (LPI-BLS, LPI-CatBoost, LPI-SKF, and PLIPCOM) under cross validations on lncRNAs, proteins, and LPIs. The average AUC values on the five datasets are 0.8480, 0.7078, and 0.9066 under the three cross validations, respectively. The average AUPRs are 0.8175, 0.7265, and 0.8882, respectively. Case analyses suggest that there are underlying associations between HOTTIP and Q9Y6M1, and between NRON and Q15717.
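The pair representation and the three cross-validation settings can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that cross validation on lncRNAs (or proteins) holds out every pair involving a given lncRNA (or protein) together, and that cross validation on LPIs holds out individual pairs at random. The names `pair_vector` and `make_folds`, the toy identifiers, and the fold count are hypothetical.

```python
import random


def pair_vector(lnc_feat, prot_feat):
    """Concatenate lncRNA and protein features into one pair vector."""
    return list(lnc_feat) + list(prot_feat)


def make_folds(pairs, mode, k=5, seed=0):
    """Split (lncRNA, protein) pairs into k test folds.

    mode="lncrna":  all pairs of a given lncRNA share one fold
    mode="protein": all pairs of a given protein share one fold
    mode="pair":    individual pairs are assigned independently
    (These definitions are assumptions, not taken from the paper.)
    """
    if mode == "lncrna":
        key = lambda p: p[0]
    elif mode == "protein":
        key = lambda p: p[1]
    elif mode == "pair":
        key = lambda p: p
    else:
        raise ValueError(f"unknown mode: {mode}")

    # Shuffle the grouping units, then deal them round-robin into folds.
    groups = sorted({key(p) for p in pairs})
    random.Random(seed).shuffle(groups)
    fold_of = {g: i % k for i, g in enumerate(groups)}

    folds = [[] for _ in range(k)]
    for p in pairs:
        folds[fold_of[key(p)]].append(p)
    return folds


# Toy data: 6 hypothetical lncRNAs x 4 hypothetical proteins = 24 candidate pairs.
lncrnas = [f"L{i}" for i in range(1, 7)]
proteins = [f"P{j}" for j in range(1, 5)]
pairs = [(l, p) for l in lncrnas for p in proteins]

folds_lnc = make_folds(pairs, "lncrna", k=3)
folds_prot = make_folds(pairs, "protein", k=2)
folds_pair = make_folds(pairs, "pair", k=4)
```

Under these assumptions, the lncRNA and protein settings test a model on entities with no interactions in the training folds, which corresponds to the "new lncRNA (or protein)" scenario, while the pair setting tests it on unseen pairs of known entities.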
By fusing diverse biological features of lncRNAs and proteins and exploiting an ensemble learning model with extra tree and decision tree classifiers, this work focuses on classifying imbalanced LPI data and on inferring interaction information for a new lncRNA (or protein).