Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
PLoS Comput Biol. 2024 Aug 26;20(8):e1012385. doi: 10.1371/journal.pcbi.1012385. eCollection 2024 Aug.
MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally. In animals, this regulation is achieved via base-pairing with partially complementary sequences located mainly in the 3' UTR of messenger RNAs (mRNAs). Computational approaches that predict miRNA-target interactions (MTIs) help narrow down potential targets for experimental validation. The availability of new high-throughput datasets of direct MTIs has led to the development of machine learning (ML) based methods for MTI prediction. To train an ML algorithm, it is beneficial to provide entries from all class labels (i.e., positive and negative). Currently, no high-throughput assays exist for capturing negative examples. Therefore, current ML approaches must rely on negative examples that are either artificially generated or inferred from experimentally identified positive miRNA-target datasets. Moreover, the lack of uniform standards for generating such data leads to biased results and hampers comparisons between studies. In this comprehensive study, we collected methods for generating negative data for animal miRNA-target interactions and investigated their impact on the classification of true human MTIs. Our study relies on training ML models on a fixed positive dataset in combination with different negative datasets and evaluating their intra- and cross-dataset performance. As a result, we were able to examine each method independently and evaluate the sensitivity of ML models to the methodology used for negative data generation. To gain a deeper understanding of the performance results, we analyzed unique features that distinguish between datasets. In addition, we examined whether one-class classification models, which use only positive interactions for training, are suitable for the task of MTI classification.
We demonstrate the importance of negative data in MTI classification, analyze specific methodological characteristics that differentiate negative datasets, and highlight the challenge of ML models generalizing interaction rules from training to testing sets derived from different approaches. This study provides valuable insights into the computational prediction of MTIs that can be further used to establish standards in the field.
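The intra- and cross-dataset evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the 2-D feature vectors are synthetic, the negative-set names `method_A` and `method_B` are hypothetical placeholders, and a nearest-centroid rule stands in for whatever ML models the study actually used. The sketch only shows the shape of the experiment: one fixed positive set, several negative sets, and a model trained on each negative set and tested on every one of them.

```python
import random
from statistics import mean


def make_points(center, n, rng, spread=1.0):
    """Draw n synthetic 2-D feature vectors (toy stand-in for MTI features)."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]


def split(items, frac=0.7):
    """Split a list into train and test parts."""
    cut = int(len(items) * frac)
    return items[:cut], items[cut:]


def centroid(points):
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def fit(pos, neg):
    """Nearest-centroid 'model': one centroid per class (an assumption, not the paper's model)."""
    return centroid(pos), centroid(neg)


def predict(model, x):
    c_pos, c_neg = model
    d_pos = (x[0] - c_pos[0]) ** 2 + (x[1] - c_pos[1]) ** 2
    d_neg = (x[0] - c_neg[0]) ** 2 + (x[1] - c_neg[1]) ** 2
    return 1 if d_pos <= d_neg else 0


def accuracy(model, pos, neg):
    correct = sum(predict(model, x) == 1 for x in pos)
    correct += sum(predict(model, x) == 0 for x in neg)
    return correct / (len(pos) + len(neg))


def cross_dataset_matrix(seed=0):
    """Train on each negative set, test on every negative set, positives held fixed."""
    rng = random.Random(seed)
    pos_train, pos_test = split(make_points((0.0, 0.0), 200, rng))
    # Hypothetical negative-generation methods, each biased in a different direction.
    negative_sets = {
        "method_A": make_points((3.0, 0.0), 200, rng),
        "method_B": make_points((0.0, 3.0), 200, rng),
    }
    splits = {name: split(points) for name, points in negative_sets.items()}
    results = {}
    for train_name, (neg_train, _) in splits.items():
        model = fit(pos_train, neg_train)
        for test_name, (_, neg_test) in splits.items():
            results[(train_name, test_name)] = accuracy(model, pos_test, neg_test)
    return results


if __name__ == "__main__":
    for (train_name, test_name), acc in sorted(cross_dataset_matrix().items()):
        print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

In this toy setting the diagonal entries (same method for training and testing) come out higher than the off-diagonal ones, because each synthetic negative set is deliberately biased in a different direction. This mirrors, in a simplified way, the generalization challenge the abstract highlights.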