

Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning.

Affiliation

Research Institute for Signals, Systems and Computational Intelligence sinc(i), FICH/UNL-CONICET, Ciudad Universitaria, Santa Fe, Argentina.

Publication information

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa184.

Abstract

MOTIVATION

The genome-wide discovery of microRNAs (miRNAs) involves identifying, among all possible sequences in a complete genome, those with the highest chance of being a novel miRNA precursor (pre-miRNA). Known pre-miRNAs are usually few compared with the millions of candidates that have to be analyzed. This is of particular interest in non-model species and recently sequenced genomes, where the challenge is to find potential pre-miRNAs from the sequenced genome alone. The task is infeasible without the help of computational methods, such as deep learning. However, it is still very difficult to find an accurate predictor with a low false positive rate in this genome-wide context. Although many tools are available, they have not been tested under realistic conditions, with sequences from whole genomes and the high class imbalance inherent to such data.

RESULTS

In this work, we review six recent methods for tackling this problem with machine learning. We compare the models on five genome-wide datasets: Arabidopsis thaliana, Caenorhabditis elegans, Anopheles gambiae, Drosophila melanogaster and Homo sapiens. The models were designed for the pre-miRNA prediction task, in which the class of interest (the known pre-miRNAs) is significantly underrepresented with respect to a very large number of unlabeled samples. For the smaller genomes and smaller imbalances, all methods perform similarly. However, for larger datasets such as the H. sapiens genome, deep learning approaches using raw information from the sequences reached the best scores, achieving low numbers of false positives.
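
To illustrate why class imbalance makes false positives the central difficulty in a genome-wide setting, the following is a minimal arithmetic sketch in Python. It is not taken from the paper: the function name `expected_precision` and all numbers (candidate counts, sensitivity, specificity) are hypothetical values chosen only for illustration.

```python
def expected_precision(n_pos, n_neg, sensitivity, specificity):
    """Expected precision of a classifier applied to n_pos true pre-miRNAs
    and n_neg non-pre-miRNA candidates (all inputs are hypothetical)."""
    true_positives = sensitivity * n_pos
    false_positives = (1.0 - specificity) * n_neg
    return true_positives / (true_positives + false_positives)


# Hypothetical classifier: 95% sensitivity, 99% specificity.
# Balanced benchmark: as many negatives as positives.
balanced = expected_precision(1_000, 1_000, 0.95, 0.99)

# Genome-wide setting: the same positives among a million candidates.
genome_wide = expected_precision(1_000, 1_000_000, 0.95, 0.99)

print(f"precision, balanced benchmark: {balanced:.3f}")
print(f"precision, genome-wide:        {genome_wide:.3f}")
```

Under these assumed numbers, the same classifier goes from a precision of about 0.99 on the balanced benchmark to below 0.09 on the imbalanced one, because the false positives scale with the number of negative candidates while the true positives do not.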

AVAILABILITY

The source code to reproduce these results is available at http://sourceforge.net/projects/sourcesinc/files/gwmirna. Additionally, the datasets are freely available at https://sourceforge.net/projects/sourcesinc/files/mirdata.

