A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation.

Author information

McDermaid Adam, Chen Xin, Zhang Yiran, Wang Cankun, Gu Shaopeng, Xie Juan, Ma Qin

Affiliations

Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, United States.

Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, United States.

Publication information

Front Genet. 2018 Aug 14;9:313. doi: 10.3389/fgene.2018.00313. eCollection 2018.

Abstract

One of the main benefits of modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation compared with previous generations of expression data, such as microarrays. However, numerous issues can cause an RNA-Seq read to map to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). The impact of these MMRs is reflected in gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment. Current analysis pipelines lack the tools to effectively test the reliability of gene expression estimates and are thus incapable of ensuring the validity of downstream analyses. Our investigation of 95 RNA-Seq datasets from seven plant and animal species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs. Here we present a machine learning-based tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene's expression level derived from an RNA-Seq dataset. The underlying algorithm is based on extracted genomic and transcriptomic features, which are combined using elastic-net regularization and mixture model fitting to provide a clearer picture of mapping uncertainty for each gene. GeneQC allows researchers to identify reliable expression estimates and to conduct further analysis on the gene expression that is of sufficient quality. The tool also enables researchers to investigate continued re-alignment methods to determine more accurate expression estimates for genes with low reliability. Application of GeneQC reveals a high level of mapping uncertainty in plant samples and limited, severe mapping uncertainty in animal samples. GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.
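To make the notion of a multiple-mapping read concrete, the following minimal Python sketch computes, for each gene, the fraction of assigned reads that map to more than one location. It is an illustration only and not GeneQC's algorithm: the function name `mmr_fraction_per_gene`, the `(gene_id, n_hits)` input layout, and the toy data are assumptions made for this example. The abstract does not specify which features GeneQC extracts or how they are weighted.

```python
from collections import defaultdict


def mmr_fraction_per_gene(reads):
    """Return {gene_id: fraction of that gene's reads that are multi-mapped}.

    `reads` is an iterable of (gene_id, n_hits) pairs, where n_hits is the
    number of equally scored alignment locations reported for the read.
    This input layout is a hypothetical one chosen for illustration.
    """
    total = defaultdict(int)  # all reads assigned to each gene
    multi = defaultdict(int)  # reads with more than one alignment location
    for gene_id, n_hits in reads:
        if n_hits < 1:
            raise ValueError("n_hits must be at least 1")
        total[gene_id] += 1
        if n_hits > 1:
            multi[gene_id] += 1
    return {gene: multi[gene] / total[gene] for gene in total}


# Toy data (invented): geneA has one multi-mapped read out of four,
# geneB has only multi-mapped reads.
toy_reads = [
    ("geneA", 1), ("geneA", 1), ("geneA", 3), ("geneA", 1),
    ("geneB", 2), ("geneB", 4),
]
fractions = mmr_fraction_per_gene(toy_reads)
print(fractions)
```

A per-gene fraction like this is only one crude indicator of mapping uncertainty. According to the abstract, GeneQC combines several genomic and transcriptomic features through elastic-net regularization and mixture model fitting, which this sketch does not attempt to reproduce.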
