A graph-based algorithm for RNA-seq data normalization.

Author affiliations

School of Computing, University of Utah, Salt Lake City, Utah, United States of America.

Department of Medicinal Chemistry, University of Utah, Salt Lake City, Utah, United States of America.

Publication information

PLoS One. 2020 Jan 24;15(1):e0227760. doi: 10.1371/journal.pone.0227760. eCollection 2020.

DOI:10.1371/journal.pone.0227760

PMID:31978105

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6980396/

Abstract

The use of RNA-sequencing has garnered much attention in recent years for characterizing and understanding various biological systems. However, it remains a major challenge to gain insights from a large number of RNA-seq experiments collectively, due to the normalization problem. Normalization has been challenging due to an inherent circularity, requiring that RNA-seq data be normalized before any pattern of differential (or non-differential) expression can be ascertained; meanwhile, the prior knowledge of non-differential transcripts is crucial to the normalization process. Some methods have successfully overcome this problem by the assumption that most transcripts are not differentially expressed. However, when RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that does not rely on this assumption, nor prior knowledge about the reference transcripts. This algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Application of this algorithm on our synthesized validation data showed that it could recover the reference transcripts with high precision, thus resulting in high-quality normalization. On a realistic data set from the ENCODE project, this algorithm gave good results and could finish in a reasonable time. These preliminary results imply that we may be able to break the long persisting circularity problem in RNA-seq normalization.
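The idea summarized in the abstract (build a graph from correlations among transcripts, then take a densely connected set of vertices as references) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Pearson correlation on log-transformed values, a hypothetical edge threshold `corr_threshold`, and a standard greedy peeling heuristic for the densest subgraph as a stand-in for whatever search the paper actually uses. All function names and the simulated data are made up for illustration.

```python
import numpy as np


def find_reference_transcripts(counts, corr_threshold=0.9):
    """Return indices of a densely connected vertex set in a correlation graph.

    counts: array of shape (n_transcripts, n_samples) with raw expression values.
    corr_threshold is an illustrative choice, not a value from the paper.
    """
    log_counts = np.log1p(counts)
    # Transcript-by-transcript Pearson correlation; NaN (constant rows) becomes 0.
    corr = np.nan_to_num(np.corrcoef(log_counts))
    adj = corr > corr_threshold
    np.fill_diagonal(adj, False)

    # Greedy peeling: repeatedly drop the lowest-degree vertex and remember
    # the vertex set with the highest edges-per-vertex density seen so far.
    alive = np.ones(adj.shape[0], dtype=bool)
    degree = adj.sum(axis=1).astype(float)
    best_density, best_set = -1.0, alive.copy()
    while alive.sum() > 1:
        density = degree[alive].sum() / 2.0 / alive.sum()
        if density > best_density:
            best_density, best_set = density, alive.copy()
        candidates = np.flatnonzero(alive)
        v = candidates[np.argmin(degree[candidates])]
        alive[v] = False
        degree[adj[v] & alive] -= 1.0
    return np.flatnonzero(best_set)


def size_factors(counts, ref_idx):
    """Per-sample scaling factors from the summed reference transcripts."""
    totals = counts[ref_idx].sum(axis=0)
    return totals / totals.mean()


def simulate(n_ref=40, n_diff=60, n_samples=30, seed=0):
    """Toy data: the first n_ref rows are non-differential, the rest vary freely."""
    rng = np.random.default_rng(seed)
    scale = np.exp(rng.normal(0.0, 0.5, n_samples))        # per-sample scaling
    base = rng.uniform(200.0, 2000.0, (n_ref + n_diff, 1))  # per-transcript level
    log_change = np.vstack([
        rng.normal(0.0, 0.05, (n_ref, n_samples)),   # references: small noise
        rng.normal(0.0, 1.5, (n_diff, n_samples)),   # differential: large changes
    ])
    return base * scale * np.exp(log_change), scale


counts, true_scale = simulate()
ref_idx = find_reference_transcripts(counts)
factors = size_factors(counts, ref_idx)
normalized = counts / factors  # each sample (column) divided by its factor
```

In this toy setting the non-differential transcripts all track the per-sample scaling factor, so they correlate strongly with one another and form a dense cluster in the graph, while differentially expressed transcripts are mostly left unconnected. Real data would likely need a more careful choice of similarity measure and threshold.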
