Hu Yu, Gouru Anagha, Wang Kai
Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.
Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
NAR Genom Bioinform. 2023 Mar 3;5(1):lqad019. doi: 10.1093/nargab/lqad019. eCollection 2023 Mar.
Conventional gene expression quantification approaches, such as microarrays or quantitative PCR, produce estimates with similar variation for all genes. However, next-generation short-read or long-read sequencing uses read counts to estimate expression levels with much wider dynamic ranges. In addition to the accuracy of estimated isoform expression, efficiency, which measures the degree of estimation uncertainty, is also an important factor for downstream analysis. We present DELongSeq, which, instead of relying on read counts, employs the information matrix of the EM algorithm to quantify the uncertainty of isoform expression estimates and thereby improve estimation efficiency. DELongSeq uses a random-effect regression model for the analysis of differentially expressed (DE) isoforms, in which within-study variation represents the variable precision of isoform expression estimation and between-study variation represents variation in isoform expression levels across samples. More importantly, DELongSeq allows a 1 case versus 1 control comparison of differential expression, which has specific application scenarios in precision medicine (such as before versus after treatment, or tumor versus stromal tissues). Through extensive simulations and analysis of several RNA-Seq datasets, we show that the uncertainty quantification approach is computationally reliable and can improve the power of DE analysis of isoforms or genes. In summary, DELongSeq allows for efficient detection of differential isoform/gene expression from long-read RNA-Seq data.
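The random-effect idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DELongSeq implementation. It assumes each sample supplies an isoform expression estimate on a log scale together with a within-sample variance (for example, one derived from the inverse of an information matrix), and it substitutes the standard DerSimonian-Laird moment estimator for the between-sample variance. The function names `random_effects_mean` and `wald_test` and all input numbers are hypothetical.

```python
import math


def random_effects_mean(estimates, variances):
    """Pool per-sample estimates under y_i ~ N(mu, s_i^2 + tau^2).

    estimates: per-sample expression estimates (assumed log scale)
    variances: per-sample within-sample variances s_i^2 (assumed known)
    Returns (pooled mean, variance of the pooled mean, tau^2).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    if k < 2:
        # A single sample cannot inform the between-sample variance,
        # so only the within-sample uncertainty is used.
        tau2 = 0.0
    else:
        # DerSimonian-Laird moment estimator, truncated at zero.
        q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
        c = sw - sum(wi ** 2 for wi in w) / sw
        tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    mean = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sw_star
    return mean, 1.0 / sw_star, tau2


def wald_test(case, control):
    """Two-sided Wald z-test for a difference in pooled means.

    case, control: (estimates, variances) tuples, one per group.
    Returns (difference, z statistic, two-sided p-value).
    """
    m1, v1, _ = random_effects_mean(*case)
    m0, v0, _ = random_effects_mean(*control)
    diff = m1 - m0
    z = diff / math.sqrt(v1 + v0)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return diff, z, p


# 1 case versus 1 control: the test rests on within-sample variances only.
diff, z, p = wald_test(([5.0], [0.04]), ([4.0], [0.05]))
print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```

With one sample per group the between-sample variance is set to zero, so the comparison is driven entirely by the estimation uncertainty of each sample, which is what makes a 1 versus 1 design testable at all in this sketch.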