Department of Computer Science.
Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA.
Bioinformatics. 2017 Oct 1;33(19):3018-3027. doi: 10.1093/bioinformatics/btx357.
High-throughput mRNA sequencing (RNA-Seq) is a powerful tool for quantifying gene expression. Identification of transcript isoforms that are differentially expressed in different conditions, such as in patients and healthy subjects, can provide insights into the molecular basis of diseases. Current transcript quantification approaches, however, do not take advantage of the shared information in the biological replicates, potentially decreasing sensitivity and accuracy.
We present a novel hierarchical Bayesian model called Differentially Expressed Isoform detection from Multiple biological replicates (DEIsoM) for identifying differentially expressed (DE) isoforms from multiple biological replicates representing two conditions, e.g. multiple samples from healthy and diseased subjects. DEIsoM first estimates isoform expression within each condition by (1) capturing common patterns from sample replicates while allowing individual differences, and (2) modeling the uncertainty introduced by ambiguous read mapping in each replicate. Specifically, we introduce a Dirichlet prior distribution to capture the common expression pattern of replicates from the same condition, and treat the isoform expression of individual replicates as samples from this distribution. Ambiguous read mapping is modeled as a multinomial distribution, and ambiguous reads are assigned to the most probable isoform in each replicate. Additionally, DEIsoM couples efficient variational inference with a post-analysis method to improve the accuracy and speed of identification of DE isoforms over alternative methods. Application of DEIsoM to a hepatocellular carcinoma (HCC) dataset identifies biologically relevant DE isoforms. The relevance of these genes/isoforms to HCC is supported by principal component analysis (PCA), read coverage visualization, and the biological literature.
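The two modeling ideas described above (a shared Dirichlet prior across replicates and a multinomial treatment of ambiguous reads) can be illustrated with a minimal sketch. This is not the DEIsoM implementation: the function names, the concentration vector `alpha`, and the 0/1 compatibility matrix `compat` are hypothetical, and the sketch omits the per-read generation probabilities and the variational inference that the actual method uses.

```python
import numpy as np

def sample_replicate_proportions(alpha, n_replicates, rng):
    # Each row holds one replicate's isoform proportions, drawn from
    # the Dirichlet prior shared by all replicates of a condition.
    return rng.dirichlet(alpha, size=n_replicates)

def assign_reads(psi, compat):
    # psi: (K,) isoform proportions for a single replicate.
    # compat: (N, K) 0/1 matrix; compat[n, k] == 1 if read n maps to isoform k.
    # Assumes every read is compatible with at least one isoform.
    weights = compat * psi                                # unnormalized multinomial weights
    probs = weights / weights.sum(axis=1, keepdims=True)  # normalize per read
    return probs, probs.argmax(axis=1)                    # most probable isoform per read

rng = np.random.default_rng(0)
alpha = np.array([8.0, 2.0, 1.0])   # hypothetical prior for a gene with 3 isoforms
psi = sample_replicate_proportions(alpha, n_replicates=4, rng=rng)

# Hypothetical toy reads: rows are reads, columns are isoforms.
compat = np.array([
    [1, 1, 0],   # ambiguous between isoforms 0 and 1
    [0, 1, 1],   # ambiguous between isoforms 1 and 2
    [1, 0, 0],   # maps uniquely to isoform 0
    [1, 1, 1],   # compatible with all isoforms
])
probs, assigned = assign_reads(psi[0], compat)
```

In this toy setup, a larger total `alpha.sum()` keeps the replicates closer to the shared pattern, while a smaller one permits more individual variation between replicates.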
The software is available at https://github.com/hao-peng/DEIsoM.
Supplementary data are available at Bioinformatics online.