Feng Jianxing, Li Wei, Jiang Tao
School of Life Sciences and Technology, Tongji University, China.
J Comput Biol. 2011 Mar;18(3):305-21. doi: 10.1089/cmb.2010.0243.
Because of alternative splicing in eukaryotic species, identifying mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time-consuming and costly. The emerging RNA-Seq technology offers a potentially effective way to address this problem. Although many studies have confirmed the advantages of RNA-Seq over traditional methods in transcriptome analysis, inferring isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and to infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS), and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can accurately calculate isoform expression levels when all the isoforms are known, and can also infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and simulated reads demonstrate that IsoInfer calculates isoform expression levels with an accuracy comparable to the state-of-the-art statistical method while running 60 times faster. Moreover, our tests on both simulated and real reads show that it achieves good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms with high expression levels. The software is publicly available for free at http://www.cs.ucr.edu/~jianxing/IsoInfer.html.
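The abstract does not spell out the convex quadratic program, so the following is only a minimal sketch of the general idea, not the IsoInfer formulation itself. It assumes that the expected number of reads on an exon segment equals the segment length times the summed expression levels of the isoforms containing it, and it fits non-negative expression levels by least squares using projected gradient descent. The function name `estimate_expression`, its parameters, and the toy data are all hypothetical; TSS/PAS constraints and any statistical weighting of the residuals are omitted.

```python
def estimate_expression(indicator, seg_lengths, read_counts, iters=5000):
    """Fit non-negative isoform expression levels by least squares.

    indicator[i][j] is 1 if (hypothetical) isoform j contains exon segment i.
    seg_lengths[i] is the length of segment i in bases.
    read_counts[i] is the observed number of reads on segment i.
    Returns expression levels in reads per base (an assumed unit).
    """
    n_seg = len(indicator)
    n_iso = len(indicator[0])

    # Assumed linear model: expected reads on segment i is
    # seg_lengths[i] * sum_j indicator[i][j] * x[j].
    A = [[indicator[i][j] * seg_lengths[i] for j in range(n_iso)]
         for i in range(n_seg)]

    # The gradient of ||A x - d||^2 is Lipschitz with constant 2 * ||A||_2^2,
    # and the squared Frobenius norm is an upper bound on ||A||_2^2.
    step = 1.0 / (2.0 * sum(a * a for row in A for a in row))

    x = [0.0] * n_iso
    for _ in range(iters):
        residual = [sum(A[i][j] * x[j] for j in range(n_iso)) - read_counts[i]
                    for i in range(n_seg)]
        grad = [2.0 * sum(A[i][j] * residual[i] for i in range(n_seg))
                for j in range(n_iso)]
        # Gradient step followed by projection onto the constraint x >= 0.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n_iso)]
    return x


# Toy gene with three exon segments and two isoforms:
# isoform 0 uses all three segments, isoform 1 skips the middle one.
indicator = [[1, 1], [1, 0], [1, 1]]
seg_lengths = [100, 50, 100]
read_counts = [300, 100, 300]  # noise-free counts for levels 2.0 and 1.0
levels = estimate_expression(indicator, seg_lengths, read_counts)
print([round(v, 3) for v in levels])
```

In this toy case the counts are generated without noise, so the fit recovers the levels used to build them. Real read counts are noisy, and a dedicated quadratic programming solver would normally replace the simple iteration used here.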