Li Wei, Feng Jianxing, Jiang Tao
Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA 92507, USA.
J Comput Biol. 2011 Nov;18(11):1693-707. doi: 10.1089/cmb.2011.0171. Epub 2011 Sep 27.
Second-generation sequencing technology has revolutionized many biology-related research fields and poses a range of computational biology challenges. One of them is transcriptome assembly based on RNA-Seq data, which aims to reconstruct all full-length mRNA transcripts simultaneously from millions of short reads. In this article, we consider three objectives in transcriptome assembly: the maximization of prediction accuracy, the minimization of interpretation, and the maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the expression levels estimated from the assembled transcripts be as close as possible to the observed ones for every expressed region of the genome. The second objective, the minimization of interpretation, follows the parsimony principle and seeks as few transcripts in the prediction as possible. The third objective, the maximization of completeness, requires that the maximum number of mapped reads (or "expressed segments" in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution. Based on these three objectives, we present IsoLasso, a new RNA-Seq based transcriptome assembly tool. IsoLasso is based on the well-known LASSO algorithm, a multivariate regression method designed to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. Experiments on simulated and real RNA-Seq datasets show that IsoLasso simultaneously achieves higher sensitivity and precision than state-of-the-art transcript assembly tools.
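The trade-off between the first two objectives can be illustrated with a small non-negative LASSO fit. The sketch below is not IsoLasso's implementation: the abstract does not give the exact quadratic program or its completeness constraints, so plain coordinate descent stands in for the solver, and completeness is only checked after the fit rather than enforced as a constraint. The function names `nonneg_lasso` and `is_complete`, the toy segment-by-transcript matrix, the observed levels, and the penalty value are all hypothetical choices made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def nonneg_lasso(A: Matrix, y: List[float], lam: float, sweeps: int = 5000) -> List[float]:
    """Minimize 0.5*||y - A x||^2 + lam*sum(x) subject to x >= 0.

    Rows of A are expressed segments, columns are candidate transcripts
    (A[i][j] = 1 if transcript j contains segment i). Solved by cyclic
    coordinate descent; an illustrative stand-in, not IsoLasso's solver.
    """
    n_seg, n_tx = len(A), len(A[0])
    x = [0.0] * n_tx
    residual = list(y)  # equals y - A x; x starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n_seg)) for j in range(n_tx)]
    for _ in range(sweeps):
        for j in range(n_tx):
            if col_sq[j] == 0.0:
                continue  # transcript covers no segment; leave at zero
            # Correlation of column j with the residual that ignores x[j].
            rho = sum(A[i][j] * (residual[i] + A[i][j] * x[j]) for i in range(n_seg))
            # Soft-threshold by lam, then clip at zero (expression is non-negative).
            new_xj = max(0.0, (rho - lam) / col_sq[j])
            delta = new_xj - x[j]
            if delta != 0.0:
                for i in range(n_seg):
                    residual[i] -= A[i][j] * delta
                x[j] = new_xj
    return x


def is_complete(A: Matrix, y: List[float], x: List[float], tol: float = 1e-9) -> bool:
    """True if every expressed segment lies in some transcript with x[j] > tol."""
    return all(
        any(A[i][j] > 0 and x[j] > tol for j in range(len(x)))
        for i in range(len(y))
        if y[i] > 0
    )


# Toy gene with three segments and three candidate transcripts (hypothetical data).
# Columns: t1 = segments 1,2,3; t2 = segments 1,3; t3 = segments 1,2.
A = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
y = [10.0, 4.0, 10.0]  # observed expression level per segment

if __name__ == "__main__":
    x = nonneg_lasso(A, y, lam=0.1)
    print("estimated expression:", [round(v, 3) for v in x])
    print("complete:", is_complete(A, y, x))
```

On this toy input, the fit assigns positive expression to t1 and t2 (about 4.0 and 5.95) and drives the third candidate to zero, so the parsimony penalty removes a transcript that the data do not need, at the cost of slightly shrinking one estimate. A larger `lam` would prune more aggressively and could leave an expressed segment unexplained, which is the situation the completeness objective described above is meant to prevent.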