Bonizzoni Paola, Mauri Giancarlo, Pesole Graziano, Picardi Ernesto, Pirola Yuri, Rizzi Raffaella
Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milano, Italy.
J Comput Biol. 2009 Jan;16(1):43-66. doi: 10.1089/cmb.2008.0028.
Alternative splicing (AS) is currently considered as one of the main mechanisms able to explain the huge gap between the number of predicted genes and the high complexity of the proteome in humans. The rapid growth of Expressed Sequence Tag (EST) data has encouraged the development of computational methods to predict alternative splicing from the analysis of EST alignment to genome sequences. EST data are also a valuable source to reconstruct the different transcript isoforms that derive from the same gene structure as a consequence of AS, as indeed EST sequences are obtained by fragmenting mRNAs from the same gene. The most recent studies on alternative splice sites detection have revealed that this topic is a quite challenging computational problem, far from a solution. The main computational issues related to the problem of detecting alternative splicing are investigated in this paper, and we analyze algorithmic solutions for this problem. We first formalize an optimization problem related to the prediction of constitutive and alternative splicing sites from EST sequences, the Minimum Exons ESTs Factorization problem (in short, MEF), and show that it is Np-hard, even for restricted instances. This problem leads us to define sets of spliced EST, that is, a set of EST factorized into their constitutive exons with respect to a gene. Then we investigate the computational problem of predicting transcript isoforms from spliced EST sequences. We propose a graph algorithm for the problem that is linear in the number of predicted isoforms and size of the graph. Finally, an experimental analysis of the method is performed to assess the reliability of the predictions.
可变剪接(AS)目前被认为是能够解释人类预测基因数量与蛋白质组高复杂性之间巨大差距的主要机制之一。表达序列标签(EST)数据的快速增长推动了从EST与基因组序列比对分析中预测可变剪接的计算方法的发展。EST数据也是重建由于可变剪接而源自相同基因结构的不同转录本异构体的宝贵来源,因为EST序列确实是通过将来自同一基因的mRNA片段化而获得的。关于可变剪接位点检测的最新研究表明,这个主题是一个极具挑战性的计算问题,远未得到解决。本文研究了与可变剪接检测问题相关的主要计算问题,并分析了针对该问题的算法解决方案。我们首先形式化一个与从EST序列预测组成型和可变剪接位点相关的优化问题,即最小外显子EST分解问题(简称为MEF),并表明即使对于受限实例,它也是NP难的。这个问题促使我们定义剪接EST集,即相对于一个基因分解为其组成型外显子的一组EST。然后我们研究从剪接EST序列预测转录本异构体的计算问题。我们针对该问题提出了一种图算法,该算法在预测异构体数量和图的大小方面是线性的。最后,对该方法进行了实验分析,以评估预测的可靠性。