Chuang Trees-Juen, Chen Feng-Chi, Chou Meng-Yuan
Genomics Research Center, Academia Sinica, Taipei, Taiwan.
Bioinformatics. 2004 Nov 22;20(17):3064-79. doi: 10.1093/bioinformatics/bth368. Epub 2004 Jun 24.
Alternative splicing (AS) is a mechanism for creating diversity among functional proteins. Increasing evidence indicates that a large proportion of genes have AS forms, so AS variants should be considered when analyzing gene structures.
A new cross-species gene identification and AS analysis system, PSEP, has been developed. The system is based on expressed sequence tag (EST)-to-genome and genome-to-genome comparisons and is implemented in two steps: sequence alignment, followed by a series of post-alignment processes that include progressive signal extraction and patching. For gene identification, these post-alignment processes serve as noise filters and enable PSEP to eliminate approximately 88% of potential overpredictions. When tested on three benchmark datasets (the ELN gene region, the HoxA cluster and the ROSETTA set), the overall accuracy of PSEP is better than or comparable to that of other well-known cross-species gene prediction programs, including the ROSETTA program, TWINSCAN, SGP-1/-2 and SLAM. In addition, 76.2% of multiple-exon genes in the ROSETTA dataset and 76.0% of those on human chromosome 20 are found to have AS forms. Approximately 23% of the 210 elementary alternatives identified in the ROSETTA dataset are not conserved between the human and mouse genomes, and none of the 210 transcripts is found in the RefSeq annotation. With its dual functions in cross-species conserved sequence analysis and AS analysis, PSEP is highly suitable for studying the evolution of AS patterns and for finding unidentified gene expression features.
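As a rough illustration of how a statistic such as "the fraction of multiple-exon genes with AS forms" could be derived from EST-to-genome alignments, the following minimal sketch compares the intron chains of transcripts mapped to the same gene. It is not PSEP's algorithm, which the abstract does not detail. The function names (`intron_set`, `spliced_differently`, `as_fraction`), the coordinate convention and the toy data are all assumptions made for illustration.

```python
from itertools import combinations

# Assumption: each transcript is a list of exons given as (start, end)
# genomic coordinates, e.g. as derived from an EST-to-genome alignment.

def intron_set(exons):
    """Return the introns implied by an exon list as (donor, acceptor) pairs."""
    exons = sorted(exons)
    return {(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])}

def spliced_differently(t1, t2):
    """True if two transcripts use different introns inside their shared span.

    Only introns lying entirely within the genomic region covered by both
    transcripts are compared, so a partial EST is not mistaken for a variant.
    """
    lo = max(min(s for s, _ in t1), min(s for s, _ in t2))
    hi = min(max(e for _, e in t1), max(e for _, e in t2))
    if lo >= hi:
        return False  # no genomic overlap, nothing to compare

    def introns_within(t):
        return {(d, a) for d, a in intron_set(t) if d >= lo and a <= hi}

    return introns_within(t1) != introns_within(t2)

def as_fraction(genes):
    """Fraction of multiple-exon genes with at least one pair of AS forms.

    `genes` maps a gene identifier to a list of transcripts (hypothetical input).
    """
    multi = {g: ts for g, ts in genes.items()
             if any(len(t) > 1 for t in ts)}
    if not multi:
        return 0.0
    n_as = sum(
        any(spliced_differently(a, b) for a, b in combinations(ts, 2))
        for ts in multi.values()
    )
    return n_as / len(multi)

# Toy data (invented for illustration, not taken from the ROSETTA dataset).
full = [(100, 200), (300, 400), (500, 600)]   # three-exon transcript
skip = [(100, 200), (500, 600)]               # middle exon skipped
part = [(150, 200), (300, 350)]               # partial EST, same splicing
toy_genes = {"g1": [full, skip], "g2": [full, part], "g3": [[(10, 90)]]}
```

In this toy input, `g1` counts as alternatively spliced because of the exon-skipping form, `g2` does not because the partial EST agrees with the full transcript wherever they overlap, and the single-exon `g3` is excluded from the denominator.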