Harhay Gregory P, Sonstegard Tad S, Keele John W, Heaton Michael P, Clawson Michael L, Snelling Warren M, Wiedmann Ralph T, Van Tassell Curt P, Smith Timothy P L
USDA-ARS, U.S. Meat Animal Research Center, Clay Center, NE 68901, USA.
BMC Genomics. 2005 Nov 23;6:166. doi: 10.1186/1471-2164-6-166.
Genome assemblies rely on transcript sequence to stitch together contigs, verify assembly of whole genome shotgun reads, and annotate genes. Functional genomics studies also rely on transcript sequence to create expression microarrays or interpret digital tag data produced by methods such as Serial Analysis of Gene Expression (SAGE). Transcript sequence can be predicted by reconstruction from overlapping expressed sequence tags (ESTs) obtained by single-pass sequencing of random cDNA clones, but these reconstructions are prone to errors caused by alternative splice forms, transcripts from gene families with related sequences, and expressed pseudogenes. These errors confound genome assembly and annotation. The most useful transcript sequences are derived by complete insert sequencing of clones containing the entire length, or at least the full protein coding sequence (CDS) portion, of the source mRNA. While the bovine genome sequencing initiative is nearing completion, there is currently a paucity of bovine full-CDS mRNA and protein sequence data to support bovine genome assembly and functional genomics studies. Consequently, the production of high-quality bovine full-CDS cDNA sequences will enhance the bovine genome assembly and functional studies of bovine genes and gene products. The goal of this investigation was to identify and characterize the full-CDS sequences of bovine transcripts from clones identified in non-full-length enriched cDNA libraries. In contrast to several recent full-length cDNA investigations, these full-CDS cDNAs were selected, sequenced, and annotated without the benefit of the target organism's genomic sequence: bovine EST sequences were compared to existing human mRNAs to identify likely full-CDS clones for full-length insert cDNA (FLIC) sequencing.
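The comparative selection idea can be illustrated with a minimal sketch. It assumes each 5' EST has already been aligned to a human mRNA whose CDS start is known, and it flags a clone as a likely full-CDS candidate when the alignment begins at or upstream of the human start codon. The `EstHit` record, the `likely_full_cds` helper, the `slack` parameter, and the example coordinates are all hypothetical; the abstract does not state the exact selection criteria used.

```python
from dataclasses import dataclass


@dataclass
class EstHit:
    """One 5' EST aligned to a human mRNA (hypothetical record).

    Coordinates are 0-based positions on the human mRNA.
    """
    clone_id: str
    mrna_align_start: int  # first human mRNA base covered by the EST alignment
    cds_start: int         # position of the A of the human start codon


def likely_full_cds(hit: EstHit, slack: int = 0) -> bool:
    """Assumed criterion: the EST alignment reaches the human start codon.

    `slack` tolerates alignments that begin a few bases past the start codon.
    """
    return hit.mrna_align_start <= hit.cds_start + slack


# Made-up example hits, for illustration only.
hits = [
    EstHit("cloneA", mrna_align_start=12, cds_start=85),   # covers the start codon
    EstHit("cloneB", mrna_align_start=240, cds_start=85),  # begins inside the CDS
]
candidates = [h.clone_id for h in hits if likely_full_cds(h)]
```

Here only `cloneA` is kept, since its alignment starts in the human 5' UTR; `cloneB` starts inside the CDS and would be treated as 5'-truncated under this assumed rule.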
The predicted bovine protein lengths, 5' UTR lengths, and Kozak consensus sequences from 954 bovine FLIC sequences (bFLICs; average length 1713 nt, representing 762 distinct loci) are all consistent with previously sequenced mammalian full-length transcripts.
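As an illustration of the three properties compared here, the following sketch derives a 5' UTR length, a predicted protein length, and two Kozak context positions from a single cDNA sequence. It assumes the CDS is the longest forward-strand ORF running from an ATG to an in-frame stop codon, and it checks only the purine at -3 and the G at +4, the two positions generally regarded as most important in the Kozak consensus. The function names and the toy sequence are hypothetical, and this is not the annotation procedure used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. Returns None if no ORF exists.
    """
    seq = seq.upper()
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if best is None or (pos + 3 - start) > (best[1] - best[0]):
                    best = (start, pos + 3)
                break
    return best


def summarize_cdna(seq: str):
    """Report 5' UTR length, protein length, and Kozak -3/+4 context."""
    seq = seq.upper()
    orf = longest_orf(seq)
    if orf is None:
        return None
    start, end = orf
    minus3 = seq[start - 3] if start >= 3 else None  # base 3 nt upstream of the A of ATG
    plus4 = seq[start + 3]                           # base right after the ATG
    return {
        "utr5_len": start,
        "protein_len": (end - start) // 3 - 1,       # codons minus the stop codon
        "kozak_minus3_purine": minus3 in ("A", "G"),
        "kozak_plus4_G": plus4 == "G",
    }


# Toy sequence: 9 nt 5' UTR, then ATG GCT TGC TAA.
summary = summarize_cdna("GGCGCCACCATGGCTTGCTAA")
```

For the toy sequence, `summary` reports a 9 nt 5' UTR and a 3-residue predicted protein with both Kozak positions matching. Distributions of these values across a clone set could then be compared with those of known mammalian full-length transcripts.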
In most cases, the bFLICs span the entire CDS of the genes, providing the basis for creating predicted bovine protein sequences to support proteomics and comparative evolutionary research as well as functional genomics and genome annotation. The results demonstrate the utility of the comparative approach in obtaining predicted protein sequences in other species.