Furey Terrence S, Diekhans Mark, Lu Yontao, Graves Tina A, Oddy Lachlan, Randall-Maher Jennifer, Hillier LaDeana W, Wilson Richard K, Haussler David
Center for Biomolecular Science and Engineering, Department of Computer Science, University of California, Santa Cruz, Santa Cruz, California 95064, USA.
Genome Res. 2004 Oct;14(10B):2034-40. doi: 10.1101/gr.2467904.
The NCBI Reference Sequence (RefSeq) project and the NIH Mammalian Gene Collection (MGC) together define a set of approximately 30,000 nonredundant human mRNA sequences with identified coding regions, representing 17,000 distinct loci. These high-quality mRNA sequences allow transcribed regions to be identified in the human genome sequence, and many researchers accept them as the correct representation of each defined gene sequence. Computational comparison of these mRNA sequences with the recently published, essentially finished human genome sequence reveals several thousand undocumented nonsynonymous substitution and frameshift discrepancies between the two resources. Additional analysis was undertaken to verify that the euchromatic human genome is sufficiently complete, containing nearly the whole mRNA collection, to support a comprehensive comparison. Many of the discrepancies will prove to be genuine polymorphisms in the human population, somatic cell genomic variants, or examples of RNA editing. The genome sequence variant has significant additional support from other mRNAs and ESTs almost four times more often than the mRNA variant does, suggesting that the genome sequence is more accurate. In approximately 15% of these cases, there is substantial support for both variants, suggestive of an undocumented polymorphism. An initial screen against a 24-individual genomic DNA diversity panel verified 60% of a small set of potential single nucleotide polymorphisms for which successful results could be obtained. We also find statistical evidence that a few of these discrepancies are due to RNA editing. Overall, these results suggest that the mRNA collections may contain a substantial number of errors. For current and future mRNA collections, it may be prudent to fully reconcile each discrepancy with the genome sequence, classifying each as a polymorphism, a site of RNA editing or somatic cell variation, or a genome sequence error.
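The support comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state how "significant additional support" was defined, so the `MIN_SUPPORT` threshold, the `Discrepancy` record, and the category labels below are all hypothetical. The sketch only shows the bookkeeping, in which each discrepancy is tallied according to whether independent mRNAs and ESTs back the genome variant, the mRNA variant, both, or neither.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical threshold: the abstract does not give the criterion used
# for "significant additional support", so this value is an assumption.
MIN_SUPPORT = 3


@dataclass(frozen=True)
class Discrepancy:
    """One mRNA-versus-genome mismatch (illustrative record, not from the paper)."""
    genome_support: int  # independent mRNAs/ESTs matching the genome variant
    mrna_support: int    # independent mRNAs/ESTs matching the mRNA variant


def classify(d: Discrepancy, min_support: int = MIN_SUPPORT) -> str:
    """Label a discrepancy by which variant the independent evidence backs."""
    genome_ok = d.genome_support >= min_support
    mrna_ok = d.mrna_support >= min_support
    if genome_ok and mrna_ok:
        return "both_supported"    # candidate undocumented polymorphism
    if genome_ok:
        return "genome_supported"  # the mRNA record may contain an error
    if mrna_ok:
        return "mrna_supported"    # the genome sequence may contain an error
    return "unresolved"            # too little evidence either way


def summarize(discrepancies, min_support: int = MIN_SUPPORT) -> Counter:
    """Count how many discrepancies fall into each category."""
    return Counter(classify(d, min_support) for d in discrepancies)
```

Under these assumptions, the ratio of `genome_supported` to `mrna_supported` counts corresponds to the "almost four times more often" comparison, and the share of `both_supported` cases corresponds to the roughly 15% figure. The result depends heavily on the threshold chosen.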