Analysis of human mRNAs with the reference genome sequence reveals potential errors, polymorphisms, and RNA editing.

Authors

Furey Terrence S, Diekhans Mark, Lu Yontao, Graves Tina A, Oddy Lachlan, Randall-Maher Jennifer, Hillier LaDeana W, Wilson Richard K, Haussler David

Affiliation

Center for Biomolecular Science and Engineering, Department of Computer Science, University of California, Santa Cruz, Santa Cruz, California 95064, USA.

Publication information

Genome Res. 2004 Oct;14(10B):2034-40. doi: 10.1101/gr.2467904.

DOI: 10.1101/gr.2467904

PMID: 15489323

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC528917/

Abstract

The NCBI Reference Sequence (RefSeq) project and the NIH Mammalian Gene Collection (MGC) together define a set of approximately 30,000 nonredundant human mRNA sequences with identified coding regions representing 17,000 distinct loci. These high-quality mRNA sequences allow for the identification of transcribed regions in the human genome sequence, and many researchers accept them as the correct representation of each defined gene sequence. Computational comparison of these mRNA sequences and the recently published essentially finished human genome sequence reveals several thousand undocumented nonsynonymous substitution and frameshift discrepancies between the two resources. Additional analysis is undertaken to verify that the euchromatic human genome is sufficiently complete, containing nearly the whole mRNA collection, thus allowing for a comprehensive analysis to be undertaken. Many of the discrepancies will prove to be genuine polymorphisms in the human population, somatic cell genomic variants, or examples of RNA editing. It is observed that the genome sequence variant has significant additional support from other mRNAs and ESTs almost four times more often than does the mRNA variant, suggesting that the genome sequence is more accurate. In approximately 15% of these cases, there is substantial support for both variants, suggestive of an undocumented polymorphism. An initial screening against a 24-individual genomic DNA diversity panel verified 60% of a small set of potential single nucleotide polymorphisms from which successful results could be obtained. We also find statistical evidence that a few of these discrepancies are due to RNA editing. Overall, these results suggest that the mRNA collections may contain a substantial number of errors. For current and future mRNA collections, it may be prudent to fully reconcile each genome sequence discrepancy, classifying each as a polymorphism, site of RNA editing or somatic cell variation, or genome sequence error.
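
The support-based reasoning in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Discrepancy` record, the `min_support` threshold of 2, the category labels, and the example counts are all hypothetical, chosen only to show how a discrepancy might be sorted by whether other mRNAs and ESTs back the genome variant, the mRNA variant, or both.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Discrepancy:
    """One mRNA-vs-genome mismatch with hypothetical support counts."""
    site: str
    genome_support: int  # other mRNAs/ESTs agreeing with the genome variant
    mrna_support: int    # other mRNAs/ESTs agreeing with the mRNA variant


def classify(d: Discrepancy, min_support: int = 2) -> str:
    """Label a discrepancy by which variant has enough independent support.

    min_support is an assumed cutoff, not a value taken from the paper.
    """
    genome_ok = d.genome_support >= min_support
    mrna_ok = d.mrna_support >= min_support
    if genome_ok and mrna_ok:
        return "candidate polymorphism"      # both variants are supported
    if genome_ok:
        return "genome variant supported"    # mRNA may carry an error
    if mrna_ok:
        return "mRNA variant supported"      # genome error or RNA editing
    return "unresolved"                      # too little evidence either way


def tally(discrepancies, min_support: int = 2) -> Counter:
    """Count how many discrepancies fall into each category."""
    return Counter(classify(d, min_support) for d in discrepancies)


# Made-up example data, for illustration only.
examples = [
    Discrepancy("site_A", genome_support=6, mrna_support=0),
    Discrepancy("site_B", genome_support=4, mrna_support=3),
    Discrepancy("site_C", genome_support=0, mrna_support=5),
    Discrepancy("site_D", genome_support=1, mrna_support=0),
]
counts = tally(examples)
```

Sites where the mRNA variant alone is supported would still need follow-up, such as the diversity-panel screening the abstract mentions, to separate genome sequence errors from RNA editing.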
