Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics.

Authors

Fermin Damian, Allen Baxter B, Blackwell Thomas W, Menon Rajasree, Adamski Marcin, Xu Yin, Ulintz Peter, Omenn Gilbert S, States David J

Affiliation

Bioinformatics Program, University of Michigan, Ann Arbor, MI 48109, USA.

Publication information

Genome Biol. 2006;7(4):R35. doi: 10.1186/gb-2006-7-4-r35. Epub 2006 Apr 28.

DOI:10.1186/gb-2006-7-4-r35

PMID:16646984

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1557991/

Abstract

BACKGROUND

Defining the location of genes and the precise nature of gene products remains a fundamental challenge in genome annotation. Interrogating tandem mass spectrometry data using genomic sequence provides an unbiased method to identify novel translation products. A six-frame translation of the entire human genome was used as the query database to search for novel blood proteins in the data from the Human Proteome Organization Plasma Proteome Project. Because this target database is orders of magnitude larger than the databases traditionally employed in tandem mass spectra analysis, careful attention to significance testing is required. Confidence of identification is assessed using our previously described Poisson statistic, which estimates the significance of multi-peptide identifications incorporating the length of the matching sequence, number of spectra searched and size of the target sequence database.
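A six-frame translation can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names (`six_frame_translation`, `stop_to_stop_orfs`), the use of the standard genetic code, the stop-to-stop definition of an open reading frame, and the `min_length` parameter are all assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_to_stop_orfs(protein, min_length=2):
    """Split a translated frame at stop codons (assumed ORF definition)."""
    return [seg for seg in protein.split("*") if len(seg) >= min_length]


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Each translated frame roughly triples the amount of sequence per strand to be searched, which is why a genome-wide database built this way is so much larger than a conventional protein database.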

RESULTS

Applying a false discovery rate threshold of 0.05, we identified 282 significant open reading frames, each containing two or more peptide matches. There were 627 novel peptides associated with these open reading frames that mapped to a unique genomic coordinate placed within the start/stop points of previously annotated genes. These peptides matched 1,110 distinct tandem MS spectra. Peptides fell into four categories based upon where their genomic coordinates placed them relative to annotated exons within the parent gene.
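The abstract does not say which procedure was used to apply the 0.05 false discovery rate threshold. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up procedure, a common choice that is swapped in here as an assumption; the function name `benjamini_hochberg` and the example p-values are hypothetical, and in practice the inputs would be per-ORF p-values such as those produced by a Poisson-type significance model.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are significant.

    Benjamini-Hochberg step-up: sort p-values, find the largest rank k with
    p_(k) <= alpha * k / m, and call the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes

    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical per-ORF p-values, for illustration only.
orf_p_values = [0.001, 0.04, 0.03, 0.5]
print(benjamini_hochberg(orf_p_values))  # [True, False, False, False]
```

Because the procedure is step-up, a p-value that fails its own rank threshold can still be called significant when a larger p-value passes at a higher rank.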

CONCLUSION

This work provides evidence for novel alternative splice variants in many previously annotated genes. These findings suggest that annotation of the genome is not yet complete and that proteomics has the potential to further add to our understanding of gene structures.
