Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets.
Affiliations
From the ‡BGI-Shenzhen, Shenzhen 518083, China.
§Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK.
Publication information
Mol Cell Proteomics. 2019 Jan;18(1):86-98. doi: 10.1074/mcp.RA118.000832. Epub 2018 Oct 6.
Rice (Oryza sativa) is one of the world's most important crops. Its genome has been available for over 10 years and has undergone several rounds of annotation. We created a comprehensive database in which to search for protein-level evidence, comprising transcripts from 29 public RNA sequencing data sets, officially predicted genes from Ensembl Plants, and common contaminants. We re-analyzed nine publicly accessible rice proteomics data sets. In total, we identified 420K peptide-spectrum matches from 47K peptides and 8,187 protein groups. Initially, 4,168 peptides were classed as putative novel peptides (not matching official genes). Following a strict filtration scheme to rule out other possible explanations, we discovered 1,584 high-confidence novel peptides. The novel peptides were clustered into 692 genomic loci where our results suggest annotation improvements. Of the novel peptides, 80% had an ortholog match in the curated protein sequence set from at least one other plant species. For the peptides clustering in intergenic regions (and thus potentially new genes), 101 loci were identified, of which 43 had a high-confidence hit for a protein domain. Our results can be displayed as tracks on the Ensembl genome browser or other browsers supporting Track Hubs, to support re-annotation of the rice genome.
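The abstract does not say how the novel peptides were grouped into genomic loci. As a rough illustration only, one common approach is to merge peptides mapped to the same chromosome and strand when they lie within a fixed distance of each other. In the sketch below, the `PeptideHit` record, the `cluster_into_loci` function, the 1,000 bp `max_gap` threshold, and all coordinates and sequences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class PeptideHit(NamedTuple):
    """A peptide mapped to genomic coordinates (hypothetical record layout)."""
    chrom: str
    strand: str
    start: int
    end: int
    sequence: str


def cluster_into_loci(hits, max_gap=1000):
    """Group hits on the same chromosome and strand into loci.

    Two consecutive hits (sorted by start) share a locus when the gap
    between the running locus end and the next start is <= max_gap.
    The max_gap default is an assumed value, not one from the paper.
    """
    by_key = defaultdict(list)
    for hit in hits:
        by_key[(hit.chrom, hit.strand)].append(hit)

    loci = []
    for key in sorted(by_key):
        group = sorted(by_key[key], key=lambda h: h.start)
        current = [group[0]]
        current_end = group[0].end
        for hit in group[1:]:
            if hit.start - current_end <= max_gap:
                # Close enough: extend the current locus.
                current.append(hit)
                current_end = max(current_end, hit.end)
            else:
                # Too far away: close the locus and start a new one.
                loci.append(current)
                current = [hit]
                current_end = hit.end
        loci.append(current)
    return loci


# Made-up example hits, for illustration only.
example_hits = [
    PeptideHit("1", "+", 100, 130, "PEPTIDEK"),
    PeptideHit("1", "+", 500, 530, "ANOTHERK"),
    PeptideHit("1", "+", 5000, 5030, "FARAWAYK"),
    PeptideHit("1", "-", 520, 550, "REVERSEK"),
    PeptideHit("2", "+", 100, 130, "OTHERCHR"),
]
example_loci = cluster_into_loci(example_hits)
```

Keeping strands separate means that peptides overlapping in position but encoded on opposite strands are reported as distinct candidate loci, which seems the safer default when the goal is to flag regions for manual re-annotation.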