Shirota Matsuyuki, Kinoshita Kengo
Graduate School of Medicine, Tohoku University, Sendai, Miyagi 9808575, Japan; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan.
Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan; Institute for Development, Aging and Cancer, Tohoku University, Sendai, Miyagi 9808575, Japan.
Database (Oxford). 2016 Sep 1;2016. doi: 10.1093/database/baw124. Print 2016.
The protein-coding sequences in the human reference genome GRCh38, the RefSeq mRNA database and the UniProt protein database are sometimes inconsistent with each other because of polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions among the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. By assigning alternative allele frequencies to the discordant bases, we observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt. Since some reference sequences carry minor alleles, functional and structural annotations may be based on alleles that are rare in the human population, thereby biasing these analyses. Some of the differences between RefSeq and GRCh38 reflect biological differences at known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon-intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify the overall consistency and the remaining inconsistencies between the reference sequences.
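The idea of asking which reference carries the major allele at a discordant base can be illustrated with a minimal sketch. This is not the authors' pipeline: the `DiscordantSite` record, its field names, the 0.5 cutoff and the example sites are all hypothetical, and the sketch assumes the alternative allele frequency is reported relative to the GRCh38 base.

```python
from dataclasses import dataclass


@dataclass
class DiscordantSite:
    """Hypothetical record for one base where the references disagree."""
    chrom: str
    pos: int          # 1-based GRCh38 coordinate
    grch38_base: str  # base in the genome reference
    refseq_base: str  # base in the mRNA reference at the same position
    alt_base: str     # alternative allele (relative to GRCh38)
    alt_af: float     # alternative allele frequency, between 0 and 1


def major_allele(site: DiscordantSite) -> str:
    # Assumption: the alternative allele is "major" only when its frequency
    # exceeds 0.5; at exactly 0.5 the GRCh38 base is kept as the major allele.
    return site.alt_base if site.alt_af > 0.5 else site.grch38_base


def carries_major(site: DiscordantSite) -> dict:
    # For each reference, report whether its base equals the major allele.
    major = major_allele(site)
    return {
        "GRCh38": site.grch38_base == major,
        "RefSeq": site.refseq_base == major,
    }


def summarize(sites) -> dict:
    # Count, per reference, the discordant sites where it has the major allele.
    counts = {"GRCh38": 0, "RefSeq": 0}
    for site in sites:
        for name, has_major in carries_major(site).items():
            counts[name] += int(has_major)
    return counts


# Made-up example sites, for illustration only.
example_sites = [
    DiscordantSite("chr1", 1000, "A", "G", "G", 0.92),
    DiscordantSite("chr2", 2000, "C", "T", "T", 0.10),
]
example_counts = summarize(example_sites)
```

In this toy input, the first site has a common alternative allele that matches the RefSeq base, and the second has a rare alternative allele, so GRCh38 holds the major allele there. A real analysis would take the frequencies from a population variant resource and would also need to handle indels, multi-allelic sites and strand orientation, none of which are covered here.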