Minkin Ilia, Salzberg Steven L
Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, United States.
Center for Computational Biology, Johns Hopkins University, 3100 Wyman Park Drive, Baltimore, MD 21211, United States.
Nucleic Acids Res. 2025 Mar 20;53(6). doi: 10.1093/nar/gkaf184.
Despite many improvements over the years, the annotation of the human genome remains imperfect. The use of evolutionarily conserved sequences provides a strategy for selecting a high-confidence subset of the annotation. Using the latest whole-genome alignment, we found that splice sites from protein-coding genes in the high-quality MANE annotation are consistently conserved across >350 species. We also studied splice sites from the RefSeq, GENCODE, and CHESS databases not present in MANE. In addition, we analyzed the completeness of the alignment with respect to the human genome annotations and described a method that would allow us to fix up to 60% of the missing alignments of the protein-coding exons. We trained a logistic regression classifier to distinguish between the conservation exhibited by sites from MANE versus sites chosen randomly from neutrally evolving sequences. We found that splice sites classified by our model as well-supported have lower single nucleotide polymorphism rates and better transcriptomic evidence. We then computed a subset of transcripts using only "well-supported" splice sites or ones from MANE. This subset is enriched in high-confidence transcripts of the major gene catalogs that appear to be under purifying selection and are more likely to be correct and functionally relevant.
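The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two per-site features (fraction of aligned species preserving the splice-site dinucleotide, and mean identity of the flanking bases), the synthetic data, the training settings, and the 0.5 cutoff for "well-supported" are all assumptions made for illustration. The sketch fits a logistic regression by batch gradient descent to separate simulated conserved sites from simulated neutrally evolving ones.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    # Probability that a site with feature vector x belongs to the conserved class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=500):
    # Batch gradient descent on the mean log-loss (hypothetical settings).
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def simulate_sites(rng, n, mean, sd):
    # Synthetic stand-in for real alignment data: two hypothetical features
    # per site, each clamped to [0, 1].
    def clamp(v):
        return min(1.0, max(0.0, v))
    return [[clamp(rng.gauss(mean, sd)), clamp(rng.gauss(mean, sd))]
            for _ in range(n)]


rng = random.Random(0)

# Label 1: simulated annotated (conserved) sites; label 0: simulated neutral sites.
X_train = simulate_sites(rng, 200, 0.9, 0.08) + simulate_sites(rng, 200, 0.35, 0.15)
y_train = [1] * 200 + [0] * 200
w, b = train_logistic(X_train, y_train)

# Held-out evaluation on freshly simulated sites.
X_test = simulate_sites(rng, 100, 0.9, 0.08) + simulate_sites(rng, 100, 0.35, 0.15)
y_test = [1] * 100 + [0] * 100
CUTOFF = 0.5  # assumed threshold for calling a site "well-supported"
correct = sum((predict_proba(w, b, x) >= CUTOFF) == bool(y)
              for x, y in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real analysis the features would be derived from the whole-genome alignment rather than simulated, and the positive and negative training sets would come from MANE splice sites and randomly chosen neutral positions, respectively.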