Finding and extending ancient simple sequence repeat-derived regions in the human genome.

Author information

Shortt Jonathan A, Ruggiero Robert P, Cox Corey, Wacholder Aaron C, Pollock David D

Affiliations

1Colorado Center for Personalized Medicine, University of Colorado School of Medicine, Aurora, CO 80045 USA.

2Department of Biology, Southeast Missouri State University, Cape Girardeau, MO 63701 USA.

Publication information

Mob DNA. 2020 Feb 17;11:11. doi: 10.1186/s13100-020-00206-y. eCollection 2020.

Abstract

BACKGROUND

Previously, 3% of the human genome has been annotated as simple sequence repeats (SSRs), similar to the proportion annotated as protein coding. The origin of much of the genome is not well annotated, however, and some of the unidentified regions are likely to be ancient SSR-derived regions not identified by current methods. The identification of these regions is complicated because SSRs appear to evolve through complex cycles of expansion and contraction, often interrupted by mutations that alter both the repeated motif and the mutation rate. We applied an empirical, kmer-based approach to identify genome regions that are likely derived from SSRs.
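To make the notion of an uninterrupted SSR concrete, the following is a minimal sketch that finds perfect tandem repeats with a regular expression. It is an illustration only and not the annotation method used in the paper; the function name `find_perfect_ssrs` and the thresholds `max_motif` and `min_length` are assumptions chosen for the example.

```python
import re


def find_perfect_ssrs(seq, max_motif=6, min_length=12):
    """Return (start, end, motif) for uninterrupted tandem repeats.

    max_motif and min_length are illustrative thresholds, not values
    taken from the paper.
    """
    # Lazy quantifier picks the shortest motif; \1+ requires at least
    # one further exact copy immediately after it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1+" % max_motif)
    hits = []
    for m in pattern.finditer(seq.upper()):
        if m.end() - m.start() >= min_length:
            hits.append((m.start(), m.end(), m.group(1)))
    return hits


# Toy sequence: seven copies of AC flanked by non-repetitive bases.
example = "TTG" + "AC" * 7 + "GGT"
ssr_hits = find_perfect_ssrs(example)
```

A single substitution inside the repeat splits or shortens the match, which is one reason why older, mutated SSR-derived sequence can escape detection by exact-repeat searches of this kind.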

RESULTS

The sequences flanking annotated SSRs are enriched for similar sequences and for SSRs with similar motifs, suggesting that the evolutionary remains of SSR activity abound in regions near obvious SSRs. Using our previously described P-clouds approach, we identified 'SSR-clouds', groups of similar kmers (or 'oligos') that are enriched near a training set of unbroken SSR loci, and then used the SSR-clouds to detect likely SSR-derived regions throughout the genome.
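The general idea of collecting kmers that are enriched near known SSRs and then using them to mark candidate regions can be sketched as follows. This is a simplified stand-in and not the P-clouds algorithm itself: the function names, the pseudocount, and the thresholds `min_count` and `min_fold` are all assumptions made for illustration, and the toy sequences are invented.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every overlapping kmer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enriched_kmers(flanks, background, k=4, min_count=2, min_fold=2.0):
    """Return kmers over-represented in SSR flanks relative to background.

    Thresholds are hypothetical; a pseudocount avoids division by zero.
    """
    fg = count_kmers(flanks, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    cloud = set()
    for kmer, n in fg.items():
        if n < min_count:
            continue
        fg_freq = n / fg_total
        bg_freq = (bg[kmer] + 1) / (bg_total + 1)
        if fg_freq / bg_freq >= min_fold:
            cloud.add(kmer)
    return cloud


def annotate(seq, cloud, k=4):
    """Return half-open (start, end) intervals covered by cloud kmers."""
    seq = seq.upper()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in cloud:
            for j in range(i, i + k):
                covered[j] = True
    regions, start = [], None
    for i, is_covered in enumerate(covered):
        if is_covered and start is None:
            start = i
        elif not is_covered and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(seq)))
    return regions


# Invented toy data: flanks rich in AC-like kmers, unrelated background.
flanks = ["ACACATACAC", "CACACGCACA"]
background = ["GATTCGGCTA", "TTGCCGATAG"]
cloud = enriched_kmers(flanks, background)
regions = annotate("GGGGACACACGGGG", cloud)
```

With these toy inputs the enriched set contains only `ACAC` and `CACA`, and the annotated interval is `(4, 10)`. A real analysis would use longer kmers, genome-scale counts, and the grouping of similar kmers that the abstract attributes to the P-clouds approach.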

CONCLUSIONS

Our analysis indicates that the amount of likely SSR-derived sequence in the human genome is 6.77%, over twice as much as previous estimates, including millions of newly identified ancient SSR-derived loci. SSR-clouds identified poly-A sequences adjacent to transposable element termini in over 74% of the oldest class of (roughly, ), validating the sensitivity of the approach. Poly-A's annotated by SSR-clouds also had a length distribution that was more consistent with their poly-A origins, with mean about 35 bp even in older . This work demonstrates that the high sensitivity provided by SSR-Clouds improves the detection of SSR-derived regions and will enable deeper analysis of how decaying repeats contribute to genome structure.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e882/7027126/be1e1a58abd4/13100_2020_206_Fig1_HTML.jpg
