Comparative genomic analysis of the human genome and six bat genomes using unsupervised machine learning: Mb-level CpG and TFBS islands.

Affiliations

Department of Bioscience, Nagahama Institute of Bio-Science and Technology, Tamura-cho 1266, Nagahama-shi, Shiga-ken, 526-0829, Japan.

Smart Information Systems, Faculty of Engineering, Niigata University, Niigata-ken, 950-2181, Japan.

Publication information

BMC Genomics. 2022 Jul 8;23(1):497. doi: 10.1186/s12864-022-08664-9.

Abstract

BACKGROUND

RNA viruses that cause emerging infectious diseases, such as SARS-CoV-2 and Ebola virus, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods that extract novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of oligonucleotide composition that reveals novel genome characteristics from big sequence data.
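As a rough illustration of what "oligonucleotide composition" means as input to a self-organizing map, the sketch below turns a sequence into fixed-length fragments and computes one k-mer frequency vector per fragment. It is a minimal sketch, not the authors' pipeline: the function names, the fragment length, the default k=4, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=4):
    """Return relative frequencies of all 4**k oligonucleotides in seq.

    Hypothetical helper: words containing N or other ambiguity codes are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in kmers}


def fragment_vectors(seq, fragment_len, k=4):
    """Split seq into non-overlapping fragments; one composition vector per fragment."""
    return [
        kmer_composition(seq[s:s + fragment_len], k)
        for s in range(0, len(seq) - fragment_len + 1, fragment_len)
    ]
```

Each resulting vector has 4**k entries (256 for tetranucleotides), and a clustering method such as a self-organizing map would take these vectors as its input data.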

RESULTS

In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified Mb-level enrichment of the CG dinucleotide (Mb-level CpG islands) around the termini of long bat scaffold sequences. In addition, a class of CG-containing oligonucleotides was enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed clear enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes.
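The idea of a distribution map for the CG dinucleotide can be sketched as a windowed scan along a scaffold or chromosome. The abstract does not specify how the maps were computed, so the following is only an assumed, simplified version: the function name `cpg_profile`, the default 1-Mb window, and the use of the common observed/expected CpG ratio (CG count × window length / (C count × G count)) are illustrative choices, not the authors' method.

```python
def cpg_profile(seq, window=1_000_000, step=None):
    """Return (start, CG count, observed/expected CpG ratio) per window.

    Hypothetical helper for illustration; windows are non-overlapping
    unless a smaller step is given.
    """
    seq = seq.upper()
    step = step or window
    profile = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        n_c = chunk.count("C")
        n_g = chunk.count("G")
        n_cg = chunk.count("CG")  # CG cannot overlap itself, so count() is exact
        if n_c and n_g:
            ratio = n_cg * len(chunk) / (n_c * n_g)
        else:
            ratio = 0.0  # ratio is undefined without both C and G; report 0
        profile.append((start, n_cg, ratio))
    return profile
```

Plotting the CG count or ratio against the window start position would give a simple map in which Mb-scale enrichment near scaffold termini, if present, appears as elevated values in the first and last windows.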

CONCLUSION

Functions of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) have received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2988/9264549/df9bcfa48893/12864_2022_8664_Fig1_HTML.jpg
