Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, United States.
Department of Biology, Miami University, Oxford, United States.
eLife. 2024 Oct 11;13:RP96738. doi: 10.7554/eLife.96738.
Annotation of newly sequenced genomes frequently includes genes, but rarely covers important non-coding genomic features such as the cis-regulatory modules (e.g., enhancers and silencers) that regulate gene expression. Here, we begin to remedy this situation by developing a workflow for rapid initial annotation of insect regulatory sequences, and provide a searchable database resource with enhancer predictions for 33 genomes. Using our previously developed SCRMshaw computational enhancer prediction method, we predict over 2.8 million regulatory sequences, along with the tissues where they are expected to be active, in a set of insect species spanning over 360 million years of evolution. Extensive analysis and validation of the data provide several lines of evidence suggesting that we achieve a high true-positive rate for enhancer prediction. First, we show that our predictions target specific loci rather than random genomic locations. Second, we predict enhancers in orthologous loci across a diverged set of species significantly more often than expected by chance. Third, we demonstrate that our predictions are highly enriched for regions of accessible chromatin. Fourth, we achieve a validation rate in excess of 70% using in vivo reporter gene assays. As we continue to annotate both new tissues and new species, our regulatory annotation resource will provide a rich source of data for the research community and will have utility for both small-scale (single gene, single species) and large-scale (many genes, many species) studies of gene regulation. In particular, the ability to search for functionally related regulatory elements in orthologous loci should greatly facilitate studies of enhancer evolution, even among distantly related species.