Department of Plant Biology, Michigan State University, East Lansing, MI.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI.
Mol Biol Evol. 2018 Jun 1;35(6):1422-1436. doi: 10.1093/molbev/msy035.
With advances in transcript profiling, the presence of transcriptional activities in intergenic regions has been well established. However, whether intergenic expression reflects transcriptional noise or activity of novel genes remains unclear. We identified intergenic transcribed regions (ITRs) in 15 diverse flowering plant species and found that the amount of intergenic expression correlates with genome size, a pattern that could be expected if intergenic expression is largely nonfunctional. To further assess the functionality of ITRs, we first built machine learning models using Arabidopsis thaliana as a model that accurately distinguish functional sequences (benchmark protein-coding and RNA genes) and likely nonfunctional ones (pseudogenes and unexpressed intergenic regions) by integrating 93 biochemical, evolutionary, and sequence-structure features. Next, by applying the models genome-wide, we found that 4,427 ITRs (38%) and 796 annotated ncRNAs (44%) had features significantly similar to benchmark protein-coding or RNA genes and thus were likely parts of functional genes. Approximately 60% of ITRs and ncRNAs were more similar to nonfunctional sequences and were likely transcriptional noise. The predictive framework established here provides not only a comprehensive look at how functional, genic sequences are distinct from likely nonfunctional ones, but also a new way to differentiate novel genes from genomic regions with noisy transcriptional activities.
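The two-step idea in the abstract (train a classifier on benchmark functional and likely nonfunctional sequences, then score ITRs with it) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, only 5 made-up features stand in for the 93 features, a plain logistic regression stands in for whatever models the authors actually used, and the 0.5 cutoff is arbitrary. It does not reproduce the paper's method or results.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_feat):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def score(w, b, x):
    """Model-estimated probability that feature vector x is 'functional'."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def make_synthetic(n, mean, rng, n_feat=5):
    # Hypothetical feature vectors; real features would be biochemical,
    # evolutionary, and sequence-structure measurements.
    return [[rng.gauss(mean, 1.0) for _ in range(n_feat)] for _ in range(n)]


rng = random.Random(0)

# Step 1: train on benchmark classes (synthetic stand-ins).
functional = make_synthetic(200, 1.0, rng)      # e.g. benchmark genes
nonfunctional = make_synthetic(200, -1.0, rng)  # e.g. pseudogenes
X = functional + nonfunctional
y = [1] * len(functional) + [0] * len(nonfunctional)
w, b = train_logistic(X, y)

train_acc = sum((score(w, b, x) >= 0.5) == bool(t) for x, t in zip(X, y)) / len(X)

# Step 2: score unlabeled regions (synthetic "ITRs") and apply a cutoff.
itrs = make_synthetic(40, 1.0, rng) + make_synthetic(60, -1.0, rng)
THRESHOLD = 0.5  # arbitrary cutoff chosen for this sketch
likely_functional = [x for x in itrs if score(w, b, x) >= THRESHOLD]
frac_functional = len(likely_functional) / len(itrs)

print(f"training accuracy: {train_acc:.2f}")
print(f"fraction of ITRs scored as likely functional: {frac_functional:.2f}")
```

In a real analysis the accuracy would be estimated on held-out data rather than the training set, and the cutoff would be chosen from the score distributions of the benchmark classes.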