Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824, USA.
Plant Physiol. 2013 Jan;161(1):210-24. doi: 10.1104/pp.112.205245. Epub 2012 Nov 6.
The Arabidopsis (Arabidopsis thaliana) genome is the most well-annotated plant genome. However, transcriptome sequencing in Arabidopsis continues to suggest the presence of polyadenylated (polyA) transcripts originating from presumed intergenic regions. It is not clear whether these transcripts represent novel noncoding or protein-coding genes. To understand the nature of intergenic polyA transcription, we first assessed its abundance using multiple messenger RNA sequencing data sets. We found 6,545 intergenic transcribed fragments (ITFs) occupying 3.6% of Arabidopsis intergenic space. In contrast to transcribed fragments that map to protein-coding and RNA genes, most ITFs are significantly shorter, are expressed at significantly lower levels, and tend to be more data set specific. A surprisingly large number of ITFs (32.1%) may be protein coding based on evidence of translation. However, our results indicate that these "translated" ITFs tend to be close to and are likely associated with known genes. To investigate if ITFs are under selection and are functional, we assessed ITF conservation through cross-species as well as within-species comparisons. Our analysis reveals that 237 ITFs, including 49 with translation evidence, are under strong selective constraint and relatively distant from annotated features. These ITFs are likely parts of novel genes. However, the selective pressure imposed on most ITFs is similar to that of randomly selected, untranscribed intergenic sequences. Our findings indicate that despite the prevalence of ITFs, apart from the possibility of genomic contamination, many may be background or noisy transcripts derived from "junk" DNA, whose production may be inherent to the process of transcription and which, on rare occasions, may act as catalysts for the creation of novel genes.