MetaSystems Research Team, Computational Systems Biology Research Group, Advanced Computational Sciences Department, RIKEN Advanced Science Institute, Yokohama, Japan.
PLoS One. 2010 Oct 12;5(10):e13284. doi: 10.1371/journal.pone.0013284.
Findings from the ENCODE project indicate that almost every base of the human genome is transcribed. One class of the resulting transcripts arises from conjoined genes, each formed by combining exons of two or more distinct (parent) genes lying on the same strand of a chromosome. Only a very limited number of such genes are known, and the definitions and terminology used for them vary widely across public databases. In this work, we computationally identified and manually curated 751 conjoined genes (CGs) in the human genome, each supported by at least one mRNA or EST sequence available in the NCBI database. We subjected 353 representative CGs to experimental validation by RT-PCR and sequencing, and 291 (82%) of them were confirmed. We speculate that these genes arise from novel functional requirements and are not merely artifacts of transcription, since more than 70% of them are conserved in other vertebrate genomes. The unique splicing patterns exhibited by CGs suggest possible roles in protein evolution or gene regulation. Novel CGs, for which no transcript is available, could be identified in 80% of randomly selected potential CG-forming regions, indicating that their formation is a routine process. Formation of CGs is not limited to humans: using our approach, we also identified 270 CGs in mouse and 227 in Drosophila. Additionally, we propose a novel mechanism for the formation of CGs. Finally, we developed a database, ConjoinG, which contains detailed information about all the CGs (800 in total) identified in the human genome. In summary, our findings provide new insights into the functionality of CGs as another possible mechanism for gene regulation and genomic evolution, and into the mechanism leading to their formation.