Department of Biomedical Informatics, College of Medicine, The Ohio State University, 1800 Cannon Drive, Columbus, OH 43210, USA.
Department of Medicine, Indiana University School of Medicine, 545 Barnhill Drive, Indianapolis, IN 46202, USA.
Gigascience. 2019 May 1;8(5). doi: 10.1093/gigascience/giz046.
Long regarded as "relics" of evolution, pseudogenes have only recently attracted medical interest for their regulatory roles in cancer. These regulatory roles are often a direct by-product of their close sequence homology to protein-coding genes. Novel pseudogene-gene (PGG) functional associations can be identified by integrating biomedical data such as sequence homology, functional pathways, gene expression, pseudogene expression, and microRNA expression. However, not all of this information has been integrated, and almost all previous pseudogene studies relied on 1:1 pseudogene-parent gene relationships without leveraging other homologous genes and pseudogenes.
We produce PGG families that expand beyond the current 1:1 paradigm. First, we construct expansive PGG databases by (i) CUDAlign graphics processing unit (GPU) accelerated local alignment of all pseudogenes to gene families (totaling 1.6 billion individual local alignments and >40,000 GPU hours) and (ii) BLAST-based assignment of pseudogenes to gene families. Second, we create an open-source web application (PseudoFuN [Pseudogene Functional Networks]) to search for integrative functional relationships among sequence homology, microRNA expression, gene expression, pseudogene expression, and gene ontology. We produce four "flavors" of CUDAlign-based databases (>462,000,000 PGG pairwise alignments and 133,770 PGG families) that can be queried and downloaded using PseudoFuN. These databases are consistent with previous 1:1 PGG annotation and are also much more powerful, as they include millions of de novo PGG associations. For example, we find multiple known (e.g., miR-20a-PTEN-PTENP1) and novel (e.g., miR-375-SOX15-PPP4R1L) microRNA-gene-pseudogene associations in prostate cancer. PseudoFuN provides a "one-stop shop" for identifying and visualizing thousands of potential regulatory relationships related to pseudogenes in The Cancer Genome Atlas cancers.
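As a rough illustration of how pairwise alignment results might be collapsed into PGG families, the sketch below assigns each pseudogene to the gene family of its best-scoring aligned gene. The abstract does not state the actual assignment rule, so the best-score rule, the `min_score` cutoff, the function name `build_pgg_families`, and all identifiers and scores in the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_pgg_families(alignments, gene_to_family, min_score=0.0):
    """Group pseudogenes with gene families (hypothetical rule, not the paper's).

    alignments: iterable of (pseudogene_id, gene_id, alignment_score) tuples.
    gene_to_family: dict mapping gene_id -> family_id.
    min_score: assumed cutoff; alignments scoring below it are ignored.
    """
    # Track the best-scoring family seen so far for each pseudogene.
    best = {}
    for pseudogene, gene, score in alignments:
        family = gene_to_family.get(gene)
        if family is None or score < min_score:
            continue
        if pseudogene not in best or score > best[pseudogene][0]:
            best[pseudogene] = (score, family)

    # Each family holds its member genes plus the pseudogenes assigned to it.
    families = defaultdict(lambda: {"genes": set(), "pseudogenes": set()})
    for gene, family in gene_to_family.items():
        families[family]["genes"].add(gene)
    for pseudogene, (_, family) in best.items():
        families[family]["pseudogenes"].add(pseudogene)
    return dict(families)


# Toy data with made-up identifiers and scores.
toy_gene_to_family = {"GENE_A1": "fam1", "GENE_A2": "fam1", "GENE_B1": "fam2"}
toy_alignments = [
    ("PG1", "GENE_A1", 950.0),
    ("PG1", "GENE_B1", 120.0),
    ("PG2", "GENE_B1", 400.0),
    ("PG3", "GENE_A2", 10.0),  # falls below the cutoff used below
]
toy_families = build_pgg_families(toy_alignments, toy_gene_to_family, min_score=50.0)
```

Because a family here contains several genes and any number of pseudogenes, the grouping is many-to-many rather than a 1:1 pseudogene-parent pairing.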
Bioinformaticians and oncologists alike can explore thousands of new PGG associations in the context of microRNA-gene-pseudogene co-expression and differential expression using a simple-to-use online tool.
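To make the notion of microRNA-gene-pseudogene co-expression concrete, the following minimal sketch computes Pearson correlations across samples and flags a triplet in which the gene and pseudogene rise and fall together while the microRNA moves in the opposite direction. This pattern, the `threshold` value, the function names, and the expression values are assumptions for illustration; they are not taken from PseudoFuN.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def is_candidate_triplet(mirna, gene, pseudogene, threshold=0.5):
    """Hypothetical filter: gene and pseudogene positively co-expressed,
    and the microRNA negatively correlated with both."""
    return (
        pearson(gene, pseudogene) >= threshold
        and pearson(mirna, gene) <= -threshold
        and pearson(mirna, pseudogene) <= -threshold
    )


# Made-up expression values across four samples.
toy_gene = [1.0, 2.0, 3.0, 4.0]
toy_pseudogene = [2.0, 4.0, 6.0, 8.0]
toy_mirna = [4.0, 3.0, 2.0, 1.0]
toy_flag = is_candidate_triplet(toy_mirna, toy_gene, toy_pseudogene)
```

A real analysis would also need multiple-testing correction and a differential expression comparison between tumor and normal samples, which this sketch omits.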