Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany.
Bioinformatics. 2011 Jul 15;27(14):1894-900. doi: 10.1093/bioinformatics/btr314. Epub 2011 May 26.
Long non-coding RNAs (lncRNAs) resemble protein-coding mRNAs but do not encode proteins. Most lncRNAs are under weaker sequence constraints than protein-coding genes and lack conserved secondary structures, which makes them hard to predict computationally.
We introduce an approach for predicting spliced lncRNAs in vertebrate genomes that combines comparative genomics and machine learning. It is based on detecting signatures of characteristic splice site evolution in vertebrate whole-genome alignments. First, we predict individual splice sites, then assemble compatible sites into exon candidates, and finally predict multi-exon transcripts. Using a novel method for evaluating typical splice site substitution patterns that explicitly takes the species phylogeny into account, we show that individual splice sites can be predicted accurately. Since our approach relies only on predicted splice sites, it can uncover both coding and non-coding exons. We show that our predicted exons and partial transcripts are mostly non-coding and lack conserved secondary structures. These exons are of particular interest, since existing computational approaches cannot detect them. Transcriptome sequencing data indicate tissue-specific expression patterns of predicted exons, and there is evidence that increasing sequencing depth and breadth will validate additional predictions. We also found a significant enrichment of predicted exons that form parts of multi-exon transcripts, and we experimentally validate one such novel multi-exon gene. Overall, we obtain 336 novel multi-exon transcript predictions from human intergenic regions. Our results indicate the existence of novel human transcripts that are conserved in evolution, and our approach contributes to the completion of the human transcript catalog.
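The second step of the pipeline, assembling compatible splice sites into exon candidates, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SpliceSite` and `ExonCandidate` records, the compatibility rule (same chromosome and strand, acceptor upstream of donor in transcript orientation), the length bounds `min_len` and `max_len`, and the summed score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpliceSite:
    chrom: str
    pos: int       # genomic coordinate of the exon boundary
    strand: str    # "+" or "-"
    kind: str      # "acceptor" (5' end of an exon) or "donor" (3' end)
    score: float   # hypothetical per-site prediction score


@dataclass(frozen=True)
class ExonCandidate:
    chrom: str
    start: int
    end: int
    strand: str
    score: float


def assemble_exons(sites: List[SpliceSite],
                   min_len: int = 20,
                   max_len: int = 500) -> List[ExonCandidate]:
    """Pair every acceptor with every compatible donor (assumed rule)."""
    acceptors = [s for s in sites if s.kind == "acceptor"]
    donors = [s for s in sites if s.kind == "donor"]
    exons = []
    for acc in acceptors:
        for don in donors:
            if (acc.chrom, acc.strand) != (don.chrom, don.strand):
                continue
            # On "+" the acceptor lies left of the donor in genomic
            # coordinates; on "-" the order is reversed.
            if acc.strand == "+":
                start, end = acc.pos, don.pos
            else:
                start, end = don.pos, acc.pos
            length = end - start + 1
            # A wrongly ordered pair yields a non-positive length and is dropped.
            if min_len <= length <= max_len:
                exons.append(ExonCandidate(acc.chrom, start, end,
                                           acc.strand, acc.score + don.score))
    return sorted(exons, key=lambda e: (e.chrom, e.start, e.end))


if __name__ == "__main__":
    demo = [
        SpliceSite("chr1", 100, "+", "acceptor", 1.0),
        SpliceSite("chr1", 250, "+", "donor", 2.0),
        SpliceSite("chr1", 90, "+", "donor", 0.5),   # upstream of acceptor
    ]
    for exon in assemble_exons(demo):
        print(exon)
```

Pairing all sites is quadratic in the number of predictions; for genome-scale input one would presumably sort sites by position and only scan a window of `max_len` bases, but the simple form keeps the compatibility rule easy to read.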
Predicted human splice sites, exons and gene structures together with a Perl implementation of the tree-based log-odds scoring and a supplementary PDF file containing additional figures and tables are available at: http://www.bioinf.uni-leipzig.de/publications/supplements/10-010. The five experimentally confirmed partial transcript isoforms have been deposited in GenBank under accession numbers HM587422-HM587426.
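To give a rough idea of what a phylogeny-aware log-odds score can look like, the following Python sketch compares the likelihood of alignment columns under a slowly evolving model against a neutral model on the same species tree, using Felsenstein's pruning algorithm with a Jukes-Cantor substitution model. This is a generic stand-in, not the scoring scheme of the Perl implementation mentioned above: the substitution model, the rate parameters `conserved_rate` and `neutral_rate`, the nested-list tree encoding, and the example branch lengths are all assumptions.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a changing to b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, rate):
    """Conditional likelihoods of the subtree below `node`, one per base.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is an assumption.
    """
    if isinstance(node, str):
        obs = column.get(node, "-")
        # Gaps and missing species are treated as unobserved data.
        return [1.0 if obs not in BASES or obs == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, length in node:
        child_l = partial(child, column, rate)
        for i, a in enumerate(BASES):
            result[i] *= sum(jc_prob(a, b, rate * length) * child_l[j]
                             for j, b in enumerate(BASES))
    return result


def column_loglik(tree, column, rate):
    """Log-likelihood of one alignment column with a uniform root prior."""
    return math.log(sum(0.25 * p for p in partial(tree, column, rate)))


def log_odds(tree, columns, conserved_rate=0.2, neutral_rate=1.0):
    """Sum over columns of log L(conserved model) - log L(neutral model)."""
    return sum(column_loglik(tree, c, conserved_rate)
               - column_loglik(tree, c, neutral_rate)
               for c in columns)


if __name__ == "__main__":
    # Hypothetical three-species tree with made-up branch lengths.
    tree = [([("human", 0.1), ("mouse", 0.3)], 0.05), ("chicken", 0.6)]
    conserved = {"human": "G", "mouse": "G", "chicken": "G"}
    variable = {"human": "G", "mouse": "A", "chicken": "T"}
    print(log_odds(tree, [conserved]))  # positive: favours the slow model
    print(log_odds(tree, [variable]))   # negative: favours the neutral model
```

Because the tree enters the likelihood directly, agreement between distant species contributes more to the score than agreement between close relatives, which is the property a phylogeny-aware score is meant to capture.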