Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, USA.
Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, USA.
Bioinformatics. 2018 Nov 1;34(21):3616-3623. doi: 10.1093/bioinformatics/bty324.
Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high-quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. Underlying that discrepancy in predictive ability are the common assumptions made by reference gene-finding algorithms that genes are conserved, well-formed and produce functional proteins.
We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and non-coding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene-structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA-seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or non-coding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products, and we propose that they may commonly act as cryptic factors in disease.
The software is available from geneprediction.org/SGRF.
Supplementary information is available at Bioinformatics online.