Calico Life Sciences, South San Francisco, California, United States of America.
PLoS Comput Biol. 2020 Jul 20;16(7):e1008050. doi: 10.1371/journal.pcbi.1008050. eCollection 2020 Jul.
Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held-out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together, these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
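The abstract does not spell out the network architecture, so the following is only a minimal sketch of one way multi-genome training could be organized. It assumes a shared trunk whose parameters are updated by examples from every genome, plus one output head per species. A small linear model stands in for the deep convolutional network, the data are synthetic, and every name here (`MultiGenomeModel`, `toy_dataset`, `variant_effect`) is hypothetical rather than taken from the authors' code. The last function illustrates the idea of scoring a human variant with the mouse output head by differencing predictions for the alternate and reference alleles.

```python
import random

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a one-hot vector of length 4 * len(seq)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

class MultiGenomeModel:
    """Toy stand-in for a CNN: a shared linear trunk plus one output head per species."""

    def __init__(self, seq_len, hidden, species, seed=0):
        rng = random.Random(seed)
        self.trunk = [[rng.gauss(0, 0.1) for _ in range(4 * seq_len)]
                      for _ in range(hidden)]
        self.heads = {s: [rng.gauss(0, 0.1) for _ in range(hidden)] for s in species}

    def _features(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.trunk]

    def predict(self, seq, species):
        z = self._features(one_hot(seq))
        return sum(h * zi for h, zi in zip(self.heads[species], z))

    def sgd_step(self, seq, target, species, lr=0.05):
        """One gradient step on 0.5 * error^2: updates the shared trunk
        and only the head of the species the example came from."""
        x = one_hot(seq)
        z = self._features(x)
        head = self.heads[species]
        err = sum(h * zi for h, zi in zip(head, z)) - target
        for j, row in enumerate(self.trunk):
            g_row = err * head[j]        # trunk gradient uses the head before its update
            head[j] -= lr * err * z[j]
            for i, xi in enumerate(x):
                row[i] -= lr * g_row * xi

def toy_dataset(n, seq_len, scale, rng):
    """Synthetic data (an assumption, not from the paper): target = scale * G fraction."""
    data = []
    for _ in range(n):
        seq = "".join(rng.choice(BASES) for _ in range(seq_len))
        data.append((seq, scale * seq.count("G") / seq_len))
    return data

def train(model, data, epochs=100):
    """Alternate between species so every genome shapes the shared trunk."""
    species = sorted(data)
    longest = max(len(examples) for examples in data.values())
    for _ in range(epochs):
        for i in range(longest):
            for s in species:
                seq, y = data[s][i % len(data[s])]
                model.sgd_step(seq, y, s)

def mean_squared_error(model, examples, species):
    return sum((model.predict(s, species) - y) ** 2 for s, y in examples) / len(examples)

def variant_effect(model, ref_seq, alt_seq, species):
    """Score a variant as the predicted alt-minus-ref difference under one species' head."""
    return model.predict(alt_seq, species) - model.predict(ref_seq, species)

rng = random.Random(1)
data = {"human": toy_dataset(32, 8, 1.0, rng), "mouse": toy_dataset(32, 8, 2.0, rng)}
model = MultiGenomeModel(seq_len=8, hidden=4, species=data)

initial_loss = mean_squared_error(model, data["human"], "human")
train(model, data)
final_loss = mean_squared_error(model, data["human"], "human")
print("human MSE before/after joint training:", initial_loss, final_loss)

# Apply the mouse head to a (toy) human reference/alternate allele pair.
ref, alt = "ACGTACGT", "ACGTAGGT"
print("variant score from mouse head:", variant_effect(model, ref, alt, "mouse"))
```

In this arrangement the species-specific heads keep each genome's annotations separate, while the shared trunk is where additional training sequences from the second genome could help. Whether that matches the published model in detail is not something the abstract confirms.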