Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Nature. 2023 Jun;618(7965):616-624. doi: 10.1038/s41586-023-06139-9. Epub 2023 May 31.
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
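The transfer-learning pattern the abstract describes — a large encoder pretrained once on a broad corpus, then fine-tuned on limited task-specific data — can be sketched in miniature. The code below is purely illustrative and is not the paper's method or the Geneformer API: a fixed random projection stands in for the frozen pretrained encoder, the data are synthetic, and only a small classification head is trained, mirroring how a downstream task with few labelled cells reuses pretrained representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder (illustrative only). In the pattern the
# paper describes, a model pretrained on ~30 million transcriptomes would map
# each cell's transcriptome to an embedding; here a frozen random projection
# plays that role.
N_GENES, EMB_DIM = 200, 32
W_pretrained = rng.normal(size=(N_GENES, EMB_DIM))  # frozen, never updated

def encode(expr):
    """Map expression vectors (cells x genes) to frozen embeddings."""
    return np.tanh(expr @ W_pretrained)

# Limited task-specific data (synthetic): e.g. 40 labelled cells for a
# hypothetical binary phenotype-classification task.
n = 40
X = rng.normal(size=(n, N_GENES))
true_w = rng.normal(size=EMB_DIM)
y = (encode(X) @ true_w > 0).astype(float)

# Fine-tune only a small logistic-regression head; the encoder stays frozen.
emb = encode(X)
w, b = np.zeros(EMB_DIM), 0.0
lr = 0.5
for _ in range(500):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))
    grad = p - y
    w -= lr * emb.T @ grad / n
    b -= lr * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(emb @ w + b))) > 0.5) == y).mean()
print(f"training accuracy with a frozen encoder: {acc:.2f}")
```

Because the head is small and the encoder's weights are reused rather than relearned, even a few dozen labelled examples suffice here — the core appeal of fine-tuning in limited-data settings such as rare disease.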