Balcı Ali Tuğrul, Ebeid Mark Maher, Benos Panayiotis V, Kostka Dennis, Chikina Maria
Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, 15213, United States.
Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States.
bioRxiv. 2023 Mar 28:2023.01.25.525572. doi: 10.1101/2023.01.25.525572.
Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one often cannot explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called tiSFM (totally interpretable sequence-to-function model). tiSFM improves upon the performance of standard multi-layer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multi-layer neural network, its internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.
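To illustrate what it means for a convolutional parameter to be readable as a sequence motif, the sketch below scans a one-hot encoded DNA sequence with a fixed scoring matrix, which is the same operation as a 1-D convolution whose kernel is the motif itself. This is not the tiSFM implementation: the function names `one_hot` and `scan_motif`, the toy matrix, and the example sequence are all hypothetical and chosen only for illustration.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def scan_motif(seq, pwm):
    """Score every window of the sequence against a (width, 4) motif matrix.

    Equivalent to a 1-D convolution over the one-hot input in which the
    kernel weights are the motif scores, so each weight maps to a base
    at a motif position.
    """
    x = one_hot(seq)
    width = pwm.shape[0]
    n_windows = len(seq) - width + 1
    return np.array([np.sum(x[i:i + width] * pwm) for i in range(n_windows)])

# Hypothetical toy matrix that rewards the 4-mer "TGCA" (not a real TF motif).
toy_pwm = np.full((4, 4), -1.0)
for pos, base in enumerate("TGCA"):
    toy_pwm[pos, BASES.index(base)] = 2.0

scores = scan_motif("AATGCATT", toy_pwm)
best = int(np.argmax(scores))        # window start with the strongest match
pooled = float(scores.max())         # max-pooled motif activation
```

Because the kernel is the motif, a high activation can be read directly as "this motif occurs here", without a separate post hoc attribution step.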
We analyze published open chromatin measurements across hematopoietic lineage cell types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on the complex task of predicting the change in epigenetic state as a function of developmental transition.
The source code, implemented in Python and including scripts for the analysis of key findings, is available at https://github.com/boooooogey/ATAConv.