Chao Kuan-Hao, Mao Alan, Liu Anqi, Salzberg Steven L, Pertea Mihaela
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.
Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA.
bioRxiv. 2025 Mar 23:2025.03.20.644351. doi: 10.1101/2025.03.20.644351.
The SpliceAI deep learning system is currently one of the most accurate methods for identifying splicing signals directly from DNA sequences. However, its utility is limited by its reliance on older software frameworks and human-centric training data. Here we introduce OpenSpliceAI, a trainable, open-source version of SpliceAI implemented in PyTorch to address these challenges. OpenSpliceAI supports both training from scratch and transfer learning, enabling seamless re-training on species-specific datasets and mitigating human-centric biases. Our experiments show that it achieves faster processing speeds and lower memory usage than the original SpliceAI code, allowing large-scale analyses of extensive genomic regions on a single GPU. Additionally, OpenSpliceAI's flexible architecture eases integration with established machine learning ecosystems, simplifying the development of custom splicing models for different species and applications. We demonstrate that OpenSpliceAI's output is highly concordant with that of SpliceAI. In silico mutagenesis (ISM) analyses confirm that both models rely on similar sequence features, and calibration experiments show that the two models produce similar probability estimates from their scores.