Keukeleire Pia, Rosen Jonathan D, Göbel-Knapp Angelina, Salomon Kilian, Schubach Max, Kircher Martin
Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany.
Department of Genetics & Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
BMC Bioinformatics. 2025 Feb 13;26(1):52. doi: 10.1186/s12859-025-06065-9.
Massively parallel reporter assays (MPRAs) are an experimental technology for measuring the activity of thousands of candidate regulatory sequences or their variants in parallel, where the activity of individual sequences is measured from pools of sequence-tagged reporter genes. Activity is derived from the ratio of transcribed RNA counts to input DNA counts of the tag sequences, so-called barcodes, associated with each reporter construct. Recently, tools specifically designed to analyze MPRA data have been developed that attempt to model the count data, accounting for its inherent variation. Of these tools, MPRAnalyze and mpralm are the most widely used. MPRAnalyze models barcode counts to estimate the transcription rate of each sequence. While it has greater statistical power and robustness against outliers than mpralm, it is slow and has a high false discovery rate. The tool mpralm, built on the R package limma, estimates log fold-changes between different sequences. In contrast to MPRAnalyze, it is fast and has a low false discovery rate, but it is susceptible to outliers and has less statistical power.
We propose BCalm, an MPRA analysis framework aimed at addressing the limitations of the existing tools. BCalm is an adaptation of mpralm that models individual barcode counts instead of aggregating counts per sequence. Leaving out the aggregation step increases statistical power and improves robustness to outliers, while the method remains fast and precise. We show the improved performance over existing methods on both simulated MPRA data and a lentiviral MPRA library of 166,508 target sequences, including 82,258 allelic variants. Further, BCalm adds functionality beyond the existing mpralm package, such as preparing count input files from MPRAsnakeflow and an option to test for sequences with enhancing or repressing activity. Its built-in plotting functions allow for easy interpretation of the results.
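The following toy example illustrates why keeping barcode-level observations can help with outliers. It is a sketch under stated assumptions, not BCalm's actual model (which adapts mpralm's linear-model approach): here the per-barcode values are simply summarized by their median, and the counts and helper name are hypothetical.

```python
import math
import statistics

def log_ratio(rna, dna, pseudocount=1.0):
    """log2(RNA/DNA) with a pseudocount (an assumption) to avoid log(0)."""
    return math.log2((rna + pseudocount) / (dna + pseudocount))

# Hypothetical sequence with five barcodes; the last RNA count is an outlier.
dna_counts = [10, 10, 10, 10, 10]
rna_counts = [10, 10, 10, 10, 1000]

# Aggregating first: the outlier barcode dominates the summed counts.
aggregated = log_ratio(sum(rna_counts), sum(dna_counts))

# Keeping barcodes separate: each barcode contributes one observation,
# so a robust summary (here the median) is barely affected by the outlier.
per_barcode = [log_ratio(r, d) for r, d in zip(rna_counts, dna_counts)]
robust_summary = statistics.median(per_barcode)

print(f"aggregated: {aggregated:.2f}, per-barcode median: {robust_summary:.2f}")
```

Barcode-level values also expose the spread between barcodes, which a single aggregated ratio per sequence cannot show.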
With BCalm, we provide a new tool for analyzing MPRA data which is robust and accurate on real MPRA datasets. The package is available at https://github.com/kircherlab/BCalm .