Faculty of Synthetic Biology, Shenzhen University of Advanced Technology, Shenzhen Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Nanshan District, Shenzhen, 518055, China.
State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, 2199 Lishui Avenue, Nanshan District, Shenzhen, 518055, China.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae532.
We introduce the minimum free energy-based Gaussian Self-Benchmarking (MFE-GSB) framework, designed to correct the many biases inherent in RNA-seq data. Central to the methodology is the MFE concept, which permits a Gaussian distribution model tailored to mitigate all co-existing biases within a k-mer counting scheme. The MFE-GSB framework operates on a dual-model system that compares modeling data, in which k-mers are uniformly distributed, against real, observed sequencing data, in which k-mer distributions are nonuniform. The framework applies a Gaussian function, guided by predetermined parameters (mean and SD) derived from the modeling data, to fit unknown sequencing data. This dual comparison allows accurate prediction of k-mer abundances across MFE categories, enabling simultaneous correction of biases at the single k-mer level. Validation with both engineered RNA constructs and human tissue RNA samples demonstrates the framework's wide-ranging efficacy and applicability.
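The fitting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that k-mers have already been grouped into MFE categories, that the mean and SD are taken as the count-weighted mean and SD of the modeling data, and that only the Gaussian amplitude is fitted to the observed counts by least squares. The function names, the MFE bin values, and all counts below are hypothetical.

```python
import math

def gaussian(x, amplitude, mean, sd):
    """Gaussian curve evaluated at x."""
    return amplitude * math.exp(-0.5 * ((x - mean) / sd) ** 2)

def weighted_mean_sd(values, weights):
    """Count-weighted mean and population SD of the values."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)

def fit_amplitude(mfe_bins, observed, mean, sd):
    """Least-squares amplitude with mean and SD held fixed (assumed fitting rule)."""
    shape = [gaussian(x, 1.0, mean, sd) for x in mfe_bins]
    numerator = sum(s * o for s, o in zip(shape, observed))
    denominator = sum(s * s for s in shape)
    return numerator / denominator

def predict_counts(mfe_bins, observed, mean, sd):
    """Predicted k-mer abundance per MFE category from the fitted Gaussian."""
    amplitude = fit_amplitude(mfe_bins, observed, mean, sd)
    return [gaussian(x, amplitude, mean, sd) for x in mfe_bins]

# Hypothetical MFE categories (kcal/mol) and k-mer counts per category.
mfe_bins = [-8.0, -6.0, -4.0, -2.0, 0.0]
model_counts = [5, 30, 60, 30, 5]      # modeling data: uniform k-mer distribution
observed_counts = [2, 41, 55, 18, 9]   # sequencing data: nonuniform k-mer counts

# Mean and SD are fixed from the modeling data, then reused for the observed data.
mean, sd = weighted_mean_sd(mfe_bins, model_counts)
predicted = predict_counts(mfe_bins, observed_counts, mean, sd)

for x, obs, pred in zip(mfe_bins, observed_counts, predicted):
    print(f"MFE {x:6.1f}: observed {obs:3d}, predicted {pred:7.2f}")
```

Holding the mean and SD fixed is what makes the modeling data act as a benchmark in this sketch: the observed counts can only rescale the curve, so any category whose observed count departs from the predicted value is flagged as a candidate for correction.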