Giannoukakos Stavros, D'Ambrosi Silvia, Koppers-Lalic Danijela, Gómez-Martín Cristina, Fernandez Alberto, Hackenberg Michael
Department of Genetics, Faculty of Science, University of Granada, Granada, 18071, Spain.
Bioinformatics Laboratory, Biomedical Research Centre (CIBM), PTS, Granada, 18100, Spain.
Heliyon. 2024 Mar 12;10(6):e27360. doi: 10.1016/j.heliyon.2024.e27360. eCollection 2024 Mar 30.
Liquid biopsy-derived RNA sequencing (lbRNA-seq) exhibits significant promise for clinic-oriented cancer diagnostics due to its non-invasiveness and ease of repeatability. Despite substantial advancements, obstacles such as technical artefacts and a lack of process standardisation impede seamless clinical integration. Alongside technical issues such as normalising fluctuating low-input material and establishing a standardised clinical workflow, the lack of result validation using independent datasets remains a critical factor contributing to the often low reproducibility of liquid biopsy-detected biomarkers. Considering these drawbacks, our objective was to establish a workflow/methodology characterised by: 1. Harnessing the rich diversity of biological features accessible through lbRNA-seq data, encompassing a broad range of molecular and functional attributes. These features are integrated via a Machine Learning-based Ensemble Classification framework, enabling a unified and comprehensive analysis of the information encoded within the data. 2. Implementing and rigorously benchmarking intra-sample normalisation methods to heighten their relevance within clinical settings. 3. Thoroughly assessing the workflow's efficacy across independent test sets to ascertain its robustness and potential utility. Using ten datasets from several studies comprising three different sources of biological material, we first show that while the best-performing normalisation methods depend strongly on the dataset and on the Machine Learning method they are coupled with, the rather simple Counts Per Million method is generally very robust, showing comparable performance to cross-sample methods. Subsequently, we demonstrate that the innovative biofeature types introduced in this study, such as the Fraction of Canonical Transcript, harbour complementary information. Consequently, their inclusion consistently enhances prediction power compared to models relying solely on gene expression-based biofeatures. Finally, we demonstrate that the workflow is robust on completely independent datasets, generally from different labs and/or different protocols. Taken together, the workflow presented here outperforms commonly employed methods in prediction accuracy and may hold potential for clinical diagnostics applications due to its specific design.
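To illustrate why Counts Per Million counts as an intra-sample method, the following is a minimal sketch and not the authors' implementation. It rescales one sample's raw read counts using only that sample's own total, so no other samples are needed at normalisation time. The function name `counts_per_million` and the gene labels and counts are hypothetical and chosen purely for illustration.

```python
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale one sample's raw read counts so that they sum to one million.

    Only the sample's own library size is used, which is what makes
    this an intra-sample normalisation.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads; CPM is undefined")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical toy sample (illustrative values, not data from the study).
sample = {"GENE_A": 150, "GENE_B": 50, "GENE_C": 300}
cpm = counts_per_million(sample)
```

Because the scaling factor depends on a single library, a newly arriving sample can be normalised without access to the cohort that a model was trained on, whereas cross-sample methods estimate their factors from several samples jointly.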