College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK.
Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK.
J Transl Med. 2019 May 14;17(1):155. doi: 10.1186/s12967-019-1912-5.
Translational medicine (TM) is an emerging domain that aims to move medical and biological advances efficiently from the scientist to the clinician. Central to the TM vision is narrowing the gap between basic and applied science in terms of time, cost and early diagnosis of the disease state. Biomarker identification is one of the main challenges within TM. Identifying disease biomarkers from -omics data will not only help stratify diverse patient cohorts but will also provide early diagnostic information that could improve patient management and potentially prevent adverse outcomes. However, biomarker identification needs to be robust and reproducible. Hence, a robust, unbiased computational framework that helps clinicians identify those biomarkers is necessary.
We developed a pipeline (workflow) that includes two supervised classification techniques based on regularization methods to identify biomarkers from -omics or other high-dimensional clinical datasets. The pipeline includes several important steps, such as quality control and an assessment of the stability of the selected biomarkers. The process takes input files (outcome and independent variables or -omics data) and pre-processes them (normalization, handling of missing values). After a random division of samples into training and test sets, the Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net feature selection methods are applied to identify the most important features, which represent potential biomarker candidates. The penalization parameters are optimised using 10-fold cross-validation, and the process undergoes 100 iterations and a combinatorial analysis to select the best-performing multivariate model. An empirical, unbiased assessment of the selected features' quality as biomarkers for clinical use is performed through a Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC) analysis on both permuted and real data for 1000 different randomized training and test sets. We validated this pipeline against previously published biomarkers.
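The workflow described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes synthetic data, a squared-error loss in place of a classification likelihood, a fixed penalty `lam` rather than one tuned by 10-fold cross-validation, and 30 random splits rather than 100 or 1000. All function names (`elastic_net`, `run`, `make_data`, and so on) are hypothetical and chosen only for illustration. Setting `alpha=1.0` gives the LASSO penalty and `alpha=0.5` an Elastic Net mixture.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used by coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net(X, y, lam, alpha=1.0, sweeps=50):
    """Coordinate descent for squared-error loss with an elastic net penalty.
    Assumption: a linear model on the 0/1 outcome stands in for a classifier."""
    n = len(X)
    cols = list(zip(*X))
    sq = [sum(c * c for c in col) / n for col in cols]
    beta = [0.0] * len(cols)
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            rho = sum(c * r for c, r in zip(col, resid)) / n + sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
        shift = sum(resid) / n  # re-centre the intercept after each sweep
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta

def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def split(labels, rng, test_frac=0.3):
    """Random train/test split, stratified so both classes appear in each set."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        k = max(1, int(len(idx) * test_frac))
        test += idx[:k]
        train += idx[k:]
    return train, test

def run(X, y, rng, n_iter=30, lam=0.1, alpha=1.0):
    """Repeat split/fit/score; return per-feature selection frequency and mean test AUC."""
    counts = [0] * len(X[0])
    aucs = []
    for _ in range(n_iter):
        train, test = split(y, rng)
        b0, beta = elastic_net([X[i] for i in train], [y[i] for i in train], lam, alpha)
        for j, b in enumerate(beta):
            counts[j] += int(b != 0.0)
        scores = [b0 + sum(b * v for b, v in zip(beta, X[i])) for i in test]
        aucs.append(auc(scores, [y[i] for i in test]))
    return [c / n_iter for c in counts], sum(aucs) / n_iter

def make_data(rng, n=80, p=8, informative=(0, 1), shift=2.0):
    """Synthetic data (assumption): only the features in `informative` carry signal."""
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(0, 1) + (shift * y[i] if j in informative else 0.0)
          for j in range(p)] for i in range(n)]
    return X, y

rng = random.Random(0)
X, y = make_data(rng)
y_perm = y[:]
rng.shuffle(y_perm)  # permuted outcome: breaks the feature/outcome link

freq_lasso, auc_lasso = run(X, y, rng, alpha=1.0)
freq_enet, auc_enet = run(X, y, rng, alpha=0.5)
_, auc_perm = run(X, y_perm, rng, alpha=1.0)

print("LASSO selection frequency:", freq_lasso)
print("Elastic Net selection frequency:", freq_enet)
print("mean test AUC (LASSO, Elastic Net, permuted):", auc_lasso, auc_enet, auc_perm)
```

Features that are selected in nearly every split are the stable candidates, and the gap between the AUC on real and on permuted outcomes gives an empirical sense of how much of the performance exceeds chance. In practice one would tune `lam` by cross-validation, as the pipeline describes, and would likely rely on an established regularized-regression library rather than this hand-written solver.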
We applied this pipeline to three different datasets with previously published biomarkers: lipidomics data from Acharjee et al. (Metabolomics 13:25, 2017), and transcriptomics data from Rajamani and Bhasin (Genome Med 8:38, 2016) and Mills et al. (Blood 114:1063-1072, 2009). Our results demonstrate that our method was able to identify both previously published biomarkers and new variables that add value to the published results.
We developed a robust pipeline to identify clinically relevant biomarkers that can be applied to different -omics datasets. Such identification reveals potentially novel drug targets and can be used as part of a machine-learning-based patient stratification framework in translational medicine settings.