Department of Pharmacology and Toxicology, University of Louisville, Louisville, KY, USA.
Department of Bioengineering, University of Louisville, Louisville, KY, USA.
Metabolomics. 2021 Mar 27;17(4):37. doi: 10.1007/s11306-021-01787-2.
The identification of metabolomic biomarkers predictive of cancer patient response to therapy and of disease stage has been pursued as a "holy grail" of modern oncology, relying on the metabolic dysfunction that characterizes cancer progression. In spite of the evaluation of many candidate biomarkers, however, determination of a consistent set with practical clinical utility has proven elusive.
In this study, we systematically examine the combined effect of data pre-treatment and imputation methods on the performance of multivariate data analysis methods and on their identification of potential biomarkers.
Uniquely, we are able to systematically evaluate both unsupervised and supervised methods with a metabolomic data set obtained from patient-derived lung cancer core biopsies with true missing values. Eight pre-treatment methods, ten imputation methods, and two data analysis methods were applied in combination.
The combined choice of pre-treatment and imputation methods is critical in the definition of candidate biomarkers: deficient or inappropriate selection of these methods leads to inconsistent results, with important biomarkers either overlooked or reported as false positives. The log transformation appeared to normalize the original tumor data most effectively, but the performance of the imputation applied after the transformation was highly dependent on the characteristics of the data set.
The combined choice of pre-treatment and imputation methods may need careful evaluation prior to metabolomic data analysis of human tumors, in order to enable consistent identification of potential biomarkers predictive of response to therapy and of disease stage.
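As an illustration of what such an evaluation might look like, the following minimal Python sketch loops over combinations of pre-treatment and imputation and compares how an unsupervised method ranks the metabolites. Everything in it is an assumption made for illustration: the data are synthetic, the three pre-treatments and two imputations are generic stand-ins rather than the eight and ten methods evaluated in the study, and ranking by absolute loading on the first principal component is only one simple proxy for biomarker selection.

```python
import itertools
import numpy as np

# Synthetic intensity matrix (samples x metabolites); NOT the study's data.
rng = np.random.default_rng(0)
X_raw = rng.lognormal(mean=3.0, sigma=1.0, size=(20, 6))
# Mimic left-censored missingness: the lowest 10% of intensities go missing.
X_raw[X_raw < np.quantile(X_raw, 0.1)] = np.nan

# Hypothetical pre-treatments; each one leaves NaN entries as NaN.
def no_treatment(X):
    return X.copy()

def log_transform(X):
    return np.log2(X)

def autoscale(X):
    # Mean-center and scale each metabolite to unit variance, ignoring NaN.
    return (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0, ddof=1)

# Hypothetical imputations, applied after pre-treatment.
def impute_min(X):
    return np.where(np.isnan(X), np.nanmin(X, axis=0), X)

def impute_median(X):
    return np.where(np.isnan(X), np.nanmedian(X, axis=0), X)

def pc1_ranking(X):
    # Rank metabolites by absolute loading on the first principal component.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.argsort(-np.abs(Vt[0]))

pretreatments = {"none": no_treatment, "log2": log_transform, "autoscale": autoscale}
imputations = {"min": impute_min, "median": impute_median}

# Evaluate every pre-treatment / imputation combination.
results = {}
for (p_name, p_fn), (i_name, i_fn) in itertools.product(
    pretreatments.items(), imputations.items()
):
    X = i_fn(p_fn(X_raw))
    results[(p_name, i_name)] = pc1_ranking(X)

for combo, ranking in results.items():
    print(combo, ranking.tolist())
```

Comparing the printed rankings across combinations gives a quick check of how sensitive the top-ranked metabolites are to these choices. A fuller evaluation would also include a supervised method and the specific pre-treatment and imputation methods under consideration.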