Re-evaluation of publicly available gene-expression databases using machine-learning yields a maximum prognostic power in breast cancer.

Authors

Tschodu Dimitrij, Lippoldt Jürgen, Gottheil Pablo, Wegscheider Anne-Sophie, Käs Josef A, Niendorf Axel

Affiliations

Peter Debye Institute for Soft Matter Physics, Leipzig University, 04103, Leipzig, Germany.

Institute for Histology, Cytology and Molecular Diagnostics, MVZ Prof. Dr. med. A. Niendorf Pathologie Hamburg-West GmbH, 22767, Hamburg, Germany.

Publication information

Sci Rep. 2023 Oct 5;13(1):16402. doi: 10.1038/s41598-023-41090-9.

DOI:10.1038/s41598-023-41090-9

PMID:37798300

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10556090/

Abstract

Gene expression signatures are patterns of gene activity used to classify different types of cancer, determine prognosis, and guide treatment decisions. Advances in high-throughput technology and machine learning have improved the prediction of a patient's prognosis for different cancer phenotypes. However, computational methods for analyzing signatures have not been used to evaluate their prognostic power, and the utility of gene expression signatures for prognosis remains contested. The prevalent approaches to constructing an improved signature are random signatures, expert knowledge, and machine learning. We unify these approaches to evaluate their prognostic power. Re-evaluation of publicly available gene-expression data from 8 databases with 9 machine-learning models revealed previously unreported results. Gene-expression signatures are confirmed to be useful in predicting a patient's prognosis. Convergent evidence from approximately 10,000 signatures implicates a maximum prognostic power. By calculating the concordance index, which measures how well patients with different prognoses can be discriminated, we show that a signature can correctly discriminate patients' prognoses no more than 80% of the time. Additionally, we show that more than 50% of the potentially available information is still missing at this value. We surmise that an accurate prognosis must incorporate molecular, clinical, histological, and other complementary factors.
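To make the concordance index concrete, the sketch below computes one common variant, Harrell's C, on right-censored survival data. It is an illustration only: the abstract does not say which variant or software the authors used, and the function name, argument names, and toy data are hypothetical. The idea is to look at every pair of patients whose order of events is known and count how often the patient with the higher predicted risk is the one whose event came first.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable patient pairs ranked correctly.

    times       -- follow-up time per patient
    events      -- True/1 if the event was observed, False/0 if censored
    risk_scores -- predicted risk; higher should mean earlier event
    (All names are illustrative assumptions, not taken from the paper.)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: pairs with tied times are skipped, and a pair is
        # comparable only if the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk, earlier event: correct order
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four patients, the third one is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.3, 0.5, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

In this toy example five pairs are comparable and four of them are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to uninformative predictions and 1.0 to a perfect ranking, which is the scale on which the reported ceiling of about 80% should be read.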
