MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine, Institute of Metabolic Science, Cambridge, UK; Precision Healthcare University Research Institute, Queen Mary University of London, London, UK.
MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine, Institute of Metabolic Science, Cambridge, UK; Computational Medicine, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany; Precision Healthcare University Research Institute, Queen Mary University of London, London, UK.
Lancet Digit Health. 2024 Jul;6(7):e470-e479. doi: 10.1016/S2589-7500(24)00087-6.
Broad-capture proteomic technologies have the potential to improve disease prediction, enabling targeted prevention and management, but studies have so far been limited to very few selected diseases and have not evaluated predictive performance across multiple conditions. We aimed to evaluate the potential of serum proteins to improve risk prediction over and above health-derived information and polygenic risk scores across a diverse set of 24 outcomes.
We designed multiple case-cohorts nested in the EPIC-Norfolk prospective study, from participants with available serum samples and genome-wide genotype data, with more than 32 974 person-years of follow-up. Participants were middle-aged individuals (aged 40-79 years at baseline) of European ancestry who were recruited from the general population of Norfolk, England, between March, 1993 and December, 1997. We selected participants who developed one of ten less common diseases within 10 years of follow-up; we also subsampled a randomly drawn control subcohort, which also served to investigate 14 more common outcomes (n>70), including all-cause premature mortality (death before the age of 75 years; case numbers 71-437; controls 608-1556). Individuals were excluded from the current study owing to failed genotyping or proteomic quality control, relatedness, or missing information on age, sex, BMI, or smoking status. We used a machine learning framework to derive sparse predictive protein models for the onset of the 23 individual diseases and all-cause premature mortality, and to derive a single common sparse multimorbidity signature that was predictive across multiple diseases from 2923 serum proteins.
Participants who developed one of ten less common diseases within 10 years of follow-up included 482 women and 507 men, with a mean age at baseline of 64·56 years (8·08). The random subcohort included 990 women and 769 men, with a mean age of 58·79 years (9·31). As few as five proteins alone outperformed polygenic risk scores for 17 of 23 outcomes (median difference in concordance index [C-index] 0·13 [0·10-0·17]) and improved predictive performance when added over basic patient-derived information models for seven outcomes, achieving a median C-index of 0·82 (IQR 0·77-0·82). This included diseases with poor prognosis such as lung cancer (C-index 0·85 [+/- cross-validation error 0·83-0·87]), for which we identified unreported biomarkers such as C-X-C motif chemokine ligand 17. A sparse multimorbidity signature of ten proteins improved prediction across seven outcomes over patient-derived information models, achieving performances (median C-index 0·81 [IQR 0·80-0·82]) similar to those of disease-specific signatures.
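For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C-index for right-censored follow-up data. It is an illustration only, not the study's implementation: the function name `concordance_index` and the toy data are hypothetical, pairs with tied follow-up times are skipped for simplicity, and no case-cohort weighting is applied.

```python
from itertools import combinations


def concordance_index(times, events, scores):
    """Harrell's C-index (simplified sketch, not the study's code).

    times  -- follow-up time for each participant
    events -- 1 if the outcome was observed, 0 if censored
    scores -- predicted risk (higher means earlier onset is expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that i has the shorter follow-up time.
        if times[j] < times[i]:
            i, j = j, i
        # A pair is comparable only if the shorter follow-up ended in an
        # observed event; tied times are skipped in this simplification.
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if scores[i] > scores[j]:
            concordant += 1.0   # higher risk score had the earlier event
        elif scores[i] == scores[j]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four participants, one censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.3, 0.5, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_scores)
print(toy_c)  # 4 of 5 comparable pairs are concordant -> 0.8
```

A value of 0·5 corresponds to random ranking and 1·0 to perfect ranking, so a difference of 0·13 between two models means that roughly 13 percentage points more comparable pairs are ordered correctly.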
We show the value of broad-capture proteomic biomarker discovery studies across multiple diseases of diverse causes, pointing to those that might benefit the most from proteomic approaches, and the potential to derive common sparse biomarker panels for prediction of multiple diseases at once. This framework could enable follow-up studies to explore the generalisability of proteomic models and to benchmark these against clinical assays, which are required to understand the translational potential of these findings.
Medical Research Council, Health Data Research UK, UK Research and Innovation-National Institute for Health and Care Research, Cancer Research UK, and Wellcome Trust.