Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models.

Affiliations

Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK.

Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King's College London, London, UK.

Publication information

Sci Rep. 2021 Dec 2;11(1):23335. doi: 10.1038/s41598-021-00854-x.

DOI: 10.1038/s41598-021-00854-x

PMID: 34857774

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8640070/

Abstract

In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, presenting challenges to feature selection and risk prediction machine learning models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high-risk patients represents an important clinical research goal that would allow early intervention and a reduction of disability. This also provides an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed feature selection and PsA risk prediction models that were applied to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases using 2-digit HLA alleles imputed using the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounder features and illustrate that confounding features impact the feature selection. The mitigated dataset was used to train seven supervised machine learning methods: 80% of the data was randomly selected for training using stratified nested cross validation, and 20% was selected randomly as a holdout set for internal validation. The risk prediction models were then further validated in a UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance of these methods was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected based on three criteria: the lowest number of features in the subset, the maximal average AUC over the nested cross validation, and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset was mitigated, the single most important genetic feature based on rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of this data using regression based methods. However, the predictive accuracy of these single features after mitigation was found to be moderate (AUC = 0.54 (internal cross validation), AUC = 0.53 (internal holdout set), AUC = 0.55 (external dataset)). Sequentially adding additional HLA features based on rank improved the performance of the Random Forest classification model, where 20 2-digit features selected by Interaction Capping (ICAP) achieved AUC = 0.61 (internal cross validation), AUC = 0.57 (internal holdout set) and AUC = 0.58 (external dataset). The stratification method for mitigation of confounding features and filter information theoretic feature selection can be applied to a high-dimensional dataset with potential confounders.
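To make the idea of an information theoretic filter concrete, the following is a minimal sketch of greedy forward selection with the Interaction Capping (ICAP) criterion in its commonly cited form, J(X_k) = I(X_k;Y) - sum over selected X_j of max(0, I(X_k;X_j) - I(X_k;X_j|Y)). It is not the authors' implementation: the function names, the toy data and the variant names are all hypothetical, and the entropies are simple plug-in estimates on discrete columns. The toy data include two perfectly correlated variants to mimic redundancy of the kind caused by LD.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def icap_rank(features, labels, k):
    """Greedily pick up to k feature names using the ICAP score.

    features: dict mapping a feature name to a list of discrete values.
    labels:   list of discrete class labels, same length as each column.
    """
    relevance = {f: mutual_info(col, labels) for f, col in features.items()}
    penalty = {f: 0.0 for f in features}
    remaining = set(features)
    selected = []
    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=lambda f: relevance[f] - penalty[f])
        selected.append(best)
        remaining.remove(best)
        for f in remaining:
            redundancy = mutual_info(features[f], features[best])
            conditional = cond_mutual_info(features[f], features[best], labels)
            # Penalise redundancy only when it exceeds the conditional term.
            penalty[f] += max(0.0, redundancy - conditional)
    return selected


# Hypothetical toy data: 8 samples, binary carrier status, binary outcome.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy = {
    "variant_a": [1, 1, 1, 0, 0, 0, 0, 1],
    "variant_b_in_ld": [1, 1, 1, 0, 0, 0, 0, 1],  # exact copy of variant_a
    "variant_c": [1, 0, 1, 0, 1, 0, 0, 0],        # weaker, non-redundant
}

ranking = icap_rank(toy, labels, k=2)
print(ranking)
```

In this toy run the duplicated variant is pushed behind the weaker but non-redundant one, which is the behaviour that makes such criteria attractive for correlated genetic features. A ranking of this kind would typically be computed inside each training fold only, so that the holdout and external sets stay untouched by feature selection.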

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/07be/8640070/cd2bceeb3d0c/41598_2021_854_Fig1_HTML.jpg
