Collective feature selection to identify crucial epistatic variants.

Author information

Verma Shefali S, Lucas Anastasia, Zhang Xinyuan, Veturi Yogasudha, Dudek Scott, Li Binglan, Li Ruowang, Urbanowicz Ryan, Moore Jason H, Kim Dokyoon, Ritchie Marylyn D

Affiliations

1. Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.

2. Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.

Publication information

BioData Min. 2018 Apr 19;11:5. doi: 10.1186/s13040-018-0168-6. eCollection 2018.

DOI:10.1186/s13040-018-0168-6

PMID:29713383

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5907720/

Abstract

BACKGROUND

Machine learning methods have become popular and practical tools for identifying linear and non-linear effects of variants associated with complex diseases and traits. Detecting epistatic interactions remains a challenge because the input typically combines a large number of features with a relatively small sample size, the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is important to perform variable selection before searching for epistasis. Many feature selection methods have been proposed and evaluated, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

RESULTS

Through our simulation study we propose a collective feature selection approach that selects the features in the "union" of the best-performing methods. We explored various parametric, non-parametric, and data mining approaches to feature selection. We then take the top-performing methods and form the union of the variables they select, where a user-defined percentage of variants is taken from each method for downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and gradient boosting, work best under other simulation criteria. A collective approach therefore proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. We then applied the proposed collective feature selection approach, selecting the top 1% of variables, to identify potential interacting variables associated with body mass index (BMI) in ~44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of the DiscovEHR collaboration).
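The union step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`top_fraction`, `collective_selection`), the method labels, and the toy importance scores are all hypothetical, and it assumes each method has already produced one score per variant, with higher scores meaning more important.

```python
import math
from typing import Dict, Set


def top_fraction(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Return the highest-scoring `fraction` of features from one method."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one feature so that a very small fraction still selects something.
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def collective_selection(
    scores_by_method: Dict[str, Dict[str, float]], fraction: float
) -> Set[str]:
    """Union of the top `fraction` of features chosen by each method."""
    selected: Set[str] = set()
    for scores in scores_by_method.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Toy, made-up importance scores (hypothetical; not taken from the paper).
toy_scores = {
    "mdr": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.2, "snp4": 0.3},
    "ranger": {"snp1": 0.2, "snp2": 0.8, "snp3": 0.1, "snp4": 0.3},
    "gradient_boosting": {"snp1": 0.7, "snp2": 0.6, "snp3": 0.1, "snp4": 0.2},
}

# Take the top 25% from each method, then form the union across methods.
selected = collective_selection(toy_scores, fraction=0.25)
```

In this toy case one method ranks `snp2` first while the other two rank `snp1` first, so the union keeps both variants, whereas any single method would have kept only one. The union can only grow as methods are added, so the per-method percentage controls how many variables reach downstream analysis.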

CONCLUSIONS

In this study, we showed through simulation that a collective feature selection approach can help select true positive epistatic variables more frequently than any single feature selection method. Our simulation analysis demonstrates the effectiveness of collective feature selection alongside a comparison of many methods. We also applied our method to identify non-linear networks associated with obesity.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ea4/5907720/11c487d2a89f/13040_2018_168_Fig1_HTML.jpg
