Collective feature selection to identify crucial epistatic variants.

Authors

Verma Shefali S, Lucas Anastasia, Zhang Xinyuan, Veturi Yogasudha, Dudek Scott, Li Binglan, Li Ruowang, Urbanowicz Ryan, Moore Jason H, Kim Dokyoon, Ritchie Marylyn D

Affiliations

1. Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.

2. Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.

Publication information

BioData Min. 2018 Apr 19;11:5. doi: 10.1186/s13040-018-0168-6. eCollection 2018.

Abstract

BACKGROUND

Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detecting epistatic interactions remains a challenge because the input typically combines a large number of features with a relatively small sample size, the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is important to perform variable selection before searching for epistasis. Many feature selection methods have been proposed and evaluated, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.
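To make the scale of the problem concrete, the sketch below counts the feature pairs an exhaustive two-way interaction scan would have to test, before and after filtering. The feature counts and the function names are illustrative assumptions, not figures or code from the study.

```python
from math import comb


def pairwise_tests(n_features: int) -> int:
    """Number of distinct feature pairs in an exhaustive two-way scan."""
    return comb(n_features, 2)


def reduction_factor(n_features: int, keep_fraction: float) -> float:
    """How many times smaller the pairwise search space becomes
    when only `keep_fraction` of the features is retained."""
    kept = max(2, round(n_features * keep_fraction))
    return pairwise_tests(n_features) / pairwise_tests(kept)


# Hypothetical example: 100,000 variants, keeping the top 1%.
n_variants = 100_000
print(pairwise_tests(n_variants))   # 4999950000 candidate pairs
print(pairwise_tests(1_000))        # 499500 candidate pairs after filtering
```

Because the number of pairs grows quadratically with the number of features, keeping 1% of the features shrinks the pairwise search space by roughly four orders of magnitude.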

RESULTS

Through our simulation study, we propose a collective feature selection approach that selects the features in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to feature selection. We then chose the top performing methods and took the union of the variables they selected, where a user-defined percentage of top-ranked variants from each method is carried forward to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and gradient boosting, work best under other simulation criteria. A collective approach therefore proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. We then applied the proposed collective feature selection approach, selecting the top 1% of variables, to identify potential interacting variables associated with Body Mass Index (BMI) in approximately 44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of the DiscovEHR collaboration).
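The union step described above can be sketched as follows. This is a minimal illustration, assuming each method has already produced one importance score per variant; the function names, method labels, and scores are hypothetical and do not come from the authors' implementation.

```python
from math import ceil


def top_fraction(scores: dict[str, float], fraction: float) -> set[str]:
    """Return the highest-scoring features, keeping `fraction` of them."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def collective_selection(
    method_scores: dict[str, dict[str, float]], fraction: float
) -> set[str]:
    """Union of the top `fraction` of features from every method."""
    selected: set[str] = set()
    for scores in method_scores.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Hypothetical importance scores from three feature selection methods.
method_scores = {
    "mdr": {"snp1": 0.9, "snp2": 0.8, "snp3": 0.1, "snp4": 0.2},
    "ranger": {"snp1": 0.7, "snp2": 0.1, "snp3": 0.6, "snp4": 0.2},
    "gradient_boosting": {"snp1": 0.5, "snp2": 0.6, "snp3": 0.1, "snp4": 0.2},
}

selected = collective_selection(method_scores, fraction=0.5)
print(sorted(selected))  # ['snp1', 'snp2', 'snp3']
```

In this toy example, snp3 is ranked highly by only one method but is still retained, which is the point of taking a union rather than an intersection: a variant needs to be detected by just one of the chosen methods to reach downstream analysis.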

CONCLUSIONS

In this study, we showed through simulation that selecting variables with a collective feature selection approach could identify true positive epistatic variables more frequently than applying any single feature selection method. Our simulation analysis demonstrates the effectiveness of collective feature selection and compares it with many individual methods. We also applied our method to identify non-linear networks associated with obesity.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ea4/5907720/11c487d2a89f/13040_2018_168_Fig1_HTML.jpg
