Phenotype prediction from genome-wide association studies: application to smoking behaviors.

Authors

Yoon Dankyu, Kim Young Jin, Park Taesung

Affiliation

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-742, Korea.

Publication information

BMC Syst Biol. 2012;6 Suppl 2(Suppl 2):S11. doi: 10.1186/1752-0509-6-S2-S11. Epub 2012 Dec 12.

DOI:10.1186/1752-0509-6-S2-S11
PMID:23281841
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3521177/
Abstract

BACKGROUND

The success of genome-wide association studies has drawn attention to personal genomes and to clinical applications such as diagnosis and disease risk prediction. However, previous prediction studies using known disease-associated loci have not been successful (area under the curve of 0.55 to 0.68 for type 2 diabetes and coronary heart disease). Several factors explain the poor predictability: the small number of known disease-associated loci, simple analyses that do not account for phenotypic complexity, and the limited number of features used for prediction.
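The area under the curve quoted above can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, so 0.5 is chance level and 1.0 is perfect ranking. The following is a minimal sketch of that pairwise definition in plain Python; the function name `auc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: iterable of 0/1 values (1 = case, 0 = control).
    scores: predicted risk for each individual, higher = more likely a case.
    A tie between a case and a control counts as half a correct ordering.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                correct += 1.0
            elif c == d:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Toy example (hypothetical values): 3 cases and 3 controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(auc(labels, scores))  # 7 of the 9 case-control pairs are ordered correctly
```

On this scale, the 0.55 to 0.68 range reported for earlier studies means a case outranks a control in only slightly more than half of all pairs.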

METHODS

In this study, we thoroughly investigated how the feature selection method and the prediction algorithm affect prediction performance. We considered the following feature selection and prediction methods: regression analysis, regularized regression analysis, linear discriminant analysis, non-linear support vector machine, and random forest. For each method, we studied the effects of feature selection and of the number of features on prediction. The investigation was based on 8,842 Korean individuals genotyped with the Affymetrix SNP array 5.0, with smoking behaviors as the predicted phenotype.
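The design described above (choose SNPs with one method, predict with another, vary the number of SNPs, score by area under the curve) can be sketched with scikit-learn. This is only an illustration under stated assumptions: the abstract does not name the software, hyperparameters, or validation scheme used, the genotypes below are synthetic, and the helper `fit_and_score` is hypothetical. Elastic-net selection followed by an RBF support vector machine is just one of the method combinations the study compares.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for genotype data (NOT the study's data):
# 600 individuals x 300 SNPs coded as minor-allele counts 0/1/2,
# with the first 10 SNPs given an assumed effect on the binary phenotype.
n_samples, n_snps, n_causal = 600, 300, 10
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
effects = np.zeros(n_snps)
effects[:n_causal] = 0.8
logit = (X - 1.0) @ effects
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def fit_and_score(max_snps):
    """Select up to `max_snps` SNPs with elastic-net, predict with an RBF SVM."""
    selector = SelectFromModel(
        LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1, max_iter=5000
        ),
        max_features=max_snps,
        threshold=-np.inf,  # rank by |coefficient| and keep the top max_snps
    )
    # Selection sits inside the pipeline, so it is fit on training data only.
    model = make_pipeline(StandardScaler(), selector, SVC(kernel="rbf", gamma="scale"))
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.decision_function(X_test))


# Test-set AUC as a function of the number of selected SNPs.
aucs = {k: fit_and_score(k) for k in (10, 50, 100)}
for k, value in aucs.items():
    print(f"{k:>3} SNPs: AUC = {value:.3f}")
```

Swapping the selector's estimator or the final classifier for linear discriminant analysis, random forest, or plain logistic regression gives the other combinations; keeping the selector inside the pipeline avoids choosing SNPs with knowledge of the test individuals.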

RESULTS

To observe the effect of feature selection methods on prediction performance, the selected features were used for prediction and the area under the curve (AUC) was measured. For feature selection, support vector machine (SVM) and elastic-net (EN) performed better than linear discriminant analysis (LDA), random forest (RF), and simple logistic regression (LR). For prediction, SVM showed the best performance based on AUC. With fewer than 100 SNPs, EN was the best prediction method, whereas SVM was the best when more than 400 SNPs were used.

CONCLUSIONS

Across the combinations of feature selection and prediction methods, SVM showed the best performance in both feature selection and prediction.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6053/3521177/d20031aa1688/1752-0509-6-S2-S11-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6053/3521177/3b37884cf5c8/1752-0509-6-S2-S11-1.jpg

