Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies.

Authors

Krautenbacher Norbert, Theis Fabian J, Fuchs Christiane

Affiliations

Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Munich, Germany.

Department of Mathematics, Technische Universität München, Munich, Germany.

Publication information

Comput Math Methods Med. 2017;2017:7847531. doi: 10.1155/2017/7847531. Epub 2017 Sep 24.

DOI: 10.1155/2017/7847531
PMID: 29312464
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5632994/
Abstract

Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when applying classifiers on nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits from only the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and reason for different behaviors between the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package .
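
To make the idea of correcting an enriched sample concrete, the following is a minimal sketch of plain inverse-probability resampling: each record in the stratified sample is redrawn with a weight equal to the inverse of its selection probability, so that the resampled data approximate the class balance of the unstratified population. This is not the authors' implementation. Stochastic inverse-probability oversampling and parametric inverse-probability bagging involve further steps that the abstract does not spell out. The function name, the toy data, and the selection probabilities are all hypothetical.

```python
import random


def inverse_probability_resample(records, selection_probs, n_draws, seed=0):
    """Draw n_draws records with replacement, weighting each by 1 / selection probability.

    Hypothetical helper for illustration only; selection_probs[i] is the assumed
    probability that records[i] was selected into the stratified sample.
    """
    if len(records) != len(selection_probs):
        raise ValueError("records and selection_probs must have the same length")
    if any(p <= 0.0 or p > 1.0 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in selection_probs]
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=n_draws)


# Toy stratified sample (assumed numbers): every case was kept,
# but only 1% of controls were kept, so cases make up half of the sample.
sample = [("case", 1)] * 100 + [("control", 0)] * 100
probs = [1.0] * 100 + [0.01] * 100

resampled = inverse_probability_resample(sample, probs, n_draws=10_000, seed=42)
case_fraction = sum(label for _, label in resampled) / len(resampled)
print(f"case fraction in biased sample: 0.50, after resampling: {case_fraction:.4f}")
```

In this toy setup the expected case fraction after resampling is 100 / 10100, roughly 1%, compared with 50% in the enriched sample. A classifier trained on the resampled data therefore sees a class balance closer to the one it would meet on unstratified data.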

