LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.

Authors

Fithian William, Hastie Trevor

Affiliation

Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305-4065, USA.

Publication details

Ann Stat. 2014 Oct 1;42(5):1693-1724. doi: 10.1214/14-AOS1220.

DOI: 10.1214/14-AOS1220
PMID: 25492979
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4258397/

Abstract

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by a factor greater than 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
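
The procedure summarized above can be sketched in a few lines. The abstract does not spell out the acceptance rule or the adjustment, so the sketch assumes the form given in the full paper as we understand it: accept a point (x, y) with probability |y − p̃(x)|, where p̃(x) is the pilot model's fitted probability, fit an ordinary logistic regression on the accepted points, then add the pilot coefficients back. The simulated data, the separately drawn pilot sample, and all function names (`fit_logistic`, `local_case_control`, `simulate`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Logistic regression by Newton's method; X must include an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # The tiny ridge term is only there for numerical stability.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept-reject, fit on the subsample, adjust."""
    p_pilot = sigmoid(X @ theta_pilot)
    # Assumed acceptance rule: keep points whose response is "surprising"
    # under the pilot model, i.e. with probability |y - p_pilot|.
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: add the pilot coefficients back.
    return theta_sub + theta_pilot, keep

# Hypothetical simulation: a correctly specified, heavily imbalanced model.
rng = np.random.default_rng(0)
theta_true = np.array([-4.0, 1.0, 1.0])

def simulate(n):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)
    return X, y

# Independent pilot sample (an assumption made to keep the sketch short).
X_pilot, y_pilot = simulate(20_000)
theta_pilot = fit_logistic(X_pilot, y_pilot)

X, y = simulate(200_000)
theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
print("fraction of data kept:", keep.mean())
print("adjusted estimate:", theta_hat)
```

When the pilot is accurate, the expected acceptance probability is 2·E[p(1 − p)], which is at most about twice the marginal case rate; this is why the subsample stays small when the data are severely imbalanced.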
