Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study.

Authors

Molotkov Ivan, Artomov Mykyta

Affiliations

The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, United States.

Department of Pediatrics, The Ohio State University, Columbus, OH, United States.

Publication information

Bioinform Adv. 2023 Sep 14;3(1):vbad128. doi: 10.1093/bioadv/vbad128. eCollection 2023.

DOI:10.1093/bioadv/vbad128
PMID:37745001
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10517638/
Abstract

MOTIVATION

Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justifications, simply for the sake of convenience.
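The unbiasedness claim can be made concrete for recall. The short derivation below is a sketch in notation that does not appear in the abstract and is assumed here: y is the true label, s = 1 marks an example that carries a known positive label, f(x) is the model's binary prediction, and SCAR is read as P(s = 1 | x, y = 1) = c for a constant c.

```latex
\begin{align*}
P(f(x)=1 \mid s=1)
  &= \frac{P(f(x)=1,\ s=1 \mid y=1)}{P(s=1 \mid y=1)}
     && \text{(every labeled example is positive)} \\
  &= \frac{\mathbb{E}\left[\mathbf{1}\{f(x)=1\}\, P(s=1 \mid x,\, y=1) \,\middle|\, y=1\right]}{c} \\
  &= \frac{c\, P(f(x)=1 \mid y=1)}{c}
   = P(f(x)=1 \mid y=1).
\end{align*}
```

Under this reading, recall measured on the labeled positives equals recall over all positives. The cancellation in the last step relies on the labeling probability not depending on x; if it does, the two quantities can differ.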

RESULTS

We provide an algorithm that under the weak assumptions of a lower bound on the number of positive examples can test for the violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can significantly overestimate the performance of models. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications.
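How a SCAR violation can inflate a performance estimate is easy to see in a toy simulation. The sketch below is not the authors' detection algorithm (that is in the repository cited under Availability); it only compares recall computed on all positives with recall computed on labeled positives, under a SCAR labeling and under a biased labeling. The score distribution, the threshold, the labeling probabilities, and the function name `simulate_recall_estimates` are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_recall_estimates(n_pos=200_000, threshold=0.0, seed=0):
    """Compare true recall with recall estimated on labeled positives.

    Every distribution and labeling mechanism here is a hypothetical
    assumption chosen for illustration, not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Model scores for the true positives (assumed standard normal).
    scores = rng.normal(loc=0.0, scale=1.0, size=n_pos)
    predicted_pos = scores > threshold

    # Recall over all positives: the quantity we would like to estimate.
    true_recall = predicted_pos.mean()

    # SCAR labeling: each positive is labeled with the same probability.
    scar_labeled = rng.random(n_pos) < 0.3
    scar_recall = predicted_pos[scar_labeled].mean()

    # Biased labeling (assumed): positives with higher scores are more
    # likely to be labeled, so labeling depends on the model's inputs.
    label_prob = 1.0 / (1.0 + np.exp(-2.0 * scores))
    biased_labeled = rng.random(n_pos) < label_prob
    biased_recall = predicted_pos[biased_labeled].mean()

    return true_recall, scar_recall, biased_recall


true_recall, scar_recall, biased_recall = simulate_recall_estimates()
print(f"recall on all positives:          {true_recall:.3f}")
print(f"recall on SCAR-labeled positives: {scar_recall:.3f}")
print(f"recall on biased-label positives: {biased_recall:.3f}")
```

In this toy setup the SCAR estimate tracks the true recall closely, while the biased labeling yields a visibly higher estimate. That gap is the kind of inflation the abstract calls validation bias.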

AVAILABILITY AND IMPLEMENTATION

Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/098f/10517638/34d9a7e65743/vbad128f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/098f/10517638/12002bd34921/vbad128f1.jpg
