Measuring the effect of inter-study variability on estimating prediction error.

Authors

Ma Shuyi, Sung Jaeyun, Magis Andrew T, Wang Yuliang, Geman Donald, Price Nathan D

Affiliations

Institute for Systems Biology, Seattle, Washington, United States of America; Department of Chemical and Biomolecular Engineering, University of Illinois, Urbana, Illinois, United States of America.

Institute for Systems Biology, Seattle, Washington, United States of America; Asia Pacific Center for Theoretical Physics, Pohang, Gyeongbuk, Republic of Korea.

Publication information

PLoS One. 2014 Oct 17;9(10):e110840. doi: 10.1371/journal.pone.0110840. eCollection 2014.

DOI:10.1371/journal.pone.0110840
PMID:25330348
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4201588/

Abstract

BACKGROUND

The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.

METHODS

Here we quantify the impact of these combined "study-effects" on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of number of studies quantifies influence of study-effects on performance.
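
The contrast between the two validation schemes can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic data generator, the nearest-centroid classifier, and every function name and parameter below (`simulate`, `rcv_folds`, `isv_folds`, `study_sd`, and so on) are assumptions chosen for illustration. Each simulated study adds its own random offset to all of its samples, standing in for a "study-effect". RCV splits samples at random, so training and test data share the same studies; ISV holds out one whole study at a time.

```python
import random


def rcv_folds(n_samples, k, rng):
    """Randomized cross-validation: shuffle sample indices and split into k folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]


def isv_folds(study_ids):
    """Inter-study validation: hold out every sample of one study at a time."""
    folds = []
    for s in sorted(set(study_ids)):
        test = [i for i, sid in enumerate(study_ids) if sid == s]
        train = [i for i, sid in enumerate(study_ids) if sid != s]
        folds.append((train, test))
    return folds


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_accuracy(X, y, folds):
    """Mean test accuracy of a nearest-centroid classifier over the given folds."""
    accs = []
    for train, test in folds:
        cents = {
            label: centroid([X[i] for i in train if y[i] == label])
            for label in set(y[i] for i in train)
        }
        correct = 0
        for i in test:
            pred = min(
                cents,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], cents[c])),
            )
            correct += pred == y[i]
        accs.append(correct / len(test))
    return sum(accs) / len(accs)


def simulate(n_studies, per_study=40, n_features=5, study_sd=2.0, seed=0):
    """Hypothetical data: class signal + per-study offset + unit Gaussian noise."""
    rng = random.Random(seed)
    X, y, study_ids = [], [], []
    for s in range(n_studies):
        offset = [rng.gauss(0, study_sd) for _ in range(n_features)]
        for j in range(per_study):
            label = j % 2  # two balanced phenotypes in every study
            X.append([label + offset[f] + rng.gauss(0, 1) for f in range(n_features)])
            y.append(label)
            study_ids.append(s)
    return X, y, study_ids, rng


if __name__ == "__main__":
    # Track the RCV-ISV gap as the number of studies grows.
    for n_studies in (2, 4, 8, 16):
        X, y, study_ids, rng = simulate(n_studies)
        rcv = nearest_centroid_accuracy(X, y, rcv_folds(len(X), 5, rng))
        isv = nearest_centroid_accuracy(X, y, isv_folds(study_ids))
        print(n_studies, round(rcv, 3), round(isv, 3), round(rcv - isv, 3))
```

The last printed column is the RCV-ISV difference for each number of studies, the quantity the abstract proposes as a measure of study-effects. The numbers produced by this toy simulation depend on the assumed offset size and seed and say nothing about the lung data analyzed in the paper.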

RESULTS

As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification.

CONCLUSIONS

We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e74/4201588/6e9922e03f2a/pone.0110840.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e74/4201588/5062da8f07ec/pone.0110840.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e74/4201588/88b0fbeebac4/pone.0110840.g002.jpg
