Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment.

Affiliation

Information Engineering Department, University of Padova, Padova, Italy.

Publication information

PLoS One. 2012;7(3):e32200. doi: 10.1371/journal.pone.0032200. Epub 2012 Mar 5.

DOI: 10.1371/journal.pone.0032200

PMID: 22403633

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3293892/

Abstract

MOTIVATION

Identifying robust lists of molecular biomarkers related to a disease is a fundamental step toward early diagnosis and treatment. However, biomarker discovery methods applied to microarray data often produce lists with limited overlap. These differences are attributable to 1) dataset size (few subjects relative to the number of features); 2) heterogeneity of the disease; and 3) heterogeneity of the experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, on both simulated data (generated through an in silico regulation network model) and real clinical datasets, the consistency of the candidate biomarkers provided by a number of different methods.

METHODS

We extensively simulated the effect of the heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both the intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling the evolution of a pool of subjects; a subset of these subjects then underwent alterations in regulatory mechanisms to mimic the disease state.
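The two sources of heterogeneity described above can be illustrated with a toy sketch. It is not the authors' in silico regulation network model and does not model population evolution. It only assumes a hypothetical two-layer network in which each regulator drives a block of genes, per-subject regulator activity varies (population variability), and each diseased subject carries a random subset of regulator knock-downs (disease heterogeneity). All function names, parameters, and numeric values are illustrative assumptions.

```python
import random

def simulate_subject(rng, n_genes, n_regulators, altered=()):
    """Return one expression profile from a toy two-layer network (assumption)."""
    # Population variability: each subject has its own regulator activities.
    activity = [rng.gauss(1.0, 0.2) for _ in range(n_regulators)]
    # Disease alteration: knock down the regulators altered in this subject.
    for r in altered:
        activity[r] *= 0.2
    # Gene g is driven by regulator g % n_regulators, plus measurement noise.
    return [activity[g % n_regulators] + rng.gauss(0.0, 0.1)
            for g in range(n_genes)]

def simulate_cohort(n_healthy=10, n_diseased=10, n_genes=12,
                    n_regulators=4, disease_regulators=(0, 1), seed=0):
    """Simulate healthy (label 0) and diseased (label 1) subjects."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_healthy):
        X.append(simulate_subject(rng, n_genes, n_regulators))
        y.append(0)
    for _ in range(n_diseased):
        # Heterogeneous disease: each patient alters a random, non-empty
        # subset of the disease-related regulators.
        k = rng.randint(1, len(disease_regulators))
        altered = rng.sample(list(disease_regulators), k)
        X.append(simulate_subject(rng, n_genes, n_regulators, altered))
        y.append(1)
    # Known biomarkers: genes driven by any disease-related regulator.
    true_biomarkers = {g for g in range(n_genes)
                       if g % n_regulators in disease_regulators}
    return X, y, true_biomarkers

X, y, true_biomarkers = simulate_cohort()
print(len(X), "subjects,", len(X[0]), "genes, known biomarkers:",
      sorted(true_biomarkers))
```

Because the altered regulators are known by construction, a benchmark of this kind makes it possible to score any feature selection method against a ground-truth biomarker list.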

RESULTS

The simulated data allowed us to outline the advantages and drawbacks of different methods across multiple studies and varying numbers of samples, and to evaluate the precision of feature selection on a benchmark with known biomarkers. Although the different methods reached comparable classification accuracy, using external cross-validation loops helped to find features with higher precision and stability. Application to real data confirmed these results.
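A minimal sketch of an external cross-validation loop is given below. In each fold, features are selected on the training subjects only, accuracy is measured on the held-out subjects, and the selected lists are compared across folds. The abstract does not name the methods or metrics used, so everything here is an assumption for illustration: the synthetic Gaussian data, the Welch t-statistic ranking, the nearest-centroid classifier, and the Jaccard index as the stability measure.

```python
import random
from statistics import mean, variance

def make_data(n_per_class=20, n_features=50, n_true=5, shift=2.0, seed=0):
    """Synthetic two-class data; features 0..n_true-1 are the known biomarkers."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_true):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y

def top_k_features(X, y, k):
    """Rank features by absolute Welch t-statistic; return the top k indices."""
    scored = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scored.append((abs(mean(a) - mean(b)) / se, j))
    scored.sort(reverse=True)
    return {j for _, j in scored[:k]}

def predict(X_train, y_train, x, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    dist = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        dist[lab] = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in feats)
    return min(dist, key=dist.get)

def external_cv(X, y, true_features, k_features=5, n_folds=5, seed=0):
    """Return (accuracy, stability, precision) from an external CV loop."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    selected, correct = [], 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection sees the training fold only, never the held-out subjects.
        feats = top_k_features(X_tr, y_tr, k_features)
        selected.append(feats)
        correct += sum(predict(X_tr, y_tr, X[i], feats) == y[i]
                       for i in test_idx)
    accuracy = correct / len(X)
    # Stability: mean pairwise Jaccard index between per-fold feature lists.
    pairs = [(a, b) for n, a in enumerate(selected) for b in selected[n + 1:]]
    stability = mean(len(a & b) / len(a | b) for a, b in pairs)
    # Precision: mean fraction of selected features that are true biomarkers.
    precision = mean(len(f & true_features) / k_features for f in selected)
    return accuracy, stability, precision

X, y = make_data()
acc, stab, prec = external_cv(X, y, true_features=set(range(5)))
print(f"accuracy={acc:.2f} stability={stab:.2f} precision={prec:.2f}")
```

Running the selection inside the loop keeps the held-out subjects out of the ranking step, and the per-fold lists give a direct way to quantify how much the candidate biomarkers change when the training subjects change.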
