A structured approach to predictive modeling of a two-class problem using multidimensional data sets.

Affiliation

Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA.

Publication information

Methods. 2013 May 15;61(1):73-85. doi: 10.1016/j.ymeth.2013.01.002. Epub 2013 Jan 12.

DOI:10.1016/j.ymeth.2013.01.002

PMID:23321025

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3661737/

Abstract

Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
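The sequence named in the abstract (pre-processing, feature reduction, supervised classification, model assessment) can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline: the abstract does not name specific algorithms, so the mean-difference feature ranking, the nearest-centroid classifier, leave-one-out validation, and the synthetic data generator `make_toy_data` are all assumptions chosen for brevity.

```python
import random
import statistics


def standardize(train, test):
    """Scale features with means and SDs estimated on the training rows only."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

    return scale(train), scale(test)


def top_features(X, y, k):
    """Feature reduction: keep the k features with the largest class-mean gap."""
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r, lab in zip(X, y) if lab == 0]
        b = [r[j] for r, lab in zip(X, y) if lab == 1]
        scores.append((abs(statistics.fmean(a) - statistics.fmean(b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X, y, x_new):
    """Assign x_new to the class whose training centroid is closest."""
    best_dist, best_lab = None, None
    for lab in (0, 1):
        rows = [r for r, l in zip(X, y) if l == lab]
        centroid = [statistics.fmean(c) for c in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x_new, centroid))
        if best_dist is None or dist < best_dist:
            best_dist, best_lab = dist, lab
    return best_lab


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy; scaling and selection are refit in every fold."""
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_tr_s, X_te_s = standardize(X_tr, [X[i]])
        keep = top_features(X_tr_s, y_tr, k)
        X_tr_k = [[r[j] for j in keep] for r in X_tr_s]
        x_te_k = [X_te_s[0][j] for j in keep]
        hits += nearest_centroid_predict(X_tr_k, y_tr, x_te_k) == y[i]
    return hits / len(X)


def make_toy_data(n_per_class=20, n_features=30, n_informative=3, seed=0):
    """Hypothetical two-class data: only the first few features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 2.5 * lab
            X.append(row)
            y.append(lab)
    return X, y


X, y = make_toy_data()
acc_reduced = loo_accuracy(X, y, k=3)        # with feature reduction
acc_all = loo_accuracy(X, y, k=len(X[0]))    # all features retained
print(f"LOO accuracy, 3 features: {acc_reduced:.2f}; all features: {acc_all:.2f}")
```

Feature selection is repeated inside each validation fold on purpose: selecting features on the full data set before cross-validation would leak information from the held-out sample and inflate the accuracy estimate.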

Similar articles

A structured approach to predictive modeling of a two-class problem using multidimensional data sets.

Methods. 2013 May 15;61(1):73-85. doi: 10.1016/j.ymeth.2013.01.002. Epub 2013 Jan 12.

Feature selection and nearest centroid classification for protein mass spectrometry.

BMC Bioinformatics. 2005 Mar 23;6:68. doi: 10.1186/1471-2105-6-68.

A comparison of methods for classifying clinical samples based on proteomics data: a case study for statistical and machine learning approaches.

PLoS One. 2011;6(9):e24973. doi: 10.1371/journal.pone.0024973. Epub 2011 Sep 28.

Bayesian supervised dimensionality reduction.

IEEE Trans Cybern. 2013 Dec;43(6):2179-89. doi: 10.1109/TCYB.2013.2245321.

Nonlinear Dimensionality Reduction by Minimum Curvilinearity for Unsupervised Discovery of Patterns in Multidimensional Proteomic Data.

Methods Mol Biol. 2016;1384:289-98. doi: 10.1007/978-1-4939-3255-9_16.

Discovery proteomics and nonparametric modeling pipeline in the development of a candidate biomarker panel for dengue hemorrhagic fever.

Clin Transl Sci. 2012 Feb;5(1):8-20. doi: 10.1111/j.1752-8062.2011.00377.x. Epub 2012 Feb 23.

Supervised nonlinear dimensionality reduction for visualization and classification.

IEEE Trans Syst Man Cybern B Cybern. 2005 Dec;35(6):1098-107. doi: 10.1109/tsmcb.2005.850151.

Automatic platelets counter for supporting dengue case detection in primary health care in Indonesia.

Stud Health Technol Inform. 2013;192:585-8.

Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis.

IEEE Trans Pattern Anal Mach Intell. 2011 Jan;33(1):194-200. doi: 10.1109/TPAMI.2010.160.

Variable selection methods for developing a biomarker panel for prediction of dengue hemorrhagic fever.

BMC Res Notes. 2013 Sep 11;6:365. doi: 10.1186/1756-0500-6-365.

Cited by

Meteorological factors cannot be ignored in machine learning-based methods for predicting dengue, a systematic review.

Int J Biometeorol. 2024 Mar;68(3):401-410. doi: 10.1007/s00484-023-02605-1. Epub 2023 Dec 27.

Behavioral and neurocognitive factors distinguishing post-traumatic stress comorbidity in substance use disorders.

Transl Psychiatry. 2023 Sep 14;13(1):296. doi: 10.1038/s41398-023-02591-3.

A machine learning-based approach to determine infection status in recipients of BBV152 (Covaxin) whole-virion inactivated SARS-CoV-2 vaccine for serological surveys.

Comput Biol Med. 2022 Jul;146:105419. doi: 10.1016/j.compbiomed.2022.105419. Epub 2022 Apr 25.

Cell-Based Chemical Safety Assessment and Therapeutic Discovery Using Array-Based Sensors.

Int J Mol Sci. 2022 Mar 27;23(7):3672. doi: 10.3390/ijms23073672.

Lessons and tips for designing a machine learning study using EHR data.

J Clin Transl Sci. 2020 Jul 24;5(1):e21. doi: 10.1017/cts.2020.513.

Novel statistical approaches to identify risk factors for soil-transmitted helminth infection in Timor-Leste.

Int J Parasitol. 2021 Aug;51(9):729-739. doi: 10.1016/j.ijpara.2021.01.005. Epub 2021 Mar 31.

Establishment and evaluation of prediction model for multiple disease classification based on gut microbial data.

Sci Rep. 2019 Jul 15;9(1):10189. doi: 10.1038/s41598-019-46249-x.

Development of a Multivariate Predictive Model to Estimate Ionized Calcium Concentration from Serum Biochemical Profile Results in Dogs.

J Vet Intern Med. 2017 Sep;31(5):1392-1402. doi: 10.1111/jvim.14800. Epub 2017 Aug 20.

Improved Detection of Invasive Pulmonary Aspergillosis Arising during Leukemia Treatment Using a Panel of Host Response Proteins and Fungal Antigens.

PLoS One. 2015 Nov 18;10(11):e0143165. doi: 10.1371/journal.pone.0143165. eCollection 2015.

Targeted proteomics for biomarker discovery and validation of hepatocellular carcinoma in hepatitis C infected patients.

World J Hepatol. 2015 Jun 8;7(10):1312-24. doi: 10.4254/wjh.v7.i10.1312.

References

Discovery proteomics and nonparametric modeling pipeline in the development of a candidate biomarker panel for dengue hemorrhagic fever.

Clin Transl Sci. 2012 Feb;5(1):8-20. doi: 10.1111/j.1752-8062.2011.00377.x. Epub 2012 Feb 23.

A three-component biomarker panel for prediction of dengue hemorrhagic fever.

Am J Trop Med Hyg. 2012 Feb;86(2):341-8. doi: 10.4269/ajtmh.2012.11-0469.

Learning from our GWAS mistakes: from experimental design to scientific method.

Biostatistics. 2012 Apr;13(2):195-203. doi: 10.1093/biostatistics/kxr055. Epub 2012 Jan 27.

What information should be required to support clinical "omics" publications?

Clin Chem. 2011 May;57(5):688-90. doi: 10.1373/clinchem.2010.158618.

More is less: signal processing and the data deluge.

Science. 2011 Feb 11;331(6018):717-9. doi: 10.1126/science.1197448.

Predicting intermediate phenotypes in asthma using bronchoalveolar lavage-derived cytokines.

Clin Transl Sci. 2010 Aug;3(4):147-57. doi: 10.1111/j.1752-8062.2010.00204.x.

Arboviral etiologies of acute febrile illnesses in Western South America, 2000-2007.

PLoS Negl Trop Dis. 2010 Aug 10;4(8):e787. doi: 10.1371/journal.pntd.0000787.

Tree and spline based association analysis of gene-gene interaction models for ischemic stroke.

Stat Med. 2004 May 15;23(9):1439-53. doi: 10.1002/sim.1749.

Missing value estimation methods for DNA microarrays.

Bioinformatics. 2001 Jun;17(6):520-5. doi: 10.1093/bioinformatics/17.6.520.

Significance analysis of microarrays applied to the ionizing radiation response.

Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21. doi: 10.1073/pnas.091062498. Epub 2001 Apr 17.
