Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data.

Affiliations

Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.

Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Project Group Translational Medicine and Pharmacology TMP, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany.

Publication information

Int J Mol Sci. 2019 Dec 20;21(1):79. doi: 10.3390/ijms21010079.

DOI:10.3390/ijms21010079

PMID:31861946

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6982269/

Abstract

Advances in flow cytometry enable the acquisition of large, high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, ultimately, the identification of relevant subgroups. Correct visualization and projection of the data from the high-dimensional space onto the visualization plane require that the structures in the data be represented correctly. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data, and in data sets containing distance-based or density-based subgroup structures it correctly displayed the number of subgroups and the data point assignments. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
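The abstract names U-matrix methods without describing them. As a rough illustration only, the sketch below computes U-heights for a small grid of prototype vectors, taking the U-height of a neuron to be the mean Euclidean distance to its four direct grid neighbors. This is a common textbook definition and an assumption here, not the authors' implementation: the function name `u_matrix`, the toy grid, and the bounded (non-toroidal) grid are all hypothetical choices made for brevity.

```python
import math


def u_matrix(weights):
    """Return U-heights for a 2-D grid of prototype vectors.

    weights[r][c] is the prototype (a list of floats) of the neuron at
    row r, column c. The U-height is taken here as the mean Euclidean
    distance to the 4-connected grid neighbours (an assumption; other
    neighbourhoods, or a sum instead of a mean, are also in use).
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                # Bounded planar grid: neighbours outside the grid are skipped.
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights


# Hypothetical toy grid: two flat regions with a jump between columns 1 and 2.
grid = [[[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]] for _ in range(3)]
heights = u_matrix(grid)

for row in heights:
    print(" ".join(f"{h:6.3f}" for h in row))
```

In this toy grid the two middle columns receive large U-heights and the outer columns small ones, so the boundary between the two regions shows up as a ridge. A grid whose prototypes vary only slightly everywhere would yield a flat U-matrix, which is the kind of picture one would expect for data without subgroup structure.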

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ad5b/6982269/dce2872fa8a8/ijms-21-00079-g004.jpg

Similar articles

Machine-learned cluster identification in high-dimensional data.

J Biomed Inform. 2017 Feb;66:95-104. doi: 10.1016/j.jbi.2016.12.011. Epub 2016 Dec 28.

Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets.

Nat Commun. 2019 Nov 28;10(1):5415. doi: 10.1038/s41467-019-13055-y.

A theorem proving approach for automatically synthesizing visualizations of flow cytometry data.

BMC Bioinformatics. 2017 Jun 7;18(Suppl 8):245. doi: 10.1186/s12859-017-1662-4.

Self-Organizing Nebulous Growths for Robust and Incremental Data Visualization.

IEEE Trans Neural Netw Learn Syst. 2021 Oct;32(10):4588-4602. doi: 10.1109/TNNLS.2020.3023941. Epub 2021 Oct 5.

Categorical Analysis of Human T Cell Heterogeneity with One-Dimensional Soli-Expression by Nonlinear Stochastic Embedding.

J Immunol. 2016 Jan 15;196(2):924-32. doi: 10.4049/jimmunol.1501928. Epub 2015 Dec 14.

On the Use of t-Distributed Stochastic Neighbor Embedding for Data Visualization and Classification of Individuals with Parkinson's Disease.

Comput Math Methods Med. 2018 Nov 4;2018:8019232. doi: 10.1155/2018/8019232. eCollection 2018.

Identification of stem cells from large cell populations with topological scoring.

Mol Omics. 2021 Feb 1;17(1):59-65. doi: 10.1039/d0mo00039f. Epub 2020 Sep 14.

Analyzing the similarity of samples and genes by MG-PCC algorithm, t-SNE-SS and t-SNE-SG maps.

BMC Bioinformatics. 2018 Dec 17;19(1):512. doi: 10.1186/s12859-018-2495-5.

The art of using t-SNE for single-cell transcriptomics.

Nat Commun. 2019 Nov 28;10(1):5416. doi: 10.1038/s41467-019-13056-x.

Cited by

Machine learning and biological validation identify sphingolipids as potential mediators of paclitaxel-induced neuropathy in cancer patients.

Elife. 2024 Sep 30;13:RP91941. doi: 10.7554/eLife.91941.

Logistic PCA explains differences between genome-scale metabolic models in terms of metabolic pathways.

PLoS Comput Biol. 2024 Jun 24;20(6):e1012236. doi: 10.1371/journal.pcbi.1012236. eCollection 2024 Jun.

Artificial intelligence and machine learning in pain research: a data scientometric analysis.

Pain Rep. 2022 Nov 3;7(6):e1044. doi: 10.1097/PR9.0000000000001044. eCollection 2022 Nov-Dec.

Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans).

BMC Bioinformatics. 2022 Jun 16;23(1):233. doi: 10.1186/s12859-022-04769-w.

Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).

PLoS One. 2021 Aug 5;16(8):e0255838. doi: 10.1371/journal.pone.0255838. eCollection 2021.

Analyzing the fine structure of distributions.

PLoS One. 2020 Oct 14;15(10):e0238835. doi: 10.1371/journal.pone.0238835. eCollection 2020.

Multiparametric Color Tendency Analysis (MCTA): A Method to Analyze Several Flow Cytometry Labelings Simultaneously.

Front Bioeng Biotechnol. 2020 Sep 17;8:526814. doi: 10.3389/fbioe.2020.526814. eCollection 2020.

Gestational Dysfunction-Driven Diets and Probiotic Supplementation Correlate with the Profile of Allergen-Specific Antibodies in the Serum of Allergy Sufferers.

Nutrients. 2020 Aug 9;12(8):2381. doi: 10.3390/nu12082381.
