A composite model for subgroup identification and prediction via bicluster analysis.

Authors

Chen Hung-Chia, Zou Wen, Lu Tzu-Pin, Chen James J

Affiliations

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America; Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan.

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America.

Publication information

PLoS One. 2014 Oct 27;9(10):e111318. doi: 10.1371/journal.pone.0111318. eCollection 2014.

DOI:10.1371/journal.pone.0111318

PMID:25347824

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4210136/

Abstract

BACKGROUND

A major challenge in the analysis of large and complex biomedical data is to develop an approach for 1) identifying distinct subgroups in the sampled populations, 2) characterizing the relationships among those subgroups, and 3) developing a prediction model that classifies the subgroup membership of new samples by finding a set of predictors. The subgroups can represent different pathogen serotypes of microorganisms, different tumor subtypes in cancer patients, or different genetic makeups of patients related to treatment response.

METHODS

This paper proposes a composite model for subgroup identification and prediction using biclusters. A biclustering technique is first used to identify a set of biclusters from the sampled data. For each bicluster, a subgroup-specific binary classifier is built to determine whether a particular sample is inside or outside the bicluster. A composite model, which consists of all the binary classifiers, is then constructed to classify samples into several disjoint subgroups. The proposed composite model does not depend on any specific biclustering algorithm, bicluster pattern, or classification algorithm.
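The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the biclusters are assumed to be already available as (row set, column set) pairs from some biclustering step, the per-bicluster classifier is a simple mean-score threshold chosen only for illustration, and the names `fit_bicluster_classifier`, `fit_composite`, and `predict`, as well as the toy data, are hypothetical.

```python
from statistics import mean


def fit_bicluster_classifier(data, rows, cols):
    """Fit one inside/outside classifier for a single bicluster.

    Assumption for illustration: a sample's score is the mean of its values
    over the bicluster's columns, and the decision threshold is the midpoint
    between the mean score of samples inside and outside the bicluster.
    """
    cols = sorted(cols)

    def score(sample):
        return mean(sample[c] for c in cols)

    inside = [score(s) for i, s in enumerate(data) if i in rows]
    outside = [score(s) for i, s in enumerate(data) if i not in rows]
    threshold = (mean(inside) + mean(outside)) / 2
    # The classifier returns a signed margin: positive means "inside".
    return lambda sample: score(sample) - threshold


def fit_composite(data, biclusters):
    """Build the composite model: one binary classifier per bicluster."""
    return [fit_bicluster_classifier(data, rows, cols) for rows, cols in biclusters]


def predict(composite, sample):
    """Assign a sample to at most one subgroup, so subgroups stay disjoint.

    The sample goes to the bicluster with the largest positive margin,
    or to no subgroup (None) if every classifier says "outside".
    """
    margins = [clf(sample) for clf in composite]
    best = max(range(len(margins)), key=margins.__getitem__)
    return best if margins[best] > 0 else None


# Toy binary data: rows are samples, columns are features (hypothetical values).
data = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1],
]
# Assumed output of a biclustering step: (sample indices, feature indices).
biclusters = [({0, 1, 2}, {0, 1, 2}), ({3, 4, 5}, {3, 4, 5})]

composite = fit_composite(data, biclusters)
new_samples = [[1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0]]
labels = [predict(composite, s) for s in new_samples]
print(labels)
```

Because the composite only requires a list of callables that return a margin, the threshold rule here could be swapped for any other binary classifier, which mirrors the abstract's claim that the model is independent of the classification algorithm.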

RESULTS

The composite model achieved an overall accuracy of 97.4% on a synthetic dataset consisting of four subgroups. The model was also applied to two datasets in which the samples' subgroup memberships were known. The procedure showed 83.7% accuracy in discriminating the adenocarcinoma and squamous carcinoma subtypes of lung cancer, and it identified 5 serotypes and several subtypes with about 94% accuracy in a pathogen dataset.

CONCLUSION

The composite model presents a novel approach to developing a biclustering-based classification model from unlabeled sampled data. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identification of unknown species or new biomarkers for targeted therapy.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a9cc/4210136/c3467685787f/pone.0111318.g001.jpg
