Model-Based Clustering With Data Correction For Removing Artifacts In Gene Expression Data.

Author information

Young William Chad, Raftery Adrian E, Yeung Ka Yee

Affiliations

Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195.

Institute of Technology, University of Washington Tacoma, Campus Box 358426, 1900 Commerce Street, Tacoma, WA 98402.

Publication information

Ann Appl Stat. 2017 Dec;11(4):1998-2026. doi: 10.1214/17-AOAS1051. Epub 2017 Dec 28.

DOI: 10.1214/17-AOAS1051

PMID: 30740193

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6364860/

Abstract

The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes measured, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis.
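
To make the "flipped pair" artifact concrete, the following is a minimal Python sketch and not the MCDC algorithm itself, whose details the abstract does not give. It assumes a single bivariate Gaussian for one gene pair, a flip rate well below 50%, and well-separated expression levels. The simulated data, the function names `gaussian_logpdf` and `correct_flips`, and all parameter values are hypothetical. The loop alternates between fitting the Gaussian to the currently corrected data and swapping back any pair whose swapped version is more likely under that fit.

```python
import numpy as np


def gaussian_logpdf(points, mean, cov):
    """Log density of a bivariate Gaussian at each row of `points`."""
    diff = points - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + 2 * np.log(2 * np.pi))


def correct_flips(pairs, n_iter=20):
    """Toy hard-assignment loop (an illustration, not the published method).

    Refit one Gaussian to the currently corrected data, then mark as
    flipped every pair whose swapped version has higher likelihood.
    """
    flipped = np.zeros(len(pairs), dtype=bool)
    for _ in range(n_iter):
        current = np.where(flipped[:, None], pairs[:, ::-1], pairs)
        mean = current.mean(axis=0)
        cov = np.cov(current, rowvar=False)
        as_is = gaussian_logpdf(pairs, mean, cov)
        swapped = gaussian_logpdf(pairs[:, ::-1], mean, cov)
        new_flipped = swapped > as_is
        if np.array_equal(new_flipped, flipped):
            break  # assignments stopped changing
        flipped = new_flipped
    corrected = np.where(flipped[:, None], pairs[:, ::-1], pairs)
    return corrected, flipped


# Simulated data (hypothetical values): one gene pair across 500 experiments.
rng = np.random.default_rng(0)
true_pairs = rng.multivariate_normal(
    mean=[6.0, 10.0], cov=[[0.25, 0.05], [0.05, 0.25]], size=500
)
is_flipped = rng.random(500) < 0.15  # 15% of pairs get their values swapped
observed = np.where(is_flipped[:, None], true_pairs[:, ::-1], true_pairs)

corrected, detected = correct_flips(observed)
print("fraction of flip labels recovered:", np.mean(detected == is_flipped))
```

The sketch handles only one of the three artifact types named in the abstract. If most pairs were flipped, this loop would settle on the swapped orientation, so it depends on flips being the minority.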
