A white-box approach to microarray probe response characterization: the BaFL pipeline.

Affiliation

Computer Science Dept, University of North Carolina at Charlotte, Charlotte, NC 28223, USA.

Publication information

BMC Bioinformatics. 2009 Dec 29;10:449. doi: 10.1186/1471-2105-10-449.

DOI:10.1186/1471-2105-10-449

PMID:20040098

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2804686/

Abstract

BACKGROUND

Microarrays depend on appropriate probe design to deliver the promise of accurate genome-wide measurement. Probe design, ideally, produces a unique probe-target match with homogeneous duplex stability over the complete set of probes. Much of microarray pre-processing is concerned with adjusting for non-ideal probes that do not report target concentration accurately. Cross-hybridizing (non-unique) probes, probe composition and structure, and platform effects such as instrument limitations have all been shown to affect the interpretation of signal. Data cleansing pipelines seldom filter specifically for these constraints, relying instead on general statistical tests to remove the most variable probes from the samples in a study. As a result, the set of probes contributing to ProbeSet (gene) values is adjusted in a study-specific manner. We refer to the complete set of factors as biologically applied filter levels (BaFL) and have assembled an analysis pipeline for managing them consistently. The pipeline and associated experiments reported here examine how comprehensively excluding probes affected by known factors changes the consistency of target behavior across experiments.

RESULTS

We present here a 'white box' probe filtering and intensity transformation protocol that incorporates currently understood factors affecting probe and target interactions. The method has been tested on data from the Affymetrix human GeneChip HG-U95Av2, using two independent datasets from studies of a complex lung adenocarcinoma phenotype. The protocol incorporates probe-specific effects from SNPs, cross-hybridization, and low heteroduplex affinity, as well as effects from scanner sensitivity and sample batches, and it includes simple statistical tests for identifying unresolved biological factors leading to sample variability. After filtering for these factors, the consistency and reliability of the remaining measurements are shown to be markedly improved.
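The probe-level filtering described above can be illustrated with a minimal Python sketch. It is not the BaFL implementation: the `Probe` record, the boolean flag names, the intensity bounds `lo` and `hi`, and the `min_probes` cutoff are all hypothetical placeholders chosen for illustration, and the abstract does not state the actual thresholds or data layout.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Probe:
    """Hypothetical probe record; field names are illustrative only."""
    probe_id: str
    probeset_id: str
    has_snp: bool            # probe sequence overlaps a known SNP
    cross_hybridizes: bool   # probe matches more than one target
    low_affinity: bool       # predicted duplex stability is too low
    intensity: float         # raw scanner intensity for one sample


def passes_filters(p, lo=200.0, hi=20000.0):
    """Keep a probe only if no sequence-level flag is set and its
    intensity lies inside an assumed usable scanner range [lo, hi]."""
    if p.has_snp or p.cross_hybridizes or p.low_affinity:
        return False
    return lo <= p.intensity <= hi


def probeset_means(probes, min_probes=2, lo=200.0, hi=20000.0):
    """Average the surviving probes per ProbeSet, dropping any
    ProbeSet left with fewer than min_probes probes."""
    kept = {}
    for p in probes:
        if passes_filters(p, lo, hi):
            kept.setdefault(p.probeset_id, []).append(p.intensity)
    return {ps: mean(vals) for ps, vals in kept.items()
            if len(vals) >= min_probes}
```

The point of the design is that every exclusion has a named, inspectable reason (a flag or a range check), in contrast to removing probes only because they vary most within one study.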

CONCLUSIONS

The data cleansing protocol yields reproducible estimates of the relative expression of a given probe or ProbeSet (gene), and these estimates translate across datasets, allowing for credible cross-experiment comparisons. We provide supporting evidence for the validity of removing several large classes of probes, and for our approaches to removing outlying samples. The resulting expression profiles demonstrate consistency across the two independent datasets. Finally, we demonstrate that, given an appropriate sampling pool, the method enhances the t-test's statistical power to discriminate significantly different means over sample classes.
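One way to see why cleaner measurements can help a t-test is that, for a fixed difference in class means, lower within-class variance gives a larger test statistic. The sketch below computes Welch's t statistic with the Python standard library. The choice of Welch's variant and the sample values are assumptions made for illustration; the abstract does not say which form of the t-test was used.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples.
    Uses the sample variance (n - 1 denominator) of each group."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Same difference in means (5 units), different within-class spread.
tight_a, tight_b = [9.0, 10.0, 11.0], [14.0, 15.0, 16.0]
noisy_a, noisy_b = [5.0, 10.0, 15.0], [10.0, 15.0, 20.0]

t_tight = welch_t(tight_a, tight_b)
t_noisy = welch_t(noisy_a, noisy_b)
```

Here `abs(t_tight)` exceeds `abs(t_noisy)` even though the mean difference is identical, which is the effect that removing unreliable probes and outlying samples is meant to exploit.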
