McDade Kevin K, Chandran Uma, Day Roger S
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Science, The Pennsylvania State University, Shenango Campus, Sharon, PA, USA.
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
Cancer Inform. 2015 Dec 16;14:149-61. doi: 10.4137/CIN.S33076. eCollection 2015.
Data quality is a recognized problem for high-throughput genomics platforms, as evinced by the proliferation of methods attempting to filter out lower quality data points. Different filtering methods lead to discordant results, raising the question of which methods are best. Astonishingly, analysts are offered little computational support for deciding which filtering methods are optimal for the research question at hand. To evaluate filtering methods, we begin with a pair of expression data sets, transcriptomic and proteomic, measured on the same samples. The pair of data sets forms a test-bed for the evaluation. Identifier mapping between the data sets creates a collection of feature pairs, and a correlation is calculated for each pair. To evaluate a filtering strategy, we estimate posterior probabilities for the correctness of the probesets accepted by the method. An analyst can set expected utilities that represent the trade-off between the quality and quantity of accepted features. We tested nine published probeset filtering methods and combination strategies, using two test-beds from cancer studies that provide both transcriptomic and proteomic data. For reasonable utility settings, the Jetset filtering method was optimal for probeset filtering on both test-beds, even though the assay platforms differed between the two test-beds. Further intersection with a second filtering method was indicated on one test-bed but not the other.
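The evaluation idea described in the abstract (correlate mapped feature pairs, then score each filtering strategy by an expected utility over the probesets it accepts) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the additive utility form, the utility values, and the toy posterior probabilities are all assumptions made for illustration, and the step that estimates posterior probabilities from the correlations is not shown; the posteriors are simply taken as given inputs.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pair_correlations(mrna, protein, id_map):
    """Correlate each mapped (probeset, protein) pair across shared samples.

    mrna and protein map feature IDs to per-sample values in the same
    sample order; id_map is an iterable of (probeset_id, protein_id) pairs.
    """
    return {(ps, pr): pearson(mrna[ps], protein[pr]) for ps, pr in id_map}


def expected_utility(posteriors, accepted, u_good=1.0, u_bad=-1.0):
    """Assumed additive utility: each accepted probeset contributes
    p * u_good + (1 - p) * u_bad, where p is its posterior probability
    of being correct."""
    return sum(
        posteriors[ps] * u_good + (1.0 - posteriors[ps]) * u_bad
        for ps in accepted
    )


def best_strategy(posteriors, strategies, u_good=1.0, u_bad=-1.0):
    """Score every strategy and return (best_name, all_scores)."""
    scores = {
        name: expected_utility(posteriors, accepted, u_good, u_bad)
        for name, accepted in strategies.items()
    }
    return max(scores, key=scores.get), scores


# Toy inputs (hypothetical, not taken from the paper).
posteriors = {"ps1": 0.9, "ps2": 0.8, "ps3": 0.3, "ps4": 0.1}
strategies = {
    "no_filter": {"ps1", "ps2", "ps3", "ps4"},
    "loose": {"ps1", "ps2", "ps3"},
    "strict": {"ps1", "ps2"},
}

# A heavy penalty for bad probesets favours the strict filter;
# a light penalty favours keeping more features.
quality_first, _ = best_strategy(posteriors, strategies, u_bad=-1.0)
quantity_first, _ = best_strategy(posteriors, strategies, u_bad=-0.1)
```

Under these assumed utilities, changing only the penalty for accepting an incorrect probeset changes which strategy wins, which is the quality versus quantity trade-off the abstract refers to.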