A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration.

Affiliation

Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.

Publication information

BMC Bioinformatics. 2013 Jul 15;14:223. doi: 10.1186/1471-2105-14-223.

Abstract

BACKGROUND

In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

RESULTS

We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.
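The ingredients listed above (an association measure per ID pair, a mixture model, and a decision rule) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Fisher z-transformed correlations as the association measure, a two-component Gaussian mixture whose "incorrect mapping" component is pinned at mean zero, and a simple gain/loss utility. All function names, the simulated data, and the utility values are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Gaussian density, written out to keep the sketch NumPy-only."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fit_mixture(z, n_iter=300):
    """EM for a two-component 1-D Gaussian mixture.

    Component 0 ("incorrect mapping") has its mean fixed at 0 (assumption);
    component 1 ("correct mapping") has a free mean. Returns the estimated
    weight of component 1 and the per-pair posteriors from the last E-step.
    """
    z = np.asarray(z, dtype=float)
    weight, mean1 = 0.5, np.quantile(z, 0.9)
    sd0 = sd1 = z.std()
    for _ in range(n_iter):
        # E-step: posterior probability that each ID pair is a correct mapping
        d0 = (1.0 - weight) * normal_pdf(z, 0.0, sd0)
        d1 = weight * normal_pdf(z, mean1, sd1)
        post = d1 / (d0 + d1 + 1e-300)
        # M-step: re-estimate the mixture parameters from the posteriors
        weight = post.mean()
        mean1 = np.sum(post * z) / np.sum(post)
        sd1 = np.sqrt(np.sum(post * (z - mean1) ** 2) / np.sum(post))
        sd0 = np.sqrt(np.sum((1.0 - post) * z ** 2) / np.sum(1.0 - post))
    return weight, post

def expected_utility_per_pair(post, gain_tp=1.0, loss_fp=1.0):
    """Accept a pair when its expected utility is positive, then average."""
    threshold = loss_fp / (gain_tp + loss_fp)
    accept = post > threshold
    utility = np.where(accept, post * gain_tp - (1.0 - post) * loss_fp, 0.0)
    return utility.mean()

def simulate_resource(rng, n_pairs, frac_correct):
    """Hypothetical data: z-scores for ID pairs proposed by one IDM resource."""
    correct = rng.random(n_pairs) < frac_correct
    return rng.normal(np.where(correct, 0.5, 0.0), 0.2)

rng = np.random.default_rng(0)
z_a = simulate_resource(rng, 2000, 0.7)   # resource A: mostly correct pairs
z_b = simulate_resource(rng, 2000, 0.4)   # resource B: fewer correct pairs

weight_a, post_a = fit_mixture(z_a)
weight_b, post_b = fit_mixture(z_b)
utility_a = expected_utility_per_pair(post_a)
utility_b = expected_utility_per_pair(post_b)
print(f"A: weight={weight_a:.2f} utility={utility_a:.3f}")
print(f"B: weight={weight_b:.2f} utility={utility_b:.3f}")
```

In this toy setting, the resource whose ID pairs show a larger estimated "correct" weight and a higher expected utility per pair would be preferred. Changing `gain_tp` and `loss_fp` moves the acceptance threshold, which is how the same machinery could be used to tune a filtering cutoff.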

CONCLUSIONS

The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/253c/3734162/bab2a960ff0f/1471-2105-14-223-1.jpg
