A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration.

Affiliation

Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.

Publication information

BMC Bioinformatics. 2013 Jul 15;14:223. doi: 10.1186/1471-2105-14-223.

DOI:10.1186/1471-2105-14-223

PMID:23855655

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3734162/

Abstract

BACKGROUND

In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

RESULTS

We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.
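The abstract does not specify the model family, the association measure, or the utilities, so the following is only a minimal sketch under stated assumptions. It assumes the association measure is a Fisher z-transformed correlation between the two paired features, that incorrectly mapped pairs are centred on zero while correctly mapped pairs share a shifted mean, and that accepting a pair has a hypothetical gain `gain` when it is correct and a hypothetical loss `loss` when it is not. All function names are illustrative and are not taken from the paper.

```python
import numpy as np


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fit_mixture(z, n_iter=500, tol=1e-9):
    """EM fit of a two-component mixture on association scores z.

    Assumed components (not from the paper):
      null: N(0, s0^2)    for incorrectly mapped ID pairs
      true: N(mu1, s1^2)  for correctly mapped ID pairs
    """
    z = np.asarray(z, dtype=float)
    pi1 = 0.5
    mu1 = z.mean() + z.std()
    s0 = s1 = max(z.std(), 1e-6)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is correctly mapped.
        d0 = (1.0 - pi1) * normal_pdf(z, 0.0, s0)
        d1 = pi1 * normal_pdf(z, mu1, s1)
        total = np.maximum(d0 + d1, 1e-300)
        r = d1 / total
        loglik = np.log(total).sum()
        # M-step: update mixing weight, true-component mean, and both spreads.
        pi1 = r.mean()
        mu1 = (r * z).sum() / r.sum()
        s1 = max(np.sqrt((r * (z - mu1) ** 2).sum() / r.sum()), 1e-6)
        s0 = max(np.sqrt(((1.0 - r) * z ** 2).sum() / (1.0 - r).sum()), 1e-6)
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return {"pi1": pi1, "mu1": mu1, "s0": s0, "s1": s1}


def posterior_true(z, fit):
    """Posterior probability that each score comes from the 'true' component."""
    z = np.asarray(z, dtype=float)
    d0 = (1.0 - fit["pi1"]) * normal_pdf(z, 0.0, fit["s0"])
    d1 = fit["pi1"] * normal_pdf(z, fit["mu1"], fit["s1"])
    return d1 / np.maximum(d0 + d1, 1e-300)


def decide(post, gain=1.0, loss=1.0):
    """Accept a pair when its expected utility of acceptance is positive."""
    return post * gain - (1.0 - post) * loss > 0.0


def expected_utility(post, accept, gain=1.0, loss=1.0):
    """Total expected utility of the accepted pairs; usable to compare methods."""
    return float((post[accept] * gain - (1.0 - post[accept]) * loss).sum())


# Synthetic illustration only: in practice z would be np.arctanh(r) for the
# correlation r of each proposed ID pair across the biological samples.
rng = np.random.default_rng(0)
z_null = rng.normal(0.0, 0.3, size=300)   # simulated incorrect pairs
z_true = rng.normal(1.2, 0.3, size=200)   # simulated correct pairs
z_all = np.concatenate([z_null, z_true])

fit = fit_mixture(z_all)
post = posterior_true(z_all, fit)
accept = decide(post, gain=1.0, loss=1.0)
utility = expected_utility(post, accept)
```

Under these assumptions, two competing IDM resources or IDF rules could be compared by computing `expected_utility` on the pairs each one proposes or retains. Changing `gain` and `loss` moves the acceptance threshold on the posterior to `loss / (gain + loss)`.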

CONCLUSIONS

The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.





