Querying genomic databases: refining the connectivity map.

Author information

Segal Mark R, Xiong Hao, Bengtsson Henrik, Bourgon Richard, Gentleman Robert

Affiliation

University of California-San Francisco, CA, USA.

Publication information

Stat Appl Genet Mol Biol. 2012 Jan 6;11(2). doi: 10.2202/1544-6115.1715.

DOI:10.2202/1544-6115.1715

PMID:22499690

Abstract

The advent of high-throughput biotechnologies, which can efficiently measure gene expression on a global basis, has led to the creation and population of correspondingly rich databases and compendia. Such repositories have the potential to add enormous scientific value beyond that provided by individual studies which, due largely to cost considerations, are typified by small sample sizes. Accordingly, substantial effort has been invested in devising analysis schemes for utilizing gene-expression repositories. Here, we focus on one such scheme, the Connectivity Map (cmap), that was developed with the express purpose of identifying drugs with putative efficacy against a given disease, where the disease in question is characterized by a (differential) gene-expression signature. Initial claims surrounding cmap intimated that such tools might lead to new, previously unanticipated applications of existing drugs. However, further application suggests that its primary utility is in connecting a disease condition whose biology is largely unknown to a drug whose mechanisms of action are well understood, making cmap a tool for enhancing biological knowledge.

The success of the Connectivity Map is belied by its simplicity. The aforementioned signature serves as an unordered query which is applied to a customized database of (differential) gene-expression experiments designed to elicit response to a wide range of drugs, across a spectrum of concentrations, durations, and cell lines. Such application is effected by computing a per-experiment score that measures "closeness" between the signature and the experiment. Top-scoring experiments, and the attendant drug(s), are then deemed relevant to the disease underlying the query. Inference supporting such elicitations is pursued via re-sampling. In this paper, we revisit two key aspects of the Connectivity Map implementation. Firstly, we develop new approaches to measuring closeness for the common scenario wherein the query constitutes an ordered list. These involve metrics proposed for analyzing partially ranked data, which are of interest in their own right and not widely used. Secondly, we advance an alternate inferential approach based on generating empirical null distributions that exploit the scope, and capture dependencies, embodied by the database. Using these refinements we undertake a comprehensive re-evaluation of Connectivity Map findings that, in general terms, reveals that accommodating ordered queries is less critical than the mode of inference.
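
The two ingredients described above, a per-experiment closeness score for an ordered query and a choice between two modes of inference, can be illustrated with a small Python sketch. It is not the authors' implementation, and the abstract does not specify the metrics or the null construction. The footrule-style score, the function names (`footrule_closeness`, `resampled_null`, `database_null`, `empirical_p`), and the idea of building the empirical null by scoring the same query against the other experiments in the database are all assumptions made for illustration.

```python
import random


def footrule_closeness(query, profile):
    """Toy closeness in [0, 1] between an ordered query and one experiment.

    query   : list of gene ids, most important first (an ordered signature)
    profile : dict gene id -> differential expression in one experiment
    Assumption: a footrule-style displacement, not the paper's metric.
    """
    # Rank all genes in the experiment, 0 = most up-regulated.
    ordered = sorted(profile, key=profile.get, reverse=True)
    rank = {gene: r for r, gene in enumerate(ordered)}
    n, k = len(ordered), len(query)
    # Total displacement between each gene's query position and its rank.
    displacement = sum(abs(i - rank[g]) for i, g in enumerate(query))
    # Each term is at most n - 1, so the score stays within [0, 1].
    return 1.0 - displacement / (k * (n - 1))


def resampled_null(query, profile, n_draws, rng):
    """Null scores from random ordered queries of the same length."""
    genes = list(profile)
    return [
        footrule_closeness(rng.sample(genes, len(query)), profile)
        for _ in range(n_draws)
    ]


def database_null(query, database, exclude):
    """Null scores from the same query scored against every other experiment.

    Hypothetical reading of a database-derived empirical null.
    """
    return [
        footrule_closeness(query, prof)
        for name, prof in database.items()
        if name != exclude
    ]


def empirical_p(observed, null_scores):
    """One-sided empirical p-value with the usual +1 correction."""
    exceed = sum(s >= observed for s in null_scores)
    return (1 + exceed) / (1 + len(null_scores))


if __name__ == "__main__":
    rng = random.Random(0)
    genes = [f"g{i}" for i in range(50)]
    # Toy database: 30 experiments with random differential expression.
    database = {
        f"exp{j}": {g: rng.gauss(0.0, 1.0) for g in genes} for j in range(30)
    }
    query = genes[:5]
    target = "exp0"
    observed = footrule_closeness(query, database[target])
    p_resample = empirical_p(
        observed, resampled_null(query, database[target], 200, rng)
    )
    p_database = empirical_p(observed, database_null(query, database, target))
    print(observed, p_resample, p_database)
```

In this sketch the two p-values answer different questions. The resampled null asks how unusual the query is relative to random gene lists within one experiment, while the database-derived null asks how unusual the experiment is relative to the rest of the database for the same query.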
