A rapid method for computationally inferring transcriptome coverage and microarray sensitivity.

Author information

Reverter A, McWilliam S M, Barris W, Dalrymple B P

Affiliation

Bioinformatics Group, CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Road, St Lucia, QLD 4067, Australia.

Publication information

Bioinformatics. 2005 Jan 1;21(1):80-9. doi: 10.1093/bioinformatics/bth472. Epub 2004 Aug 12.

DOI:10.1093/bioinformatics/bth472

PMID:15308544

Abstract

MOTIVATION

There are many different gene expression technologies, including cDNA and oligo-based microarrays, SAGE and MPSS. For each organism of interest, coverage of the transcriptome and the genome will be different. We ask what level of coverage is required to exploit the sensitivity of each technology, and how sensitive each approach is in an experimental study.

RESULTS

We estimate transcriptome coverage by randomly sampling transcripts from a pre-defined tag-to-gene mapping function. For a given microarray experiment, we locate the intensity thresholds that define the distribution of transcript abundance. These values are compared against the distribution obtained by applying the same thresholds to the intensities of differentially expressed genes. The ratio of the two distributions reaches an equilibrium, and that equilibrium point defines the sensitivity. We conclude that a collection of approximately 340,000 sequences is adequate for microarrays but not large enough to make full use of tag-based technologies. In the absence of large-scale sequencing, most tags detected by the latter approaches will remain unidentified until the genome sequence is available.
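
The random-sampling idea behind the coverage estimate can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `simulate_coverage`, the toy abundance classes, and the sample sizes are all hypothetical, chosen only to show how the fraction of genes seen at least once grows as more sequences are drawn in proportion to transcript abundance.

```python
import random


def simulate_coverage(abundances, n_sequences, seed=0):
    """Return the fraction of genes observed at least once when
    n_sequences transcripts are drawn at random, each gene being
    picked with probability proportional to its abundance.

    Illustrative assumption only; not the method used in the paper.
    """
    rng = random.Random(seed)
    genes = range(len(abundances))
    # Draw with replacement, weighted by transcript abundance.
    sampled = rng.choices(genes, weights=abundances, k=n_sequences)
    return len(set(sampled)) / len(abundances)


# Hypothetical toy transcriptome: a few abundant genes, many rare ones.
toy_abundances = [1000] * 10 + [100] * 100 + [1] * 1000

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, round(simulate_coverage(toy_abundances, n), 3))
```

In this toy setting the rare class dominates the gene count, so coverage rises slowly until the number of sampled sequences is large relative to the total abundance. That is the qualitative behaviour one would expect when sequence collections are too small for tag-based technologies.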
