An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

Author information

Hosseini Parsa, Tremblay Arianne, Matthews Benjamin F, Alkharouf Nadim W

Affiliation

Jess and Mildred Fisher College of Science and Mathematics, Department of Computer and Information Sciences, Towson University, 7800 York Road, Towson, Maryland, 21252, USA.

Publication information

BMC Res Notes. 2010 Jul 2;3:183. doi: 10.1186/1756-0500-3-183.

DOI:10.1186/1756-0500-3-183

PMID:20598141

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2908109/

Abstract

BACKGROUND

An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great importance. One can easily be flooded by such a volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value.

FINDINGS

We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads from any given CASAVA-build with the gene's presumed function. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The output contains the homology-based functional annotation and the respective gene expression measure, which signifies how many times sequenced reads were found within the genomic ranges of functional annotations.
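
The tag-counting idea described above, counting how many reads fall within the genomic range of each functional annotation, can be illustrated with a minimal sketch. This is not TASE's implementation (TASE is written in Java against a SQL Server backend); the input formats, the function name `count_tags`, and the sample annotations are all hypothetical, and the sketch assumes each read is reduced to a single aligned position and that annotation ranges are inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def count_tags(read_positions, annotations):
    """Count reads whose aligned position lies inside each annotated range.

    read_positions: iterable of (chromosome, position) tuples (hypothetical format).
    annotations: iterable of (chromosome, start, end, function) tuples,
                 with start and end both inclusive (an assumption of this sketch).
    Returns a list of (function, chromosome, start, end, tag_count) tuples.
    """
    # Group read positions by chromosome and sort them for binary search.
    by_chrom = defaultdict(list)
    for chrom, pos in read_positions:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    results = []
    for chrom, start, end, function in annotations:
        positions = by_chrom.get(chrom, [])
        # Number of sorted positions p satisfying start <= p <= end.
        count = bisect_right(positions, end) - bisect_left(positions, start)
        results.append((function, chrom, start, end, count))
    return results


# Made-up example data, for illustration only.
reads = [("chr1", 105), ("chr1", 150), ("chr1", 900), ("chr2", 42)]
annotations = [
    ("chr1", 100, 200, "putative kinase"),
    ("chr1", 850, 950, "hypothetical protein"),
    ("chr2", 500, 600, "transcription factor"),
]
for row in count_tags(reads, annotations):
    print(row)
```

Sorting the positions once per chromosome lets each annotation be counted with two binary searches rather than a scan over every read, which matters when the read set runs to gigabytes. A database-backed tool could obtain the same counts with a range join instead.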

CONCLUSIONS

TASE is a powerful tool that facilitates annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis complete very efficiently, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specifically designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the experiment is single-read or paired-end. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset.
