White James Robert, Roberts Michael, Yorke James A, Pop Mihai
Center for Bioinformatics and Computational Biology, University of Maryland - College Park, MD 20742, USA.
Bioinformatics. 2008 Feb 15;24(4):462-7. doi: 10.1093/bioinformatics/btm632. Epub 2008 Jan 17.
Sequences produced by automated Sanger sequencing machines frequently contain fragments of the cloning vector on their ends. Software tools currently available for identifying and removing the vector sequence require knowledge of the vector sequence, the specific splice sites and any adapter sequences used in the experiment, information that is often omitted from public databases. Furthermore, the clipping coordinates themselves are missing or incorrectly reported. As an example, of the approximately 1.24 billion shotgun sequences deposited in the NCBI Trace Archive, as many as approximately 735 million (approximately 60%) lack vector clipping information. Correct clipping information is essential to scientists attempting to validate, improve and even finish the increasingly large number of genomes released at a 'draft' quality level.
We present here Figaro, a novel software tool for identifying and removing the vector from raw sequence data without prior knowledge of the vector sequence. The vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligonucleotides using Poisson statistics. We show that Figaro achieves 99.98% sensitivity when tested on approximately 1.5 million shotgun reads from Drosophila pseudoobscura. We further explore the impact of accurate vector trimming on the quality of whole-genome assemblies by re-assembling two bacterial genomes from shotgun sequences deposited in the Trace Archive. Designed as a module in large computational pipelines, Figaro is fast, lightweight and flexible.
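To make the idea of frequency-based vector inference concrete, the following is a minimal sketch and not Figaro's actual implementation. It assumes that vector-derived oligonucleotides (k-mers) pile up in the first few bases of each read, estimates a background rate for each k-mer from the read interiors, and flags k-mers whose count in the read prefixes is improbably high under a Poisson model. The function names (`poisson_sf`, `overrepresented_kmers`), the prefix/interior split, the pseudocount, and the parameters `k`, `prefix_len` and `alpha` are all illustrative assumptions.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam).

    Computed as 1 - CDF(k - 1). Adequate for the small rates used here;
    a library routine would be preferable for large lam.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def overrepresented_kmers(reads, k=8, prefix_len=50, alpha=1e-6):
    """Flag k-mers that are unexpectedly frequent in read prefixes.

    Hypothetical scheme (an assumption, not the published method):
    the background rate of each k-mer is estimated from the read
    interiors, and the prefix count is tested against a Poisson
    distribution with the corresponding expected count.
    Returns a dict mapping each flagged k-mer to its tail probability.
    """
    prefix_counts = Counter()
    interior_counts = Counter()
    n_prefix = 0
    n_interior = 0
    for read in reads:
        prefix, interior = read[:prefix_len], read[prefix_len:]
        for i in range(len(prefix) - k + 1):
            prefix_counts[prefix[i:i + k]] += 1
            n_prefix += 1
        for i in range(len(interior) - k + 1):
            interior_counts[interior[i:i + k]] += 1
            n_interior += 1

    flagged = {}
    for kmer, observed in prefix_counts.items():
        # Add-one pseudocount over all 4**k possible k-mers so that
        # k-mers never seen in the interior still get a nonzero rate.
        rate = (interior_counts[kmer] + 1) / (n_interior + 4 ** k)
        expected = rate * n_prefix
        p_value = poisson_sf(observed, expected)
        if p_value < alpha:
            flagged[kmer] = p_value
    return flagged
```

Calling `overrepresented_kmers(reads)` on a list of read strings returns the candidate vector k-mers. A real trimmer would still have to turn those k-mers into a clip coordinate for each read, which this sketch does not attempt.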
Figaro is released under an open-source license through the AMOS package (http://amos.sourceforge.net/Figaro).