Figaro: a novel statistical method for vector sequence removal.

Authors

White James Robert, Roberts Michael, Yorke James A, Pop Mihai

Affiliation

Center for Bioinformatics and Computational Biology, University of Maryland - College Park, MD 20742, USA.

Publication

Bioinformatics. 2008 Feb 15;24(4):462-7. doi: 10.1093/bioinformatics/btm632. Epub 2008 Jan 17.

DOI: 10.1093/bioinformatics/btm632
PMID: 18202027
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2725436/

Abstract

MOTIVATION

Sequences produced by automated Sanger sequencing machines frequently contain fragments of the cloning vector on their ends. Software tools currently available for identifying and removing the vector sequence require knowledge of the vector sequence, specific splice sites and any adapter sequences used in the experiment, information often omitted from public databases. Furthermore, the clipping coordinates themselves are missing or incorrectly reported. As an example, within the approximately 1.24 billion shotgun sequences deposited in the NCBI Trace Archive, as many as approximately 735 million (approximately 60%) lack vector clipping information. Correct clipping information is essential to scientists attempting to validate, improve and even finish the increasingly large number of genomes released at a 'draft' quality level.

RESULTS

We present here Figaro, a novel software tool for identifying and removing the vector from raw sequence data without prior knowledge of the vector sequence. The vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics. We show that Figaro achieves 99.98% sensitivity when tested on approximately 1.5 million shotgun reads from Drosophila pseudoobscura. We further explore the impact of accurate vector trimming on the quality of whole-genome assemblies by re-assembling two bacterial genomes from shotgun sequences deposited in the Trace Archive. Designed as a module in large computational pipelines, Figaro is fast, lightweight and flexible.
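
The abstract does not spell out the statistical procedure, so the following is only a minimal sketch of the general idea it names: count short oligonucleotides (k-mers), then use a Poisson tail probability to flag those that occur near read starts far more often than expected if they were spread evenly along the reads. This is not Figaro's actual algorithm. The function names, the k-mer length `k`, the `prefix_len` window and the `alpha` cutoff are all illustrative assumptions.

```python
import math
from collections import Counter


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), lam > 0, by summing the tail."""
    if k <= 0:
        return 1.0
    total = 0.0
    i = k
    while True:
        # Poisson pmf evaluated in log space to avoid overflow in lam**i / i!
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Beyond the mean the terms only shrink, so stop once negligible.
        if i > lam and term <= 1e-18 * total:
            break
        i += 1
    return min(total, 1.0)


def overrepresented_prefix_kmers(reads, k=8, prefix_len=30, alpha=1e-6):
    """Flag k-mers that are enriched in the first prefix_len bases of reads.

    Hypothetical helper: all parameter defaults are assumptions chosen for
    illustration, not values taken from the paper.
    """
    prefix_counts = Counter()  # occurrences fully inside the read prefix
    total_counts = Counter()   # occurrences anywhere in the read
    n_prefix = 0
    n_total = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            total_counts[kmer] += 1
            n_total += 1
            if i + k <= prefix_len:
                prefix_counts[kmer] += 1
                n_prefix += 1

    flagged = {}
    for kmer, observed in prefix_counts.items():
        # Expected prefix count if this k-mer were spread evenly along reads.
        lam = total_counts[kmer] * n_prefix / n_total
        p_value = poisson_upper_tail(observed, lam)
        if p_value < alpha:
            flagged[kmer] = p_value
    return flagged
```

Called on a list of read strings, the function returns a dict mapping each flagged k-mer to its tail probability. A real trimmer would still need to turn the flagged k-mers into a clip coordinate for each read, which this sketch leaves out.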

AVAILABILITY

Figaro is released under an open-source license through the AMOS package (http://amos.sourceforge.net/Figaro).
