An optimized procedure greatly improves EST vector contamination removal.

Author information

Chen Yi-An, Lin Chang-Chun, Wang Chin-Di, Wu Huan-Bin, Hwang Pei-Ing

Affiliation

Bioinformatics Core Laboratory, Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan.

Publication information

BMC Genomics. 2007 Nov 13;8:416. doi: 10.1186/1471-2164-8-416.

DOI:10.1186/1471-2164-8-416

PMID:17997864

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2194723/

Abstract

BACKGROUND

The enormous amount of sequence data available in the public domain database has been a gold mine for researchers exploring various themes in life sciences, and hence the quality of such data is of serious concern to researchers. Removal of vector contamination is one of the most significant operations to obtain accurate sequence data containing only a cDNA insert from the basecalls output by an automatic DNA sequencer. Popular bioinformatics programs to accomplish vector trimming include LUCY, cross_match and SeqClean.

RESULTS

In a recent study, however, in which the program SeqClean was used to remove vector contamination from our test set of EST data compiled through various library construction systems, a significant number of errors remained after preliminary trimming. These errors were later almost completely corrected by simply using a re-linearized form of the cloning vector to compare against the target ESTs. The modified trimming procedure for SeqClean was also compared with the trimming efficiency of the other two popular programs, LUCY2 and cross_match. SeqClean used with a re-linearized form of the cloning vector significantly outperformed the other two programs in all tested conditions, while the performance of the other two programs was not influenced by the modified procedure. Vector contamination in dbEST was also investigated in this study: 2203 of the 48212 ESTs sampled from dbEST (2007-04-18 freeze) were found to match sequences in UNIVEC.

CONCLUSION

Vector contamination remains a serious concern for data quality in public sequence databases. Based on the results presented here, we feel that our modified procedure with SeqClean should be recommended to all researchers for the task of vector removal from EST or genomic sequences.
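
The abstract does not say where the cloning vector is cut when it is re-linearized, so the following Python snippet is only a minimal sketch of the general idea: a circular vector stored as a linear record can be rotated to start at a different position. The function name `relinearize`, the parameter `cut_pos`, and the toy sequence are all hypothetical and are not taken from the paper or from SeqClean.

```python
def relinearize(circular_seq: str, cut_pos: int) -> str:
    """Rotate a circular sequence so its linear form starts at cut_pos.

    cut_pos is a 0-based index into the stored linear record; it is a
    hypothetical parameter, since the source does not state the cut site.
    """
    if not circular_seq:
        raise ValueError("sequence must not be empty")
    cut = cut_pos % len(circular_seq)
    return circular_seq[cut:] + circular_seq[:cut]


# Toy vector record (made-up sequence, for illustration only).
stored = "AAAACCCCGGGGTTTT"
rotated = relinearize(stored, 8)

# A fragment that spans the start/end junction of the stored record is
# not contiguous in the stored form, but is contiguous after rotation.
junction_fragment = "TTTTAAAA"
print(junction_fragment in stored)   # False
print(junction_fragment in rotated)  # True
```

The rotated sequence could then be written to a FASTA file and supplied to the trimming program as the vector database. Which cut position actually gives the improvement reported above is described in the full text, not in this abstract.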

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/151c/2194723/5536ede2096b/1471-2164-8-416-1.jpg

Similar articles

An optimized procedure greatly improves EST vector contamination removal.

BMC Genomics. 2007 Nov 13;8:416. doi: 10.1186/1471-2164-8-416.

Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data.

BMC Biotechnol. 2012 May 3;12:16. doi: 10.1186/1472-6750-12-16.

Peanut gene expression profiling in developing seeds at different reproduction stages during Aspergillus parasiticus infection.

BMC Dev Biol. 2008 Feb 4;8:12. doi: 10.1186/1471-213X-8-12.

[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].

Yi Chuan Xue Bao. 2004 May;31(5):431-43.

ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs).

BMC Genomics. 2007 May 29;8:134. doi: 10.1186/1471-2164-8-134.

CleanEST: a database of cleansed EST libraries.

Nucleic Acids Res. 2009 Jan;37(Database issue):D686-9. doi: 10.1093/nar/gkn648. Epub 2008 Oct 2.

Generation and analysis of 113 adult stage Schistosoma japonicum (Chinese strain) expressed sequence tags.

Chin Med J (Engl). 2002 Oct;115(10):1517-20.

Human trash ESTs--sequences from cDNA collection that are not aligned to genome assembly.

J Bioinform Comput Biol. 2008 Aug;6(4):759-73. doi: 10.1142/s0219720008003709.

WebTraceMiner: a web service for processing and mining EST sequence trace files.

Nucleic Acids Res. 2007 Jul;35(Web Server issue):W137-42. doi: 10.1093/nar/gkm299. Epub 2007 May 8.

A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.

Bioinformatics. 1999 Feb;15(2):111-21. doi: 10.1093/bioinformatics/15.2.111.

Cited by

Caecilians maintain a functional long-wavelength-sensitive cone opsin gene despite signatures of relaxed selection and more than 200 million years of fossoriality.

bioRxiv. 2025 Feb 8:2025.02.07.636964. doi: 10.1101/2025.02.07.636964.

Chromosome-level and haplotype-resolved genome assembly of Bougainvillea glabra.

Sci Data. 2025 Jan 18;12(1):107. doi: 10.1038/s41597-024-04333-6.

A high-quality genome provides new insights into evolutionary history and pigment biosynthetic pathways in the Caryophyllales.

Hortic Res. 2023 Jun 13;10(8):uhad124. doi: 10.1093/hr/uhad124. eCollection 2023 Aug.

A highly contiguous genome assembly reveals sources of genomic novelty in the symbiotic fungus Rhizophagus irregularis.

G3 (Bethesda). 2023 Jun 1;13(6). doi: 10.1093/g3journal/jkad077.

Prediction of neuropeptide precursors and differential expression of adipokinetic hormone/corazonin-related peptide, hugin and corazonin in the brain of malaria vector during a infection.

Curr Res Insect Sci. 2021 Apr 22;1:100014. doi: 10.1016/j.cris.2021.100014. eCollection 2021.

Transcriptional Basis for Haustorium Formation and Host Establishment in Hemiparasitic Mistletoes.

Front Genet. 2022 Jun 13;13:929490. doi: 10.3389/fgene.2022.929490. eCollection 2022.

Molecular characterization of a flatworm Girardia isolate from Guanajuato, Mexico.

Dev Biol. 2022 Sep;489:165-177. doi: 10.1016/j.ydbio.2022.06.003. Epub 2022 Jun 13.

Hypothesis: Trans-splicing Generates Evolutionary Novelty in the Photosynthetic Amoeba Paulinella.

J Phycol. 2022 Jun;58(3):392-405. doi: 10.1111/jpy.13247. Epub 2022 Mar 25.

Elephant Genomes Reveal Accelerated Evolution in Mechanisms Underlying Disease Defenses.

Mol Biol Evol. 2021 Aug 23;38(9):3606-3620. doi: 10.1093/molbev/msab127.

TagSeq for gene expression in non-model plants: A pilot study at the Santa Rita Experimental Range NEON core site.

Appl Plant Sci. 2020 Nov 22;8(11):e11398. doi: 10.1002/aps3.11398. eCollection 2020 Nov.
