Protocol for using NoBadWordsCombiner to merge and minimize "bad words" from BLAST hits against multiple eukaryotic gene annotation databases.

Affiliations

Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS B3H 4R2, Canada.

Institute for Comparative Genomics, Dalhousie University, Halifax, NS B3H 4R2, Canada.

Publication information

STAR Protoc. 2021 Oct 16;2(4):100888. doi: 10.1016/j.xpro.2021.100888. eCollection 2021 Dec 17.

DOI:10.1016/j.xpro.2021.100888

PMID:34704076

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8521201/

Abstract

Annotating protein-coding genes can be challenging, especially when searching for the best hits against multiple functional databases. This is partly because of "bad words" appearing as top hits, such as hypothetical or uncharacterized proteins. To help alleviate some of these issues, we designed a bioinformatics tool called NoBadWordsCombiner, which efficiently merges the hits from various databases, strengthening gene definitions by minimizing functional descriptions containing "bad words." Unlike other available tools, NoBadWordsCombiner is user friendly, but it does require users to have some general bioinformatics skills, including a basic understanding of the BLAST package and dash shell in Linux/Unix environments. For complete details on the use and execution of this protocol, please refer to Zhang et al. (2021a).
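The merging idea described in the abstract can be illustrated with a minimal sketch. This is not the NoBadWordsCombiner implementation; it only assumes that each database contributes one top-hit description per gene and that a later database's hit should replace an earlier one only when the earlier description contains a "bad word". The word list, the function names `is_bad` and `combine_hits`, and the database labels are all hypothetical and chosen for illustration.

```python
# Assumed list of "bad words"; the real tool's list may differ.
BAD_WORDS = ("hypothetical", "uncharacterized")


def is_bad(description):
    """Return True if the description contains any assumed bad word."""
    text = description.lower()
    return any(word in text for word in BAD_WORDS)


def combine_hits(hits_by_db, db_priority):
    """Choose one (database, description) pair per gene.

    hits_by_db maps a database label to {gene_id: top-hit description}.
    Databases are visited in db_priority order. The first description
    without a bad word wins; if every hit is "bad", the first available
    hit is kept so the gene still has a definition.
    """
    combined = {}
    for db in db_priority:
        for gene_id, description in hits_by_db.get(db, {}).items():
            current = combined.get(gene_id)
            if current is None or (is_bad(current[1]) and not is_bad(description)):
                combined[gene_id] = (db, description)
    return combined


# Illustrative input; database labels and descriptions are made up.
hits = {
    "db_A": {"g1": "Hypothetical protein", "g2": "ATP synthase subunit beta"},
    "db_B": {
        "g1": "Heat shock protein 70",
        "g2": "uncharacterized protein",
        "g3": "hypothetical protein",
    },
}
result = combine_hits(hits, ["db_A", "db_B"])
for gene_id in sorted(result):
    print(gene_id, *result[gene_id], sep="\t")
```

In this sketch, `g1` takes its definition from `db_B` because the `db_A` hit is a "bad word" hit, `g2` keeps the informative `db_A` hit, and `g3` falls back to its only available hit.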

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acae/8521201/92ea34924f4f/fx1.jpg

References

Protocol for HSDFinder: Identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes.

STAR Protoc. 2021 Jun 23;2(3):100619. doi: 10.1016/j.xpro.2021.100619. eCollection 2021 Sep 17.

Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241.

iScience. 2021 Jan 20;24(2):102084. doi: 10.1016/j.isci.2021.102084. eCollection 2021 Feb 19.

BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences.

J Mol Biol. 2016 Feb 22;428(4):726-731. doi: 10.1016/j.jmb.2015.11.006. Epub 2015 Nov 14.

NCBI BLAST+ integrated into Galaxy.

Gigascience. 2015 Aug 25;4:39. doi: 10.1186/s13742-015-0080-7. eCollection 2015.

The simple fool's guide to population genomics via RNA-Seq: an introduction to high-throughput sequencing data analysis.

Mol Ecol Resour. 2012 Nov;12(6):1058-67. doi: 10.1111/1755-0998.12003. Epub 2012 Aug 29.

A beginner's guide to eukaryotic genome annotation.

Nat Rev Genet. 2012 Apr 18;13(5):329-42. doi: 10.1038/nrg3174.

Conserved 'hypothetical' proteins: new hints and new puzzles.

Comp Funct Genomics. 2001;2(1):14-8. doi: 10.1002/cfg.66.

InterProScan: protein domains identifier.

Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W116-20. doi: 10.1093/nar/gki442.

NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. doi: 10.1093/nar/gki025.

The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Nucleic Acids Res. 2003 Jan 1;31(1):365-70. doi: 10.1093/nar/gkg095.
