AnnotaPipeline: An integrated tool to annotate eukaryotic proteins using multi-omics data.

Author information

Maia Guilherme Augusto, Filho Vilmar Benetti, Kawagoe Eric Kazuo, Teixeira Soratto Tatiany Aparecida, Moreira Renato Simões, Grisard Edmundo Carlos, Wagner Glauber

Affiliations

Laboratório de Bioinformática, Universidade Federal de Santa Catarina (UFSC), Campus João David Ferreira Lima, Florianópolis, Brazil.

Instituto Federal de Santa Catarina (IFSC), Campus Lages, Lages, Brazil.

Publication information

Front Genet. 2022 Nov 22;13:1020100. doi: 10.3389/fgene.2022.1020100. eCollection 2022.

DOI: 10.3389/fgene.2022.1020100

PMID: 36482896

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9723129/

Abstract

Assignment of gene function has been a crucial, laborious, and time-consuming step in genomics. Because a variety of sequencing platforms generate increasing amounts of data, manual annotation is no longer feasible. Thus, an integrated, automated pipeline that uses experimental data to validate predicted gene function is of utmost relevance. Here, we present a computational workflow named AnnotaPipeline that integrates distinct software and data types in a proteogenomic approach to annotate and validate predicted features in genomic sequences. Starting from (i) nucleotide or (ii) protein sequences in FASTA format, or (iii) structural annotation files (GFF3), users can also supply RNA-seq data in FASTQ format and MS/MS data in mzXML or similar formats. The pipeline uses both transcriptomic and proteomic information to corroborate annotations and validate gene prediction, providing transcription and expression evidence for functional annotation. Reannotation of the available [...] genomes was performed using AnnotaPipeline, resulting in a higher proportion of annotated proteins and a lower proportion of hypothetical proteins than in the publicly available annotations for these organisms. AnnotaPipeline is a Unix-based pipeline developed in Python and is available at: https://github.com/bioinformatics-ufsc/AnnotaPipeline.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7cff/9723129/6faa173b0997/fgene-13-1020100-g001.jpg
