Maia Guilherme Augusto, Filho Vilmar Benetti, Kawagoe Eric Kazuo, Teixeira Soratto Tatiany Aparecida, Moreira Renato Simões, Grisard Edmundo Carlos, Wagner Glauber
Laboratório de Bioinformática, Universidade Federal de Santa Catarina (UFSC), Campus João David Ferreira Lima, Florianópolis, Brazil.
Instituto Federal de Santa Catarina (IFSC), Campus Lages, Lages, Brazil.
Front Genet. 2022 Nov 22;13:1020100. doi: 10.3389/fgene.2022.1020100. eCollection 2022.
Assignment of gene function has been a crucial, laborious, and time-consuming step in genomics. Because a variety of sequencing platforms generate increasing amounts of data, manual annotation is no longer feasible. An integrated, automated pipeline that uses experimental data to validate predicted gene functions is therefore of utmost relevance. Here, we present a computational workflow named AnnotaPipeline that integrates distinct software and data types in a proteogenomic approach to annotate and validate predicted features in genomic sequences. Starting from FASTA files of (i) nucleotide or (ii) protein sequences, or from (iii) structural annotation files (GFF3), users can also supply FASTQ RNA-seq data and MS/MS data in mzXML or similar formats. The pipeline uses both transcriptomic and proteomic information to corroborate annotations and validate gene prediction, providing transcription and expression evidence for functional annotation. Reannotation of several available genomes was performed using AnnotaPipeline, resulting in a higher proportion of annotated proteins and a lower proportion of hypothetical proteins than in the publicly available annotations for these organisms. AnnotaPipeline is a Unix-based pipeline developed in Python and is available at: https://github.com/bioinformatics-ufsc/AnnotaPipeline.