TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation.

Authors

Pektas Aysevil, Panitz Frank, Thomsen Bo

Affiliations

Department of Molecular Biology and Genetics, Aarhus University, 8000 Aarhus, Denmark.

Applied Statistical Methods, Natural Resources Institute Finland (Luke), 20520 Turku, Finland.

Publication information

Genes (Basel). 2024 Nov 29;15(12):1547. doi: 10.3390/genes15121547.

DOI:10.3390/genes15121547

PMID:39766814

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11727683/

Abstract

Background: Transcriptome assembly and functional annotation are essential in understanding gene expression and biological function. Nevertheless, many existing pipelines lack the flexibility to integrate both short- and long-read sequencing data or fail to provide a complete, customizable workflow for transcriptome analysis, particularly for non-model organisms.

Methods: We present TrAnnoScope, a transcriptome analysis pipeline designed to process Illumina short-read and PacBio long-read data. The pipeline provides a complete, customizable workflow to generate high-quality, full-length (FL) transcripts with broad functional annotation. Its modular design allows users to adapt specific analysis steps for other sequencing platforms or data types. The pipeline encompasses steps from quality control to functional annotation, employing tools and established databases such as SwissProt, Pfam, Gene Ontology (GO), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and Eukaryotic Orthologous Groups (KOG). As a case study, TrAnnoScope was applied to RNA-Seq and Iso-Seq data from zebra finch brain, ovary, and testis tissue.

Results: The zebra finch transcriptome generated by TrAnnoScope from the brain, ovary, and testis tissue demonstrated strong alignment with the reference genome (99.63%), and it was found that 93.95% of the matched protein sequences in the zebra finch proteome were captured as nearly complete. Functional annotation provided matches to known protein databases and assigned relevant functional terms to the majority of the transcripts.

Conclusions: TrAnnoScope successfully integrates short- and long-read sequencing technologies to generate transcriptomes with minimal user input. Its modularity and ease of use make it a valuable tool for researchers analyzing complex datasets, particularly for non-model organisms.
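
The modular design described in the abstract, in which users run or adapt individual analysis steps, can be pictured with a small sketch. This is plain Python, not Snakemake, and it does not reproduce TrAnnoScope's actual rules or interface. The module names (`quality_control`, `transcript_processing`, `functional_annotation`) and the `run_pipeline` helper are hypothetical, chosen only to mirror the "quality control to functional annotation" span the abstract mentions.

```python
from typing import Callable, Dict, List

State = Dict[str, List[str]]


def quality_control(state: State) -> State:
    # Placeholder step: a real pipeline would filter reads here.
    state["log"].append("quality_control")
    return state


def transcript_processing(state: State) -> State:
    # Placeholder step: a real pipeline would build full-length transcripts here.
    state["log"].append("transcript_processing")
    return state


def functional_annotation(state: State) -> State:
    # Placeholder step: a real pipeline would query annotation databases here.
    state["log"].append("functional_annotation")
    return state


# Ordered registry of modules. The names are illustrative assumptions,
# not identifiers taken from TrAnnoScope.
MODULES: Dict[str, Callable[[State], State]] = {
    "quality_control": quality_control,
    "transcript_processing": transcript_processing,
    "functional_annotation": functional_annotation,
}


def run_pipeline(selected: List[str]) -> List[str]:
    """Run only the selected modules, in registry order, and return the log."""
    unknown = [name for name in selected if name not in MODULES]
    if unknown:
        raise ValueError(f"unknown module(s): {unknown}")
    state: State = {"log": []}
    for name, module in MODULES.items():
        if name in selected:
            state = module(state)
    return state["log"]


if __name__ == "__main__":
    # Selection order does not matter; execution follows the registry order.
    print(run_pipeline(["functional_annotation", "quality_control"]))
```

Keeping the steps in an ordered registry means a user can skip a step, or swap in a replacement function for another sequencing platform, without touching the other steps.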
