UnivLyon, Université Claude Bernard Lyon 1, ENS de Lyon, CNRS UMR, INSERM U1210, LBMC, F-69007, Lyon, France.
UnivLyon, Université Claude Bernard Lyon 1, CNRS, UMR, LBBE, F-69100, Villeurbanne, France.
Bioinformatics. 2019 Jul 1;35(13):2199-2207. doi: 10.1093/bioinformatics/bty903.
RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction.
We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assembly. CAARS then incorporates the transcripts into gene families, builds gene alignments and trees, and uses phylogenetic information to classify the genes as orthologs or paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as references, a difficult case for this kind of analysis. We show that CAARS assemblies are more complete and accurate than those produced by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity to a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, that are directly usable for downstream comparative analyses.
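To make the ortholog/paralog classification step concrete, here is a minimal Python sketch of one common way to derive such labels from a gene tree, the species-overlap heuristic. It is an illustration only and is not taken from the CAARS source code; the abstract does not say which method CAARS uses. The nested-tuple tree encoding, the `species|gene` leaf naming, and the function names are all assumptions made for this example, and the tree is assumed to be rooted and binary.

```python
from itertools import combinations

# Hypothetical encoding: a gene tree is a nested tuple, and each leaf is a
# string of the form "species|gene". None of these names come from CAARS.

def leaves(node):
    """Return all leaf labels below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def species(leaf):
    """Extract the species part of a 'species|gene' leaf label."""
    return leaf.split("|")[0]

def classify_pairs(tree):
    """Label every pair of genes as 'ortholog' or 'paralog'.

    Species-overlap heuristic: an internal node whose child subtrees share
    at least one species is treated as a duplication (pairs split by it are
    paralogs); otherwise it is treated as a speciation (pairs are orthologs).
    """
    relations = {}

    def visit(node):
        if isinstance(node, str):
            return
        child_leaves = [leaves(child) for child in node]
        for left, right in combinations(child_leaves, 2):
            shared = {species(a) for a in left} & {species(b) for b in right}
            label = "paralog" if shared else "ortholog"
            for a in left:
                for b in right:
                    relations[frozenset((a, b))] = label
        for child in node:
            visit(child)

    visit(tree)
    return relations

# Toy family: two gene copies (A1, A2) in mouse, plus a newly assembled
# transcript T2 from a hypothetical "newrodent" sample.
tree = (("mouse|A1", "rat|A1"), ("mouse|A2", "newrodent|T2"))
relations = classify_pairs(tree)
for pair, label in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), label)
```

In this toy tree, mouse appears on both sides of the root, so the root is treated as a duplication: the new transcript `newrodent|T2` is labelled an ortholog of `mouse|A2` and a paralog of `mouse|A1` and `rat|A1`. A production pipeline would typically reconcile the gene tree with a species tree rather than rely on this heuristic alone.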
CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars.
Supplementary data are available at Bioinformatics online.