InBioS - PhytoSYSTEMS, Eukaryotic Phylogenomics, University of Liège, Liège, Belgium.
Station D'Ecologie Théorique Et Expérimentale de Moulis, UMR CNRS 5321, Moulis, France.
BMC Res Notes. 2021 Apr 17;14(1):143. doi: 10.1186/s13104-021-05553-4.
Identifying orthology relationships among sequences is essential to understanding evolution, the diversity of life, and ancestry among organisms. To build alignments of orthologous sequences, phylogenomic pipelines often start with all-vs-all similarity searches, followed by a clustering step. For the resulting protein clusters (orthogroups) to be as accurate as possible, good-quality proteomes are needed. Here, our objective is to assemble a data set especially suited to the phylogenomic study of algae and formerly photosynthetic eukaryotes. This requires the proper integration of organellar data, so that several copies of one gene (paralogs) can be distinguished, taking their cellular compartment into account where necessary.
We submitted 73 top-quality and taxonomically diverse proteomes to OrthoFinder. We obtained 47,266 orthogroups, of which 11,775 contained at least two algae. Whenever possible, sequences were functionally annotated with eggNOG and tagged according to their genomic and target compartment(s). We then aligned the orthogroups and computed their phylogenetic trees with IQ-TREE. Finally, these trees were further processed by identifying and pruning the subtrees composed exclusively of plastid-bearing organisms, yielding a set of 31,784 clans suitable for studying genome evolution in photosynthetic organisms.
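The two selection steps described above (keeping orthogroups with at least two algae, then extracting subtrees made only of plastid-bearing organisms) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical `Organism|sequence_id` naming scheme, trees written as nested tuples and treated as rooted for simplicity, and "at least two algae" read as at least two distinct algal organisms. All function names and example data are invented for illustration.

```python
from typing import Dict, List, Set, Tuple, Union

# A tree is either a leaf name ("Organism|seq_id") or a tuple of child subtrees.
# This nested-tuple form is an assumption made for illustration only.
Tree = Union[str, Tuple]


def organism_of(seq_name: str) -> str:
    """Extract the organism from a hypothetical 'Organism|seq_id' label."""
    return seq_name.split("|", 1)[0]


def orthogroups_with_algae(orthogroups: Dict[str, List[str]],
                           algae: Set[str],
                           min_algae: int = 2) -> List[str]:
    """Keep orthogroups containing at least `min_algae` distinct algal organisms."""
    kept = []
    for og_id, seqs in orthogroups.items():
        algal_organisms = {organism_of(s) for s in seqs} & algae
        if len(algal_organisms) >= min_algae:
            kept.append(og_id)
    return kept


def leaves(tree: Tree) -> List[str]:
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def plastid_only_subtrees(tree: Tree,
                          plastid_bearing: Set[str],
                          min_size: int = 2) -> List[List[str]]:
    """Return the leaf sets of maximal subtrees made only of plastid-bearing organisms."""
    names = leaves(tree)
    if all(organism_of(n) in plastid_bearing for n in names):
        # The whole subtree qualifies, so it is reported once and not descended into.
        return [names] if len(names) >= min_size else []
    if isinstance(tree, str):
        return []  # a single leaf from an organism without plastids
    found = []
    for child in tree:
        found.extend(plastid_only_subtrees(child, plastid_bearing, min_size))
    return found


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    groups = {
        "OG1": ["Chlamydomonas|a1", "Porphyra|d1", "Homo|c1"],
        "OG2": ["Chlamydomonas|a2", "Chlamydomonas|a3", "Homo|c2"],
    }
    print(orthogroups_with_algae(groups, {"Chlamydomonas", "Porphyra"}))

    toy_tree = ((("Chlamydomonas|a1", "Arabidopsis|b1"), "Homo|c1"),
                ("Porphyra|d1", ("Chlamydomonas|a2", "Phaeodactylum|e1")))
    plastid = {"Chlamydomonas", "Arabidopsis", "Porphyra", "Phaeodactylum"}
    for clan in plastid_only_subtrees(toy_tree, plastid):
        print(clan)
```

Real gene trees are typically unrooted, where a clan corresponds to one side of a bipartition. A production version would therefore need to consider every possible rooting, for instance with a dedicated tree library, rather than the fixed rooting assumed here.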