USDA-Agricultural Research Service, Corn Insects and Crop Genetics Research Unit, 819 Wallace Rd., Ames, IA 50011, United States.
ORISE Fellow, USDA-Agricultural Research Service, Corn Insects and Crop Genetics Research Unit, 819 Wallace Rd., Ames, IA 50011, United States.
Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae526.
Identification of allelic or corresponding genes (pan-genes) within a species or genus is important for discovery of biologically significant genetic conservation and variation. Similarly, identification of orthologs (gene families) across wider evolutionary distances is important for understanding the genetic basis for similar or differing traits. Especially in plants, several complications make identification of pan-genes and gene families challenging, including whole-genome duplications, evolutionary rate differences among lineages, and varying qualities of assemblies and annotations. Here, we document and distribute a set of workflows that we have used to address these problems.
Pandagma is a set of configurable workflows for identifying and comparing pan-gene sets and gene families across annotation sets from eukaryotic genomes, using a combination of homology, synteny, and expected rates of synonymous change in coding sequence.
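To make the combination of evidence types concrete, the following is a minimal illustrative sketch, not Pandagma's implementation. It assumes gene pairs have already been scored by some upstream search, and it keeps only pairs supported by all three signals named above: sequence identity (homology), membership in a collinear block (synteny), and a synonymous-substitution rate (Ks) no higher than an expected ceiling. The `GenePair` record, the function names, the thresholds `min_identity=80.0` and `max_ks=0.5`, and the use of simple connected-component grouping are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GenePair:
    """Hypothetical record for one scored gene pair between two annotations."""
    gene_a: str
    gene_b: str
    identity: float  # percent identity of the alignment (homology evidence)
    syntenic: bool   # True if the pair lies in a collinear block (synteny evidence)
    ks: float        # synonymous substitutions per synonymous site


def filter_pairs(pairs, min_identity=80.0, max_ks=0.5):
    """Keep pairs supported by homology, synteny, and an expected Ks ceiling.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    return [
        p for p in pairs
        if p.identity >= min_identity and p.syntenic and p.ks <= max_ks
    ]


def cluster(pairs):
    """Group genes into candidate sets via connected components (union-find).

    This is a deliberate simplification used here for brevity.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p in pairs:
        root_a, root_b = find(p.gene_a), find(p.gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Made-up gene identifiers and scores, for demonstration only.
    example = [
        GenePair("A.g1", "B.g1", 95.0, True, 0.10),
        GenePair("B.g1", "C.g1", 92.0, True, 0.20),
        GenePair("A.g2", "B.g2", 90.0, True, 0.15),
        GenePair("A.g1", "B.g9", 85.0, True, 1.40),   # Ks too high: likely an older duplication
        GenePair("A.g2", "C.g7", 88.0, False, 0.10),  # homologous but not syntenic
    ]
    for gene_set in cluster(filter_pairs(example)):
        print(gene_set)
```

In this sketch the Ks ceiling is what separates recently diverged corresponding genes from pairs that trace back to an older whole-genome duplication, which is one of the complications the abstract names for plant genomes.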
The Pandagma workflows, example configurations, implementation details, and scripts for retrieving public datasets are available at https://github.com/legumeinfo/pandagma.