Molecular, Cellular, and Biomedical Sciences Department, University of New Hampshire, Durham, NH, 03824, USA.
Hubbard Center for Genome Studies, University of New Hampshire, Durham, NH, 03824, USA.
BMC Ecol Evol. 2021 Mar 16;21(1):43. doi: 10.1186/s12862-021-01772-2.
Phylogenomic approaches have great power to reconstruct evolutionary histories; however, they rely on multi-step processes in which each stage has the potential to affect the accuracy of the final result. Many studies have empirically tested and established methodology for resolving robust phylogenies, including selecting appropriate evolutionary models, identifying orthologs, or isolating partitions with strong phylogenetic signal. However, few have investigated errors that may be initiated at earlier stages of the analysis. Biases introduced during the generation of the phylogenomic dataset itself could produce downstream effects on analyses of evolutionary history. Transcriptomes are widely used in phylogenomic studies, though there is little understanding of how a poor-quality assembly of these datasets could impact the accuracy of phylogenomic hypotheses. Here we examined how transcriptome assembly quality affects phylogenomic inferences by creating independent datasets from the same input data representing high-quality and low-quality transcriptome assembly outcomes.
By studying the performance of phylogenomic datasets derived from alternative high- and low-quality assembly inputs in a controlled experiment, we show that high-quality transcriptomes produce richer phylogenomic datasets with a greater number of unique partitions than low-quality assemblies. High-quality assemblies also give rise to partitions that have lower alignment ambiguity and less compositional bias. In addition, high-quality partitions hold stronger phylogenetic signal than their low-quality transcriptome assembly counterparts in both concatenation- and coalescent-based analyses.
Our findings demonstrate the importance of transcriptome assembly quality in phylogenomic analyses and suggest that a portion of the uncertainty observed in such studies could be alleviated at the assembly stage.
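The compositional bias of a partition, mentioned in the results above, can be summarized in several ways. The sketch below uses relative composition variability (RCV), one commonly used measure; the abstract does not say which metric the authors applied, so this is an illustrative assumption. The function name, the input layout (a dictionary of taxon name to aligned sequence), and the choice to ignore the characters `-`, `?`, and `X` are likewise assumptions made for the example.

```python
from collections import Counter

# Characters treated as missing data and left out of the composition counts.
# This exclusion is an assumption of the sketch, not part of the source.
MISSING = set("-?X")


def relative_composition_variability(alignment):
    """Return the RCV of one aligned partition.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    RCV = sum over characters and taxa of |count - mean count|,
          divided by (number of taxa * number of sites).
    Higher values indicate more compositional heterogeneity among taxa.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        raise ValueError("alignment is empty")
    n_taxa = len(seqs)
    n_sites = len(seqs[0])
    if n_sites == 0 or any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    # Per-taxon character counts; a Counter returns 0 for absent characters.
    counts = [Counter(s) for s in seqs]
    symbols = set().union(*counts) - MISSING

    total_deviation = 0.0
    for sym in symbols:
        mean_count = sum(c[sym] for c in counts) / n_taxa
        total_deviation += sum(abs(c[sym] - mean_count) for c in counts)

    return total_deviation / (n_taxa * n_sites)


if __name__ == "__main__":
    # Toy partition with hypothetical taxon names.
    partition = {
        "taxon_a": "ACGTACGT",
        "taxon_b": "ACGTACGA",
        "taxon_c": "ACGAACGA",
    }
    print(relative_composition_variability(partition))
```

Under these assumptions, computing the value for each partition in the high-quality and low-quality datasets and comparing the two distributions would give one rough view of the difference in compositional bias described above.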