Suissa Jacob S, De La Cerda Gisel Y, Graber Leland C, Jelley Chloe, Wickell David, Phillips Heather R, Grinage Ayress D, Moreau Corrie S, Specht Chelsea D, Doyle Jeff J, Landis Jacob B
Department of Ecology and Evolutionary Biology, University of Tennessee at Knoxville, Knoxville, Tennessee, USA.
School of Integrative Plant Science, Section of Plant Biology and the L. H. Bailey Hortorium, Cornell University, Ithaca, New York, USA.
Appl Plant Sci. 2024 Aug 9;12(6):e11611. doi: 10.1002/aps3.11611. eCollection 2024 Nov-Dec.
There is a general lack of consensus on the best practices for filtering of single-nucleotide polymorphisms (SNPs) and whether it is better to use SNPs or include flanking regions (full "locus") in phylogenomic analyses and subsequent comparative methods.
Using genotyping-by-sequencing data from 22 species, we assessed the effects of SNP vs. locus usage and SNP retention stringency. We compared branch length, node support, and divergence time estimation across 16 datasets with varying amounts of missing data and total size.
Our results revealed five aspects of phylogenomic data usage that may be generally applicable: (1) tree topology is largely congruent across analyses; (2) filtering strictly for SNP retention (e.g., 90-100%) reduces support and can alter some inferred relationships; (3) absolute branch lengths vary by two orders of magnitude between SNP and locus datasets; (4) data type and branch length variation have little effect on divergence time estimation; and (5) phylograms alter the estimation of ancestral states and rates of morphological evolution.
Using SNP or locus datasets does not alter phylogenetic inference significantly, unless researchers want or need to use absolute branch lengths. We recommend against using excessive filtering thresholds for SNP retention to reduce the risk of producing inconsistent topologies and generating low support.
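To make the idea of SNP retention stringency concrete, the sketch below filters a toy alignment by the fraction of samples that have a called base at each site. It is an illustration only, not the pipeline used in the study: the function names, the use of "N" as the missing-data symbol, and the toy sequences are all assumptions introduced here.

```python
from typing import Dict

MISSING = "N"  # assumed symbol for an uncalled genotype


def site_retention(column: str) -> float:
    """Return the fraction of samples with a called base at one site."""
    called = sum(1 for base in column if base != MISSING)
    return called / len(column)


def filter_sites(alignment: Dict[str, str], min_retention: float) -> Dict[str, str]:
    """Keep only sites whose retention is at least min_retention."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    keep = [
        i
        for i in range(n_sites)
        if site_retention("".join(alignment[n][i] for n in names)) >= min_retention
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Hypothetical four-sample, five-site SNP alignment.
toy = {
    "sp1": "ACGTN",
    "sp2": "ANGTN",
    "sp3": "ACNTN",
    "sp4": "ACGTA",
}

# Stricter thresholds discard more sites.
for threshold in (0.25, 0.75, 1.0):
    kept = filter_sites(toy, threshold)
    print(threshold, len(kept["sp1"]))
```

In this toy case, raising the threshold from 0.25 to 1.0 shrinks the alignment from five sites to two, which shows how strict retention filtering reduces the amount of data available to the analysis.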