Department of Biology, Stanford University, Stanford, California 94305, USA.
Genome Res. 2011 Jun;21(6):863-74. doi: 10.1101/gr.115949.110. Epub 2011 Mar 10.
We investigate how the choice of aligner affects inferences of positive selection made with site-specific models of molecular evolution. We find that, regardless of the aligner used, the rate of false positives is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in 12 Drosophila genomes that are annotated either in all 12 species (~6690 genes) or in the six melanogaster group species. We compare six popular aligners (PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE) and find that aligner choice strongly influences estimates of positive selection. Differences persist when we use (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps and/or additional masking, (4) per-site versus per-gene statistics, and (5) the closely related melanogaster group species versus the more distantly related 12 Drosophila genomes. Furthermore, these differences are consequential for downstream analyses, such as determining over- or under-represented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are in fact misaligned at the codon level, resulting in false-positive rates of 48%-82%. PRANK, which has been reported to outperform other aligners in simulations, also performed best in our empirical study. Even so, PRANK had a false-positive rate of 50%-55%, which is unacceptably high for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as the primary culprits for the high misalignment-related error levels, and we discuss possible workarounds for this apparently pervasive problem in genome-wide evolutionary analyses.
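As a rough illustration of the two quantities the abstract relies on, the sketch below computes (a) pairwise agreement between the gene sets that different aligners flag as positively selected and (b) a false-positive rate from a visual inspection tally. It is not the authors' pipeline: the Jaccard index is a stand-in agreement measure chosen here for simplicity, the function names `jaccard`, `pairwise_overlap`, and `false_positive_rate` are hypothetical, and the gene names and counts are toy values, not results from the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(calls):
    """Map each aligner pair (names in sorted order) to the Jaccard index
    of the gene sets inferred as positively selected under each aligner."""
    return {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }


def false_positive_rate(n_inferred, n_misaligned):
    """Fraction of inferred positively selected sites judged misaligned
    on visual inspection."""
    if n_inferred <= 0:
        raise ValueError("n_inferred must be positive")
    if not 0 <= n_misaligned <= n_inferred:
        raise ValueError("n_misaligned must lie between 0 and n_inferred")
    return n_misaligned / n_inferred


# Toy input (assumption): made-up gene identifiers, not data from the paper.
calls = {
    "PRANK": {"geneA", "geneB", "geneC"},
    "MUSCLE": {"geneB", "geneC", "geneD", "geneE"},
    "ClustalW": {"geneC", "geneE", "geneF"},
}

for pair, score in pairwise_overlap(calls).items():
    print(pair, round(score, 2))

# Toy tally (assumption): 11 of 20 inspected sites look misaligned.
print(false_positive_rate(20, 11))
```

A low Jaccard value for a pair of aligners would mean that the two alignments lead to largely different sets of candidate genes, which is the kind of aligner dependence the abstract describes.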