Hubley Robert, Wheeler Travis J, Smit Arian F A
Institute for Systems Biology, Seattle, WA 98109, USA.
Department of Computer Science, University of Montana, Missoula, MT 59801, USA.
NAR Genom Bioinform. 2022 May 17;4(2):lqac040. doi: 10.1093/nargab/lqac040. eCollection 2022 Jun.
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for simulated sequences of low to medium divergence, while Refiner is uniquely effective when tasked with aligning highly divergent and fragmented instances of a family.
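To make the idea of scoring a predicted MSA against a simulated reference concrete, the following is a minimal sketch of a sum-of-pairs style accuracy measure: the fraction of residue pairs aligned in the reference that are also aligned in the prediction. The abstract does not state which accuracy metric the benchmark uses, so this metric is an assumption chosen for illustration, and the names `aligned_pairs` and `sum_of_pairs_score` and the toy sequences are hypothetical.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column.

    msa is a list of equal-length gapped strings, with '-' as the gap.
    Each residue is identified as (row index, ungapped position).
    """
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all rows must have the same length")
    counters = [0] * len(msa)  # next ungapped position for each row
    pairs = set()
    for col in range(len(msa[0]) if msa else 0):
        present = []
        for i, row in enumerate(msa):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        # Every two residues in the same column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(reference, predicted):
    """Fraction of reference residue pairs recovered by the prediction."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(predicted)) / len(ref)


# Toy example (hypothetical sequences): the prediction omits both gaps.
reference = ["ACGT-A", "AC-TGA"]
predicted = ["ACGTA", "ACTGA"]
print(sum_of_pairs_score(reference, predicted))  # 0.75
```

Rows must appear in the same order in both alignments, since residues are keyed by row index. A simulator is useful for this kind of scoring because the true alignment is known by construction, which is not the case for real TE copies.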