Towards realistic benchmarks for multiple alignments of non-coding sequences.

Affiliation

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Publication information

BMC Bioinformatics. 2010 Jan 26;11:54. doi: 10.1186/1471-2105-11-54.

Abstract

BACKGROUND

With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.

RESULTS

We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.
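
The contrast between simulating from averaged parameters and simulating from a distribution of parameters can be illustrated with a minimal sketch. It is not the authors' simulator: it models substitutions only (no insertions or deletions), and the rate values, region count, sequence length, and all function names are hypothetical choices made for illustration.

```python
import random
import statistics

BASES = "ACGT"

def mutate(seq, sub_rate, rng):
    """Substitute each site independently with probability sub_rate."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def simulate(region_rates, length, rng):
    """Simulate one ancestor/descendant pair per region; return identities."""
    identities = []
    for rate in region_rates:
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        identities.append(identity(ancestor, mutate(ancestor, rate, rng)))
    return identities

rng = random.Random(0)

# Hypothetical per-region substitution rates standing in for a
# genome-wide empirical distribution (not values from the paper).
empirical_rates = [0.02, 0.05, 0.10, 0.20, 0.40]
n_regions = 200
length = 500

# Conventional approach: every region uses the same averaged rate.
mean_rate = statistics.mean(empirical_rates)
avg_ids = simulate([mean_rate] * n_regions, length, rng)

# Distribution-based approach: each region draws its own rate.
sampled_rates = [rng.choice(empirical_rates) for _ in range(n_regions)]
dist_ids = simulate(sampled_rates, length, rng)

spread_avg = statistics.pstdev(avg_ids)
spread_dist = statistics.pstdev(dist_ids)
print(f"identity spread, averaged rate:  {spread_avg:.3f}")
print(f"identity spread, sampled rates:  {spread_dist:.3f}")
```

Under these assumptions, the regions simulated with a single averaged rate cluster tightly around one conservation level, while the regions with sampled rates span a much wider range, which is the kind of variation in conservation the abstract says averaged parameters fail to reproduce.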

CONCLUSION

We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
