Novosibirsk State University, Pirogova, 1, Novosibirsk, 630090, Russia.
Wellcome Trust Sanger Institute, Cambridge, UK.
BMC Genomics. 2018 Feb 9;19(Suppl 3):92. doi: 10.1186/s12864-018-4475-6.
Using artificial data to evaluate the performance of aligners and peak callers not only improves the accuracy and reliability of the evaluation, but also makes it possible to reduce the computational time. One natural way to achieve this reduction is to map reads to a single chromosome.
We investigated whether single-chromosome mapping introduces any artefacts into aligner performance. We compared the accuracy of seven aligners on well-controlled simulated benchmark data sampled from a single chromosome and from a whole genome. We found that commonly used statistical methods are insufficient to evaluate aligner performance, and applied a novel measure of read density distribution similarity, which revealed artefacts in the aligners' performance. We also calculated mismatch statistics and constructed mismatch frequency distributions along the read.
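As a rough illustration of the two kinds of quantities mentioned above, the following Python sketch bins read start positions into a density profile, compares two profiles, and tabulates mismatch frequency by read position. It is not the measure proposed in the paper, which this abstract does not define: histogram intersection is used here only as a simple stand-in for a density similarity score, and the function names, the fixed bin size, and the input layout (lists of 0-based start positions, and pairs of read and reference strings of equal length) are all assumptions made for the example.

```python
def read_density(starts, chrom_length, bin_size):
    """Count read start positions (0-based) in fixed-width bins along a chromosome."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for s in starts:
        if 0 <= s < chrom_length:  # ignore positions outside the chromosome
            counts[s // bin_size] += 1
    return counts


def density_overlap(a, b):
    """Histogram intersection of two normalised density profiles.

    Stand-in similarity score (an assumption, not the paper's measure):
    1.0 means identical shape, 0.0 means no shared bins.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of bins")
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(x / total_a, y / total_b) for x, y in zip(a, b))


def mismatch_frequency(pairs, read_length):
    """Fraction of reads that mismatch the reference at each read position.

    `pairs` is an iterable of (read_sequence, reference_sequence) strings,
    assumed to be gap-free and aligned position by position.
    """
    mismatches = [0] * read_length
    totals = [0] * read_length
    for read, ref in pairs:
        for i, (r, g) in enumerate(zip(read[:read_length], ref)):
            totals[i] += 1
            if r != g:
                mismatches[i] += 1
    return [m / t if t else 0.0 for m, t in zip(mismatches, totals)]
```

In a benchmark of this kind, one profile could come from the simulated (true) read positions and the other from the positions reported by an aligner, so that a low overlap flags a distortion that summary counts of correctly mapped reads might miss.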
Generating artificial data by mapping reads generated from a single chromosome to a reference chromosome is justified as a way to reduce benchmarking time. The proposed quality assessment method identifies inherent shortcomings of aligners that are not detected by conventional statistical methods and that can affect the quality of alignment of real data.