Alanko Jarno, Bannai Hideo, Cazaux Bastien, Peterlongo Pierre, Stoye Jens
1Department of Computer Science, University of Helsinki, Helsinki, Finland.
2Department of Informatics, Kyushu University, Fukuoka, Japan.
Algorithms Mol Biol. 2020 Feb 10;15:2. doi: 10.1186/s13015-020-0163-6. eCollection 2020.
Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
近期大规模的群体测序工作使得以前所未有的详细程度识别出显示自然选择特征的基因组区域成为可能。然而,从个体单倍型数据中识别此类区域的传统方法需要耗费大量计算时间,因此不适用于当前的数据集。2019年,库尼亚等人(《生物信息学与计算生物学进展:第11届巴西生物信息学研讨会,2018年巴西尼特罗伊,2018年10月30日至11月1日,会议录,2018. 10.1007/978 - 3 - 030 - 01722 - 4_3》)提出了一种非常简单的组合模式,作为一种执行全基因组快速选择扫描新方法的基础。然而,他们提出的用于识别这些区域的算法,其最坏情况运行时间在基因组长度上是二次方的。是否存在最优的线性时间算法还是一个未解决的问题。在本文中,我们给出了两种达到该时间界限的算法,一种在概念上非常简单,使用后缀树,另一种使用位置布罗算法,在实践中也非常高效。