Hub de Bioinformatique et Biostatistique - Département Biologie Computationnelle, Institut Pasteur, USR 3756, CNRS, 75015 Paris, France.
F1000Res. 2020 Nov 10;9:1309. doi: 10.12688/f1000research.26930.1. eCollection 2020.
Recently developed MinHash-based techniques have proven successful at quickly estimating the level of similarity between large nucleotide sequences. This article discusses their use and limitations in practice for approximating uncorrected distances between genomes and for transforming these pairwise dissimilarities into proper evolutionary distances. Notably, it is shown that complex distance measures can be easily approximated using simple transformation formulae based on a few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point is assessed with a simulation study using a dedicated bioinformatics tool.
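To make the two steps named in the abstract concrete (estimating similarity with MinHash, then transforming it into a distance), the following is a minimal Python sketch rather than the article's own procedure or tool. It assumes a bottom-s sketch of hashed k-mers, the widely used binomial-model conversion from a Jaccard estimate j to an uncorrected distance, p = 1 - (2j/(1+j))^(1/k), and the Jukes-Cantor correction as one simple example of a transformation into an evolutionary distance. The function names, the choice of BLAKE2b as hash function, and the default parameters are all illustrative assumptions; the transformation formulae actually studied in the article may differ.

```python
import hashlib
import math
import random


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hash values."""
    return sorted({hash64(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def p_distance(j, k):
    """Uncorrected distance from a Jaccard estimate (assumed binomial model)."""
    if j <= 0.0:
        return 1.0
    return 1.0 - (2.0 * j / (1.0 + j)) ** (1.0 / k)


def jukes_cantor(p):
    """Jukes-Cantor correction of an uncorrected distance p (example only)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def minhash_distance(seq_a, seq_b, k=11, s=500):
    """Chain the steps: sketches, Jaccard estimate, p-distance, correction."""
    j = jaccard_estimate(bottom_sketch(seq_a, k, s), bottom_sketch(seq_b, k, s), s)
    return jukes_cantor(p_distance(j, k))


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    # Introduce substitutions at roughly 2% of sites (toy data, not real genomes).
    mutant = "".join(
        rng.choice("ACGT".replace(c, "")) if rng.random() < 0.02 else c
        for c in genome
    )
    print("estimated evolutionary distance:", minhash_distance(genome, mutant))
```

The sketch size s trades accuracy for speed, and the k-mer length k must be large enough that random k-mer matches between unrelated sequences are rare; both values above are placeholders rather than recommendations.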