Faux Pierre, Geurts Pierre, Druet Tom
Unit of Animal Genomics, GIGA-R, Faculty of Veterinary Medicine, University of Liège, Liège, Belgium.
Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liège, Liège, Belgium.
Front Genet. 2019 Jun 27;10:562. doi: 10.3389/fgene.2019.00562. eCollection 2019.
Many genomic data analyses, such as phasing, genotype imputation, or local ancestry inference, share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, through a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes from a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was supplied with 30 features relevant to local haplotype matching. Repeated cross-validations allowed these features to be ranked by their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes, and it yielded average reliabilities similar to or slightly better than those obtained with IMPUTE2.
Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching, and we show that extra-trees are a promising approach for this purpose. The use of this new technique also provides some useful lessons on which features are relevant to haplotype matching. We also discuss potential improvements for routine implementation.
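As an illustration of the two segment-based features named in the abstract (the length of a segment shared between two haplotypes, and the distance to the edge of that segment), the following is a minimal sketch and not the authors' implementation. It assumes haplotypes are coded as lists of 0/1 alleles over the same markers, defines a shared segment as a maximal run of identical alleles, and measures distances in number of markers rather than in physical or genetic units. The function name `shared_segment_features` is hypothetical.

```python
from typing import List, Tuple


def shared_segment_features(
    target: List[int], reference: List[int]
) -> List[Tuple[int, int]]:
    """Return (segment_length, distance_to_edge) for every marker.

    Assumptions (not taken from the paper): a shared segment is a maximal
    run of markers where both haplotypes carry the same allele, and both
    values are counted in markers. Mismatching markers get (0, 0).
    """
    if len(target) != len(reference):
        raise ValueError("haplotypes must cover the same markers")

    n = len(target)
    features: List[Tuple[int, int]] = [(0, 0)] * n
    start = 0
    while start < n:
        if target[start] != reference[start]:
            start += 1  # mismatch: not part of any shared segment
            continue
        # Extend the shared segment as far as the alleles keep matching.
        end = start
        while end + 1 < n and target[end + 1] == reference[end + 1]:
            end += 1
        length = end - start + 1
        for pos in range(start, end + 1):
            # Distance to the nearest end of the segment (0 at an edge).
            features[pos] = (length, min(pos - start, end - pos))
        start = end + 1
    return features


# Toy haplotypes that differ only at the fifth marker (index 4).
target_hap = [0, 1, 1, 0, 1, 0, 0, 1]
reference_hap = [0, 1, 1, 0, 0, 0, 0, 1]
for marker, (length, edge) in enumerate(
    shared_segment_features(target_hap, reference_hap)
):
    print(marker, length, edge)
```

In a supervised setting such as the one described, per-marker values like these would be computed for each target/reference pair and passed to the classifier together with the other features. How the authors actually define and scale them is not stated in the abstract.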