
A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes.

Authors

Faux Pierre, Geurts Pierre, Druet Tom

Affiliations

Unit of Animal Genomics, GIGA-R, Faculty of Veterinary Medicine, University of Liège, Liège, Belgium.

Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liège, Liège, Belgium.

Publication information

Front Genet. 2019 Jun 27;10:562. doi: 10.3389/fgene.2019.00562. eCollection 2019.

DOI: 10.3389/fgene.2019.00562
PMID: 31316542
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6610336/
Abstract

Many genomic data analyses, such as phasing, genotype imputation, or local ancestry inference, share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed these features to be ranked by their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar to or slightly better than those of IMPUTE2.
Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation.
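
To make two of the features named in the abstract concrete, the sketch below computes the length of a segment shared by a target and a reference haplotype, and the distance from a marker to the nearest edge of that segment. It is a toy illustration and not the authors' implementation. It assumes haplotypes are lists of 0/1 alleles, that a shared segment is a run of exactly identical alleles, and that distances are counted in markers. The function names (`shared_segment`, `match_features`, `greedy_labels`, `mosaic`) are hypothetical, and `greedy_labels` is a simple stand-in for the trained extra-trees classifier described in the abstract.

```python
from typing import List, Sequence, Tuple


def shared_segment(target: Sequence[int], ref: Sequence[int], pos: int) -> Tuple[int, int]:
    """Return inclusive (start, end) marker indices of the maximal run of
    identical alleles around `pos`, or (-1, -1) if the alleles differ at `pos`."""
    if target[pos] != ref[pos]:
        return (-1, -1)
    start = pos
    while start > 0 and target[start - 1] == ref[start - 1]:
        start -= 1
    end = pos
    while end < len(target) - 1 and target[end + 1] == ref[end + 1]:
        end += 1
    return (start, end)


def match_features(target: Sequence[int], ref: Sequence[int], pos: int) -> Tuple[int, int]:
    """Two toy features, both in markers: (segment length, distance to nearest edge)."""
    start, end = shared_segment(target, ref, pos)
    if start < 0:
        return (0, 0)
    return (end - start + 1, min(pos - start, end - pos))


def greedy_labels(target: Sequence[int], refs: Sequence[Sequence[int]]) -> List[int]:
    """Assumed stand-in for a classifier: at each marker, pick the reference
    with the largest (segment length, edge distance) tuple; ties go to the
    lowest reference index."""
    return [
        max(range(len(refs)), key=lambda k: match_features(target, refs[k], pos))
        for pos in range(len(target))
    ]


def mosaic(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Collapse per-marker reference labels into (reference, start, end) pieces."""
    pieces: List[Tuple[int, int, int]] = []
    for i, ref_id in enumerate(labels):
        if pieces and pieces[-1][0] == ref_id:
            pieces[-1] = (ref_id, pieces[-1][1], i)  # extend the current piece
        else:
            pieces.append((ref_id, i, i))  # start a new piece
    return pieces


# Toy data: the target copies reference 0 over markers 0-3 and reference 1 afterwards.
refs = [
    [0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
]
target = [0, 0, 1, 1, 1, 0, 1, 1]

labels = greedy_labels(target, refs)
print(match_features(target, refs[1], 5))
print(labels)
print(mosaic(labels))
```

In this toy data both references carry the same allele at marker 3, so the breakpoint between pieces cannot be located exactly from allele identity alone. Cases like this are one reason a real method would combine many features rather than rely on a single heuristic.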


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/a84d8dc15331/fgene-10-00562-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/da1412ff64e3/fgene-10-00562-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/80701b4e918b/fgene-10-00562-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/a11ce192bf65/fgene-10-00562-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/6c878ed10b39/fgene-10-00562-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/237a/6610336/3e05be14b8a4/fgene-10-00562-g006.jpg

Similar articles

1
A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes.
Front Genet. 2019 Jun 27;10:562. doi: 10.3389/fgene.2019.00562. eCollection 2019.
2
A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels.
Genet Sel Evol. 2017 May 16;49(1):46. doi: 10.1186/s12711-017-0321-6.
3
A spatial haplotype copying model with applications to genotype imputation.
J Comput Biol. 2015 May;22(5):451-62. doi: 10.1089/cmb.2014.0151. Epub 2014 Dec 19.
4
Modeling coverage gaps in haplotype frequencies via Bayesian inference to improve stem cell donor selection.
Immunogenetics. 2018 May;70(5):279-292. doi: 10.1007/s00251-017-1040-4. Epub 2017 Nov 9.
5
Hap-seqX: expedite algorithm for haplotype phasing with imputation using sequence data.
Gene. 2013 Apr 10;518(1):2-6. doi: 10.1016/j.gene.2012.11.093. Epub 2012 Dec 23.
6
Hap-seq: an optimal algorithm for haplotype phasing with imputation using sequencing data.
J Comput Biol. 2013 Feb;20(2):80-92. doi: 10.1089/cmb.2012.0091.
7
Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data.
Am J Hum Genet. 2004 May;74(5):945-53. doi: 10.1086/420773. Epub 2004 Apr 7.
8
Minimal Positional Substring Cover: A Haplotype Threading Alternative to Li & Stephens Model.
bioRxiv. 2023 Jan 6:2023.01.04.522803. doi: 10.1101/2023.01.04.522803.
9
Haplotype inference using a Bayesian Hidden Markov model.
Genet Epidemiol. 2007 Dec;31(8):937-48. doi: 10.1002/gepi.20253.
10
Minimal positional substring cover is a haplotype threading alternative to Li and Stephens model.
Genome Res. 2023 Jul;33(7):1007-1014. doi: 10.1101/gr.277673.123. Epub 2023 Jun 14.

Cited by

1
Application of Genomic Big Data in Plant Breeding: Past, Present, and Future.
Plants (Basel). 2020 Oct 28;9(11):1454. doi: 10.3390/plants9111454.

References

1
A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels.
Genet Sel Evol. 2017 May 16;49(1):46. doi: 10.1186/s12711-017-0321-6.
2
NGS-based reverse genetic screen for common embryonic lethal mutations compromising fertility in livestock.
Genome Res. 2016 Oct;26(10):1333-1341. doi: 10.1101/gr.207076.116. Epub 2016 Sep 19.
3
Reconstruction of Genome Ancestry Blocks in Multiparental Populations.
Genetics. 2015 Aug;200(4):1073-87. doi: 10.1534/genetics.115.177873. Epub 2015 Jun 4.
4
Machine learning applications in genetics and genomics.
Nat Rev Genet. 2015 Jun;16(6):321-32. doi: 10.1038/nrg3920. Epub 2015 May 7.
5
Relatedness in the post-genomic era: is it still useful?
Nat Rev Genet. 2015 Jan;16(1):33-44. doi: 10.1038/nrg3821. Epub 2014 Nov 18.
6
A new approach for efficient genotype imputation using information from relatives.
BMC Genomics. 2014 Jun 17;15(1):478. doi: 10.1186/1471-2164-15-478.
7
RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference.
Am J Hum Genet. 2013 Aug 8;93(2):278-88. doi: 10.1016/j.ajhg.2013.06.020. Epub 2013 Aug 1.
8
Fast and accurate inference of local ancestry in Latino populations.
Bioinformatics. 2012 May 15;28(10):1359-67. doi: 10.1093/bioinformatics/bts144. Epub 2012 Apr 11.
9
Inference of population structure using dense haplotype data.
PLoS Genet. 2012 Jan;8(1):e1002453. doi: 10.1371/journal.pgen.1002453. Epub 2012 Jan 26.
10
A linear complexity phasing method for thousands of genomes.
Nat Methods. 2011 Dec 4;9(2):179-81. doi: 10.1038/nmeth.1785.