Quickly identifying identical and closely related subjects in large databases using genotype data.

Authors

Jin Yumi, Schäffer Alejandro A, Sherry Stephen T, Feolo Michael

Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America.

Publication information

PLoS One. 2017 Jun 13;12(6):e0179106. doi: 10.1371/journal.pone.0179106. eCollection 2017.

DOI:10.1371/journal.pone.0179106
PMID:28609482
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5469481/
Abstract

Genome-wide association studies (GWAS) usually rely on the assumption that different samples are not from closely related individuals. Detection of duplicates and close relatives becomes more difficult both statistically and computationally when one wants to combine datasets that may have been genotyped on different platforms. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. There are many duplicates and closely related individuals both within and across studies from different submitters. Relationships between studies cannot always be identified by the submitters of individual datasets. To aid in curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) to detect duplicates and closely related samples, even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements quick algorithms that enable it to find all of the duplicate pairs from more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics called All Genotype Mismatch Rate (AGMR) and Homozygous Genotype Mismatch Rate (HGMR) to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD) or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate the usage of GRAF on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.
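
The two mismatch metrics named in the abstract can be illustrated with a small Python sketch. The abstract does not give formulas, so the definitions below are assumptions based on the metric names: AGMR is taken as the fraction of SNPs with differing genotypes among the SNPs called in both samples, and HGMR as the fraction with differing genotypes among the SNPs where both samples are homozygous. The 0/1/2 genotype coding, the MISSING value, and the function name mismatch_rates are hypothetical and are not taken from the GRAF program.

```python
from typing import Optional, Sequence, Tuple

MISSING = -1  # hypothetical code for a missing genotype call


def mismatch_rates(
    a: Sequence[int], b: Sequence[int]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (AGMR, HGMR) for two samples genotyped at the same SNPs.

    Assumed coding: 0 and 2 are the two homozygous genotypes,
    1 is heterozygous, MISSING means no call.
    A rate is None when its denominator is zero.
    """
    if len(a) != len(b):
        raise ValueError("samples must cover the same SNP list")

    both_called = 0   # SNPs with a call in both samples
    any_mismatch = 0  # ... of those, SNPs whose genotypes differ
    both_hom = 0      # SNPs where both samples are homozygous
    hom_mismatch = 0  # ... of those, SNPs with opposite homozygotes

    for ga, gb in zip(a, b):
        if ga == MISSING or gb == MISSING:
            continue
        both_called += 1
        if ga != gb:
            any_mismatch += 1
        if ga != 1 and gb != 1:
            both_hom += 1
            if ga != gb:
                hom_mismatch += 1

    agmr = any_mismatch / both_called if both_called else None
    hgmr = hom_mismatch / both_hom if both_hom else None
    return agmr, hgmr


# Toy genotype vectors for two samples over eight SNPs (made-up data).
sample_a = [0, 1, 2, 2, 0, MISSING, 1, 0]
sample_b = [0, 1, 2, 0, 1, 2, 1, 0]
agmr, hgmr = mismatch_rates(sample_a, sample_b)
print(f"AGMR={agmr:.3f} HGMR={hgmr:.3f}")  # AGMR=0.286 HGMR=0.250
```

In this toy pair, 2 of the 7 SNPs called in both samples differ, and 1 of the 4 SNPs that are homozygous in both samples differs. A true duplicate pair would be expected to show both rates near zero, apart from genotyping errors.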
