Identification of genetic outliers due to sub-structure and cryptic relationships.

Affiliations

Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA 02115, USA.

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA.

Publication information

Bioinformatics. 2017 Jul 1;33(13):1972-1979. doi: 10.1093/bioinformatics/btx109.

DOI:10.1093/bioinformatics/btx109

PMID:28334167

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870703/

Abstract

MOTIVATION

In order to minimize the effects of genetic confounding on the analysis of high-throughput genetic association studies, e.g. (whole-genome) sequencing (WGS) studies, genome-wide association studies (GWAS), etc., we propose a general framework to assess and to test formally for genetic heterogeneity among study subjects. As the approach fully utilizes the recent ancestor information captured by rare variants, it is especially powerful in WGS studies. Even for relatively moderate sample sizes, the proposed testing framework is able to identify study subjects that are genetically too similar, e.g. cryptic relationships, or that are genetically too different, e.g. population substructure. The approach is computationally fast, enabling the application to whole-genome sequencing data, and straightforward to implement.
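The abstract does not state the test statistic, so the following is only a minimal sketch of the general idea of weighting shared alleles by their rarity. The function names `weighted_sharing` and `flag_pairs`, the inverse-probability weights, and the fixed cutoff are all assumptions made for illustration; they are not taken from the paper or from the stego package, whose statistic and formal test may differ.

```python
from itertools import combinations
from math import comb


def weighted_sharing(haplotypes):
    """Illustrative rare-variant-weighted sharing score for every pair.

    haplotypes: list of equal-length lists of 0/1 alleles (1 = variant allele).
    Returns a dict {(i, j): score} for i < j.
    NOTE: hypothetical weighting, not the published STEGO statistic.
    """
    n = len(haplotypes)
    n_variants = len(haplotypes[0])
    # Number of haplotypes carrying each variant allele.
    counts = [sum(h[k] for h in haplotypes) for k in range(n_variants)]
    # A variant can only be shared if at least two haplotypes carry it.
    informative = [k for k in range(n_variants) if counts[k] >= 2]
    total_pairs = comb(n, 2)
    # Weight = inverse of the chance that a random pair both carry the allele,
    # so rarer shared alleles count more and each variant averages to 1.
    weights = {k: total_pairs / comb(counts[k], 2) for k in informative}

    scores = {}
    for i, j in combinations(range(n), 2):
        shared = sum(
            weights[k]
            for k in informative
            if haplotypes[i][k] and haplotypes[j][k]
        )
        scores[(i, j)] = shared / len(informative) if informative else 0.0
    return scores


def flag_pairs(scores, cutoff):
    """Return pairs whose score exceeds a user-chosen (hypothetical) cutoff."""
    return sorted(pair for pair, s in scores.items() if s > cutoff)


if __name__ == "__main__":
    # Toy data: haplotypes 0 and 1 share a variant that nobody else carries.
    toy = [
        [1, 1, 0, 1],
        [1, 1, 0, 1],
        [0, 1, 1, 1],
        [0, 1, 0, 0],
    ]
    pair_scores = weighted_sharing(toy)
    for pair, score in sorted(pair_scores.items()):
        print(pair, round(score, 3))
    print("flagged:", flag_pairs(pair_scores, cutoff=2.0))
```

With these weights the scores average to 1 across all pairs, so a pair far above 1 shares more rare alleles than random assortment would suggest (possible cryptic relatedness), while groups of pairs well below 1 hint at substructure. A fixed cutoff stands in here for the formal test that the paper proposes.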

RESULTS

Simulation studies illustrate the overall performance of our approach. In an application to the 1000 Genomes Project, we outline an analysis/cleaning pipeline that utilizes our approach to formally assess whether study subjects are related and whether population substructure is present. In the analysis of the 1000 Genomes Project data, our approach revealed subjects that are most likely related, but had previously passed standard QC filters.

AVAILABILITY AND IMPLEMENTATION

An implementation of our method, Similarity Test for Estimating Genetic Outliers (STEGO), is available in the R package stego from GitHub at https://github.com/dschlauch/stego.

CONTACT

dschlauch@fas.harvard.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Identification of genetic outliers due to sub-structure and cryptic relationships.

Bioinformatics. 2017 Jul 1;33(13):1972-1979. doi: 10.1093/bioinformatics/btx109.

A generalized association test based on U statistics.

Bioinformatics. 2017 Jul 1;33(13):1963-1971. doi: 10.1093/bioinformatics/btx103.

SVPV: a structural variant prediction viewer for paired-end sequencing datasets.

Bioinformatics. 2017 Jul 1;33(13):2032-2033. doi: 10.1093/bioinformatics/btx117.

Very low-depth whole-genome sequencing in complex trait association studies.

Bioinformatics. 2019 Aug 1;35(15):2555-2561. doi: 10.1093/bioinformatics/bty1032.

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

Bioinformatics. 2016 May 1;32(9):1366-72. doi: 10.1093/bioinformatics/btv752. Epub 2015 Dec 31.

SeqArray-a storage-efficient high-performance data format for WGS variant calls.

Bioinformatics. 2017 Aug 1;33(15):2251-2257. doi: 10.1093/bioinformatics/btx145.

svtools: population-scale analysis of structural variation.

Bioinformatics. 2019 Nov 1;35(22):4782-4787. doi: 10.1093/bioinformatics/btz492.

Genome U-Plot: a whole genome visualization.

Bioinformatics. 2018 May 15;34(10):1629-1634. doi: 10.1093/bioinformatics/btx829.

SV2: accurate structural variation genotyping and de novo mutation detection from whole genomes.

Bioinformatics. 2018 May 15;34(10):1774-1777. doi: 10.1093/bioinformatics/btx813.

Phylotyper: in silico predictor of gene subtypes.

Bioinformatics. 2017 Nov 15;33(22):3638-3641. doi: 10.1093/bioinformatics/btx459.

Cited by

Fast computation of the eigensystem of genomic similarity matrices.

BMC Bioinformatics. 2024 Jan 25;25(1):43. doi: 10.1186/s12859-024-05650-8.

Limitations of principal components in quantitative genetic association models for human studies.

Elife. 2023 May 4;12:e79238. doi: 10.7554/eLife.79238.

A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets.

Brief Bioinform. 2023 Jan 19;24(1). doi: 10.1093/bib/bbac611.

Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest.

BMC Bioinformatics. 2022 Dec 19;23(1):547. doi: 10.1186/s12859-022-05105-y.

Genome-wide association analysis of COVID-19 mortality risk in SARS-CoV-2 genomes identifies mutation in the SARS-CoV-2 spike protein that colocalizes with P.1 of the Brazilian strain.

Genet Epidemiol. 2021 Oct;45(7):685-693. doi: 10.1002/gepi.22421. Epub 2021 Jun 22.

Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus.

Genet Epidemiol. 2021 Apr;45(3):316-323. doi: 10.1002/gepi.22373. Epub 2021 Jan 8.

locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies.

Genet Epidemiol. 2021 Feb;45(1):82-98. doi: 10.1002/gepi.22356. Epub 2020 Sep 14.

Effect of population stratification on SNP-by-environment interaction.

Genet Epidemiol. 2019 Dec;43(8):1046-1055. doi: 10.1002/gepi.22250. Epub 2019 Aug 20.

References

Atlas of Cryptic Genetic Relatedness Among 1000 Human Genomes.

Genome Biol Evol. 2016 Feb 23;8(3):777-90. doi: 10.1093/gbe/evw034.

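Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

Bioinformatics. 2016 May 1;32(9):1366-72. doi: 10.1093/bioinformatics/btv752. Epub 2015 Dec 31.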

High level of inbreeding in final phase of 1000 Genomes Project.

Sci Rep. 2015 Dec 2;5:17453. doi: 10.1038/srep17453.

A global reference for human genetic variation.

Nature. 2015 Oct 1;526(7571):68-74. doi: 10.1038/nature15393.

Inference of distant genetic relations in humans using "1000 genomes".

Genome Biol Evol. 2015 Jan 7;7(2):481-92. doi: 10.1093/gbe/evv003.

Improved ancestry inference using weights from external reference panels.

Bioinformatics. 2013 Jun 1;29(11):1399-406. doi: 10.1093/bioinformatics/btt144. Epub 2013 Mar 28.

An integrated map of genetic variation from 1,092 human genomes.

Nature. 2012 Nov 1;491(7422):56-65. doi: 10.1038/nature11632.

Estimating kinship in admixed populations.

Am J Hum Genet. 2012 Jul 13;91(1):122-38. doi: 10.1016/j.ajhg.2012.05.024. Epub 2012 Jun 28.

Improved linear mixed models for genome-wide association studies.

Nat Methods. 2012 May 30;9(6):525-6. doi: 10.1038/nmeth.2037.

Differential confounding of rare and common variants in spatially structured populations.

Nat Genet. 2012 Feb 5;44(3):243-6. doi: 10.1038/ng.1074.
