Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive.

Author information

Tsui Brian, Dow Michelle, Skola Dylan, Carter Hannah

Affiliation

Department of Medicine, University of California San Diego, 9500 Gilman, San Diego, California 92093, USA.

Publication information

Pac Symp Biocomput. 2019;24:196-207.

PMID: 30864322

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6415672/

Abstract

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP. Germline variants were detected in a median of 912 sequencing runs and somatic variants in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.
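
To make the notion of an allelic read count concrete, the following is a minimal Python sketch and not the authors' pipeline. It assumes reads are already aligned without gaps and are given as (start, sequence) pairs in 0-based reference coordinates. The function names, the `min_alt_reads` threshold, the toy reads, and the variant identifiers are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import median


def allelic_counts(reads, pos):
    """Count the bases observed at 0-based reference position `pos`.

    `reads` is an iterable of (start, sequence) pairs, where `start` is the
    0-based reference coordinate of the read's first base. Alignments are
    assumed to be gapless (no indels or clipping), which is a simplification.
    """
    counts = Counter()
    for start, seq in reads:
        offset = pos - start
        if 0 <= offset < len(seq):  # the read covers the position
            counts[seq[offset]] += 1
    return counts


def is_detected(counts, alt, min_alt_reads=2):
    """Call a variant 'detected' in a run if enough reads carry the alt allele.

    The threshold of 2 reads is an arbitrary illustrative choice.
    """
    return counts[alt] >= min_alt_reads


# Toy reads from one hypothetical sequencing run.
reads = [
    (0, "ACGTAC"),  # covers position 4 with 'A'
    (2, "GTTCGA"),  # covers position 4 with 'T'
    (3, "TACG"),    # covers position 4 with 'A'
    (4, "TCGA"),    # covers position 4 with 'T'
    (5, "CGA"),     # starts after position 4, so it is skipped
]
counts = allelic_counts(reads, pos=4)
detected = is_detected(counts, alt="T")

# Summarising across variants: number of runs in which each (hypothetical)
# variant was detected, and the median of those numbers.
runs_per_variant = {"variant_1": 3, "variant_2": 5, "variant_3": 10}
median_runs = median(runs_per_variant.values())
```

Here `counts` holds the reference and alternative read counts for a single variant in a single run. Repeating the count over many runs and variants would give a table from which summaries such as the median number of runs per variant can be computed.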
