Statistical method to compare massive parallel sequencing pipelines.

Author information

Elsensohn M H, Leblay N, Dimassi S, Campan-Fournier A, Labalme A, Roucher-Boulez F, Sanlaville D, Lesca G, Bardel C, Roy P

Affiliations

Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, 162 avenue Lacassagne, F-69003, Lyon, France.

Université de Lyon, Lyon, France.

Publication information

BMC Bioinformatics. 2017 Mar 1;18(1):139. doi: 10.1186/s12859-017-1552-9.

DOI:10.1186/s12859-017-1552-9

PMID:28249565

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5333416/

Abstract

BACKGROUND

Today, sequencing is frequently carried out by Massive Parallel Sequencing (MPS), which drastically cuts sequencing time and expense. Nevertheless, Sanger sequencing remains the main validation method to confirm the presence of variants. The analysis of MPS data has involved the development of several bioinformatic tools, academic or commercial. We present here a statistical method to compare MPS pipelines and test it in a comparison between an academic pipeline (BWA-GATK) and a commercial pipeline (TMAP-NextGENe®), with and without reference to a gold standard (here, Sanger sequencing), on a panel of 41 genes in 43 epileptic patients. The method used the number of variants to fit log-linear models for pairwise agreements between pipelines. To assess the heterogeneity of the margins and of the odds ratios of agreement, four log-linear models were used: a full model, a homogeneous-margin model, a model with a single odds ratio for all patients, and a model with a single intercept. A log-linear mixed model was then fitted, with biological variability treated as a random effect.
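As a rough illustration of the pairwise-agreement idea, the sketch below cross-tabulates the calls of two pipelines for one patient into a 2x2 table and computes the log odds ratio of agreement, which is the interaction term of a saturated log-linear model on that table. It is not the authors' implementation: the function names, the representation of calls as sets of positions, and the optional 0.5 continuity correction are all assumptions made for the example, and it does not fit the four models or the mixed model described above.

```python
import math


def agreement_table(calls_a, calls_b, positions):
    """Cross-tabulate two pipelines' variant calls for one patient.

    calls_a, calls_b: sets of positions called variant by each pipeline
    (assumed to be subsets of `positions`, the set of sequenced positions).
    Returns (both, only_a, only_b, neither).
    """
    both = len(calls_a & calls_b)
    only_a = len(calls_a - calls_b)
    only_b = len(calls_b - calls_a)
    neither = len(positions) - both - only_a - only_b
    return both, only_a, only_b, neither


def log_odds_ratio(both, only_a, only_b, neither, correction=0.5):
    """Log odds ratio of agreement for a 2x2 table.

    `correction` is added to every cell (Haldane-Anscombe correction) so that
    an empty cell does not cause a division by zero; pass 0 for the raw value.
    """
    c = correction
    return math.log(((both + c) * (neither + c)) / ((only_a + c) * (only_b + c)))


# Toy data, invented for illustration only.
positions = set(range(1000))
pipeline_a = {1, 2, 3, 4, 5}
pipeline_b = {3, 4, 5, 6}

table = agreement_table(pipeline_a, pipeline_b, positions)
lor = log_odds_ratio(*table, correction=0)
```

A large positive log odds ratio means the two pipelines tend to call the same positions. Comparing this quantity across patients is what the models with a patient-specific versus a single odds ratio are meant to formalize.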

RESULTS

Among the 390,339 base-pairs sequenced, TMAP-NextGENe® and BWA-GATK found, on average, 2253.49 and 1857.14 variants (single nucleotide variants and indels), respectively. Against the gold standard, the pipelines had similar sensitivities (63.47% vs. 63.42%) and close but significantly different specificities (99.57% vs. 99.65%; p < 0.001). Similar trends were observed when only single nucleotide variants were considered (99.98% specificity and 76.81% sensitivity for both pipelines).
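For readers less familiar with these measures, the following minimal sketch shows how sensitivity and specificity against a gold standard are computed from a confusion table. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the study.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from confusion-table counts.

    tp: variants found by both the pipeline and the gold standard
    fp: variants called by the pipeline but absent from the gold standard
    fn: gold-standard variants missed by the pipeline
    tn: positions called non-variant by both
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=80, fp=5, fn=20, tn=995)
```

Because non-variant positions vastly outnumber variants in sequencing data, specificity is typically very close to 100%, so small absolute differences can still be statistically significant.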

CONCLUSIONS

The method thus allows pipeline comparison and selection. It is generalizable to all types of MPS data and all pipelines.
