Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America.
Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, United States of America.
PLoS One. 2023 Jan 25;18(1):e0280951. doi: 10.1371/journal.pone.0280951. eCollection 2023.
The use of publicly available sequencing datasets as controls (hereafter, "public controls") in studies of rare-variant disease associations has great promise but can increase the risk of false-positive discovery. The specific factors that could inflate the distribution of test statistics have not been systematically examined. Here, we leveraged both public controls (gnomAD v2.1) and several datasets sequenced in our laboratory to systematically investigate factors that could contribute to false-positive discovery, as measured by λΔ95, a metric that quantifies the degree of inflation in statistical significance. Analyses of the datasets in this investigation found that 1) the significant inflation of the test-statistic distribution decreased substantially when the same variant caller and filtering pipelines were employed, 2) differences in library prep kits and sequencers did not affect the false-positive discovery rate, and 3) joint vs. separate variant calling of cases and controls did not contribute to the inflation of test statistics. Currently available methods do not adequately adjust for this high rate of false-positive discovery. These results, especially if replicated, emphasize the risks of using public controls for rare-variant association tests when individual-level data and the computational pipeline are not readily accessible, because this prevents the use of the same variant-calling and filtering pipelines on both cases and controls. A plausible solution exists with the emergence of cloud-based computing, which makes it possible to bring containerized analytical pipelines to the data (rather than the data to the pipeline) and could avert or minimize these issues. We suggest that future reports account for this issue and state it as a limitation when reporting new findings from studies that cannot practically analyze all data with a single pipeline.
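The abstract does not define λΔ95, so the sketch below does not attempt to reproduce it. As a rough illustration of what "inflation of test statistics" means, it computes the conventional genomic inflation factor (often written λ_GC): the median observed 1-df chi-square statistic divided by its expected median under the null. The function names and the assumption that the inputs are two-sided p-values strictly between 0 and 1 are ours, not the authors'.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def chi2_1df_from_p(p):
    """Convert a two-sided p-value into the matching 1-df chi-square statistic.

    A 1-df chi-square variable is the square of a standard normal variable,
    so the statistic is the squared normal quantile at 1 - p/2.
    Assumes 0 < p <= 1; p == 0 would raise statistics.StatisticsError.
    """
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation(p_values):
    """Conventional lambda_GC (NOT the paper's lambda-delta-95).

    Ratio of the observed median chi-square statistic to the median
    expected under the null (about 0.455 for 1 df). Values near 1
    suggest no inflation; values above 1 suggest inflated statistics.
    """
    observed_median = median(chi2_1df_from_p(p) for p in p_values)
    expected_median = chi2_1df_from_p(0.5)
    return observed_median / expected_median


if __name__ == "__main__":
    n = 1001
    # Evenly spaced p-values mimic a well-calibrated (null) test.
    calibrated = [(i - 0.5) / n for i in range(1, n + 1)]
    # Squaring pushes p-values toward 0, mimicking systematic inflation.
    inflated = [p * p for p in calibrated]
    print("calibrated:", genomic_inflation(calibrated))
    print("inflated:  ", genomic_inflation(inflated))
```

In a case-control setting, the p-values would come from per-gene or per-variant association tests. A ratio well above 1 in a comparison where no true associations are expected is the kind of signal that the pipeline-mismatch findings above describe.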