Vanderbilt Ingram Cancer Center, Nashville, TN, USA.
BMC Genomics. 2012 Nov 24;13:666. doi: 10.1186/1471-2164-13-666.
With Illumina high-throughput short-read data, the genotypes inferred from the positive and negative strands sometimes differ significantly, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality and to explore its possible causes.
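To make the definition above concrete, the following is a minimal Python sketch of how one might flag a site where the two strands disagree in the way described (one strand looks homozygous, the other heterozygous). It is not the method used in the study: the function names, the per-strand reference/alternate read counts used as input, and the allele-fraction cutoffs of 0.2 and 0.8 are all illustrative assumptions.

```python
def call_genotype(ref_count, alt_count, het_low=0.2, het_high=0.8):
    """Naive genotype call from the allele counts seen on ONE strand.

    The 0.2 / 0.8 allele-fraction cutoffs are hypothetical values chosen
    for illustration; they are not taken from the study.
    """
    total = ref_count + alt_count
    if total == 0:
        return None  # no coverage on this strand, so no call
    alt_fraction = alt_count / total
    if alt_fraction < het_low:
        return "hom_ref"
    if alt_fraction > het_high:
        return "hom_alt"
    return "het"


def has_strand_discordance(fwd_ref, fwd_alt, rev_ref, rev_alt):
    """Return True when exactly one strand is called heterozygous.

    This mirrors the description of strand bias as one strand supporting
    a homozygous genotype and the other a heterozygous genotype.
    """
    fwd = call_genotype(fwd_ref, fwd_alt)
    rev = call_genotype(rev_ref, rev_alt)
    if fwd is None or rev is None:
        return False  # cannot compare strands without coverage on both
    return (fwd == "het") != (rev == "het")


if __name__ == "__main__":
    # Forward strand: 10 ref / 10 alt reads; reverse strand: 20 ref / 0 alt.
    print(has_strand_discordance(10, 10, 20, 0))
```

In practice, variant callers usually summarize the same four counts with a statistical test or score rather than fixed cutoffs; the threshold approach here is only meant to illustrate the definition.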
We collected 22 breast cancer samples from 22 patients and sequenced their exomes on the Illumina GAIIx. Comparing the genotypes inferred from these sequencing data with those inferred from SNP chip data, we found that SNPs with extreme strand bias did not have significantly lower consistency rates than SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition/transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias had better overall quality. We also found that strand bias occurs at random genomic positions, with no consistent pattern of strand bias location across samples. Results from two different aligners, BWA and Bowtie, showed very consistent strand bias patterns; thus, strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capture methods and Illumina sequencers.
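The transition/transversion (Ti/Tv) ratio used above as a quality measure can be computed from a list of reference/alternate allele pairs. The sketch below is a generic illustration, not the pipeline used in the study; the function names and the input format (a list of single-base `(ref, alt)` tuples) are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps):
    """Compute the Ti/Tv ratio for an iterable of (ref, alt) single-base pairs.

    Pairs with identical or non-ACGT alleles are skipped. Returns
    float('inf') when there are transitions but no transversions,
    and 0.0 when there are no usable SNPs at all.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
            continue  # not a usable single-nucleotide substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("inf") if transitions else 0.0
    return transitions / transversions


if __name__ == "__main__":
    # Two transitions (A>G, C>T) and one transversion (A>C).
    print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

Computing this ratio separately for the low-bias and extreme-bias SNP sets gives the kind of comparison described above; a lower ratio in a call set is commonly read as a sign of more false positives.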
Extreme strand bias indicates a potential high false-positive rate for SNPs.