Robiou-du-Pont Sébastien, Li Aihua, Christie Shanice, Sohani Zahra N, Meyre David
Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada.
Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada; Population Health Research Institute, McMaster University and Hamilton Health Sciences, Hamilton General Hospital, Hamilton, Ontario, Canada.
PLoS One. 2015 Mar 5;10(3):e0118925. doi: 10.1371/journal.pone.0118925. eCollection 2015.
Bioinformatics tools have gained popularity in biology, but little is known about their validity. We aimed to assess the early contribution of 415 single nucleotide polymorphisms (SNPs) associated with eight cardio-metabolic traits at the genome-wide significance level in adults in the Family Atherosclerosis Monitoring In earLY Life (FAMILY) birth cohort. We used the popular web-based tool SNAP to assess the availability of the 415 SNPs in the Illumina Cardio-Metabochip genotyped in the FAMILY study participants. We then compared the SNAP output with the Cardio-Metabochip file provided by Illumina, matching SNPs by chromosome and chromosomal position from the NCBI Human Genome Browser (Genome Reference Consortium Human Build 37). With the HapMap 3 release 2 reference, the SNAP output reported 201 of the 415 SNPs as missing from the Cardio-Metabochip. However, the Cardio-Metabochip file revealed that 152 of these 201 SNPs were in fact present in the Cardio-Metabochip array (false-negative rate of 36.6%, i.e. 152 of the 415 SNPs). With the more recent 1000 Genomes Project release, we found a false-negative rate of 17.6% by comparing the outputs of SNAP and the Illumina product file. We did not find any 'false-positive' SNPs (SNPs specified as available in the Cardio-Metabochip by SNAP, but not by the Cardio-Metabochip Illumina file). Cohen's kappa coefficient, which measures the agreement between the two methods beyond that expected by chance, indicated that the validity of SNAP was fair to moderate depending on the reference used (HapMap 3 or 1000 Genomes). In conclusion, we demonstrate that the SNAP outputs for the Cardio-Metabochip are invalid. This study illustrates the importance of systematically and independently assessing the validity of bioinformatics tools. We propose a series of guidelines to improve practices in the fast-moving field of bioinformatics software implementation.
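The agreement statistics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The 2x2 counts are reconstructed from the HapMap 3 figures in the abstract, on the assumption that SNAP classified every one of the 415 SNPs as either available or missing: 214 reported available (all truly on the array, since no false positives were found), 152 reported missing but present, and 49 reported missing and truly absent. The function name `cohen_kappa` and the variable names are illustrative only.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for two binary classifications summarised as a 2x2 table.

    tp: both methods say "present"; tn: both say "absent";
    fp: tool says "present", reference says "absent";
    fn: tool says "absent", reference says "present".
    """
    n = tp + fp + fn + tn
    # Observed agreement: share of items on which the two methods agree.
    observed = (tp + tn) / n
    # Agreement expected by chance, from each method's marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


# Counts reconstructed from the abstract (HapMap 3 reference); an assumption,
# not a table taken from the paper.
tp, fp, fn, tn = 214, 0, 152, 49
n = tp + fp + fn + tn

# Share of all queried SNPs that SNAP wrongly reported as missing.
# This matches the 36.6% quoted in the abstract.
fn_share_of_all = fn / n

# Conventional false-negative rate: misses among SNPs truly on the array.
fn_rate_conventional = fn / (fn + tp)

kappa = cohen_kappa(tp, fp, fn, tn)

print(f"false negatives / all SNPs:     {fn_share_of_all:.1%}")
print(f"false negatives / SNPs on chip: {fn_rate_conventional:.1%}")
print(f"Cohen's kappa:                  {kappa:.3f}")
```

Under these reconstructed counts, the kappa falls in the range usually labelled "fair" on the Landis and Koch scale, consistent with the abstract's description for the HapMap 3 reference. The 1000 Genomes counts are not given in the abstract, so they are not reproduced here.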