School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia.
Bioinformatics. 2010 Dec 15;26(24):3129-30. doi: 10.1093/bioinformatics/btq604. Epub 2010 Oct 29.
Immunoglobulin heavy chain genes are formed by recombination of genes randomly selected from sets of IGHV, IGHD and IGHJ genes. Utilities have been developed to identify genes that contribute to observed VDJ rearrangements, but in the absence of datasets of known rearrangements, the evaluation of these utilities is problematic. We have analyzed thousands of VDJ rearrangements from an individual (S22) whose IGHV, IGHD and IGHJ genotype can be inferred from the dataset. Knowledge of this genotype means that the Stanford_S22 dataset can serve to benchmark the performance of IGH alignment utilities.
We evaluated the performance of seven utilities. Failure to partition a sequence into genes present in the S22 genome was considered an error, and error rates for different utilities ranged from 7.1% to 13.7%.
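The error criterion above can be illustrated with a minimal sketch. It assumes each utility's output has already been reduced to one (V, D, J) allele call per sequence and that the inferred genotype is available as a set of allele names. The names `Call`, `is_error` and `error_rate`, the toy genotype and the toy calls are all hypothetical; they are not taken from the S22 supplementary data or from the published evaluation tool. Counting a missing segment call as an error is likewise an assumption.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Call(NamedTuple):
    """One utility's partition of a rearrangement (hypothetical structure)."""
    v: Optional[str]
    d: Optional[str]
    j: Optional[str]


def is_error(call: Call, genotype: Set[str]) -> bool:
    # Assumption: a call is an error if any segment is unassigned (None)
    # or names an allele that is absent from the inferred genotype.
    return any(gene is None or gene not in genotype for gene in call)


def error_rate(calls: Iterable[Call], genotype: Set[str]) -> float:
    # Fraction of sequences whose partition includes a non-genotype allele.
    calls = list(calls)
    if not calls:
        raise ValueError("no calls to evaluate")
    return sum(is_error(c, genotype) for c in calls) / len(calls)


# Toy genotype and calls for illustration only; NOT the real S22 genotype.
toy_genotype = {
    "IGHV3-23*01", "IGHD3-10*01", "IGHD2-2*01", "IGHJ4*02", "IGHJ6*02",
}
toy_calls = [
    Call("IGHV3-23*01", "IGHD3-10*01", "IGHJ4*02"),  # all alleles in genotype
    Call("IGHV1-69*06", "IGHD3-10*01", "IGHJ4*02"),  # V allele not in genotype
    Call("IGHV3-23*01", "IGHD2-2*01", "IGHJ6*02"),   # all alleles in genotype
]

print(f"{error_rate(toy_calls, toy_genotype):.1%}")  # prints 33.3%
```

Under this criterion a call is scored against the genotype rather than against a known true rearrangement, so a utility that picks the wrong allele from within the genotype is not counted as an error.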
Supplementary data include the S22 genotypes and alignments. The Stanford_S22 dataset and an evaluation tool are available at http://www.emi.unsw.edu.au/~ihmmune/IGHUtilityEval/.