Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.
BMC Bioinformatics. 2010 Dec 14;11 Suppl 11(Suppl 11):S10. doi: 10.1186/1471-2105-11-S11-S10.
The human leukocyte antigen (HLA) system contains many highly variable genes. HLA genes play an important role in the human immune system, and HLA gene matching is crucial for the success of human organ transplantation. Numerous studies have demonstrated that variation in HLA genes is associated with many autoimmune, inflammatory and infectious diseases. However, typing HLA genes by serology or PCR is time-consuming and expensive, which limits large-scale studies involving HLA genes. Since single nucleotide polymorphism (SNP) genotype data are much easier and cheaper to obtain, accurate computational algorithms that infer HLA gene types from SNP genotype data are needed. To infer HLA types from SNP genotypes, the first step is to infer SNP haplotypes from genotypes. However, for the same SNP genotype data set, the haplotype configurations inferred by different methods are usually inconsistent, and it is often difficult to decide which one is true.
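The ambiguity of haplotype inference can be seen with a small enumeration. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical genotype coding in which 0 and 2 mark the two homozygous states and 1 marks a heterozygous site, and the function name `compatible_haplotype_pairs` is illustrative only. A genotype with k heterozygous sites (k ≥ 1) is consistent with 2^(k-1) distinct unordered haplotype pairs, which is why different phasing methods can return different configurations for the same data.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with one genotype.

    Assumed coding (hypothetical, for illustration only):
      0 = homozygous for allele 0, 2 = homozygous for allele 1,
      1 = heterozygous (one copy of each allele).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes: 0 -> 0, 2 -> 1.
    # Heterozygous sites get a placeholder 0 that is overwritten below.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b
        # frozenset makes (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


if __name__ == "__main__":
    # Three heterozygous sites give 2**(3 - 1) = 4 candidate phasings.
    for pair in compatible_haplotype_pairs([1, 1, 1, 0, 2]):
        print(sorted(pair))
```

Pedigree data reduce this ambiguity because a child's haplotypes must be inherited from its parents, but some sites can remain unresolved, which is the situation the abstract refers to.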
In this paper, we design an accurate HLA gene type inference algorithm that uses SNP genotype data from pedigrees, the known HLA gene types of some individuals, and the relationship between inferred SNP haplotypes and HLA gene types. Given a set of haplotypes inferred from the genotypes of a population consisting of many pedigrees, the algorithm first constructs a weighted similarity graph based on a new haplotype similarity measure and derives constraint edges from the known HLA gene types. Based on the principle that different HLA gene alleles should have different background haplotypes, the algorithm searches for an optimal labeling of all haplotypes with unknown HLA gene types such that the total edge weight among haplotypes assigned the same HLA gene type is maximized. To deal with ambiguous haplotype solutions, we use a genetic algorithm to select haplotype configurations that tend to maximize the same optimization criterion. Our experiments on a previously typed subset of the HapMap data show that the algorithm is highly accurate, achieving an accuracy of 96% for HLA-A, 95% for HLA-B, 97% for HLA-C, 84% for HLA-DRB1, 98% for HLA-DQA1 and 97% for HLA-DQB1 in a leave-one-out test.
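The labeling idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' algorithm: the abstract does not define the similarity measure, so `similarity` below is a placeholder (fraction of matching SNP alleles); a simple greedy assignment stands in for the optimal labeling search; known HLA types are simply held fixed rather than encoded as constraint edges; and the allele labels "A1" and "A2" are made up.

```python
def similarity(h1, h2):
    """Placeholder edge weight: fraction of SNP sites where alleles agree.

    Assumption: the paper's actual similarity measure is not given here.
    """
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


def label_haplotypes(haplotypes, known):
    """Greedily label haplotypes to raise the within-label total weight.

    haplotypes: list of equal-length allele tuples.
    known: dict mapping haplotype index -> known HLA label (kept fixed).
    Returns a dict mapping every haplotype index to a label.
    """
    if not known:
        raise ValueError("at least one haplotype needs a known label")
    labels = dict(known)
    unlabeled = [i for i in range(len(haplotypes)) if i not in labels]
    while unlabeled:
        best = None  # (score, haplotype index, label)
        for i in unlabeled:
            # Total weight from haplotype i to each already-labeled group.
            scores = {}
            for j, lab in labels.items():
                w = similarity(haplotypes[i], haplotypes[j])
                scores[lab] = scores.get(lab, 0.0) + w
            lab, score = max(scores.items(), key=lambda kv: kv[1])
            if best is None or score > best[0]:
                best = (score, i, lab)
        # Commit the single most confident assignment, then repeat.
        _, i, lab = best
        labels[i] = lab
        unlabeled.remove(i)
    return labels


if __name__ == "__main__":
    haps = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 0, 1), (1, 1, 1, 0)]
    print(label_haplotypes(haps, {0: "A1", 1: "A2"}))
```

In this toy input, haplotypes 2 and 3 each differ from one typed haplotype at a single site, so each takes the label of its nearest typed neighbor. A greedy pass like this can get stuck in a poor labeling on real data, which is one reason a global optimization criterion is preferable.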
Our algorithm accurately infers HLA gene types from neighboring SNP genotype data. Compared with a recent approach on the same input data, it achieved higher accuracy. The code is freely available on request from the corresponding authors.