Bai Yu, Ni Min, Cooper Blerta, Wei Yi, Fury Wen
Regeneron Pharmaceuticals, Inc., Tarrytown, New York, USA.
BMC Genomics. 2014 May 1;15(1):325. doi: 10.1186/1471-2164-15-325.
Accurate HLA typing at the amino acid level (four-digit resolution) is critical in hematopoietic and organ transplantation, in pathogenesis studies of autoimmune and infectious diseases, and in the development of immuno-oncology therapies. With the rapid adoption of genome-wide sequencing in biomedical research, HLA typing based on transcriptome and whole exome/genome sequencing data is becoming increasingly attractive because of its high throughput and convenience. However, unlike targeted amplicon sequencing, genome-wide sequencing often yields shorter reads and lower coverage, which makes it very challenging to resolve the highly homologous HLA alleles. Although several algorithms exist and have been applied to four-digit typing, some deliver only low to moderate accuracy and others output ambiguous predictions. Moreover, few methods accommodate diverse read lengths and depths together with both RNA and DNA sequencing inputs. New algorithms are therefore needed to improve the accuracy and flexibility of high-resolution HLA typing from genome-wide sequencing data.
We have developed a new algorithm, PHLAT, that identifies the most probable pair of HLA alleles at four-digit resolution or higher by integrating candidate allele selection with likelihood scoring. Across a comprehensive set of benchmarking data (768 HLA alleles in total) drawn from both RNA and DNA sequencing and spanning a broad range of read lengths and coverage, PHLAT consistently achieves high accuracy at four-digit (92%-95%) and two-digit (96%-99%) resolution, outperforming most existing methods. It also supports targeted amplicon sequencing data from the Illumina MiSeq.
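To make the idea of scoring allele pairs by likelihood concrete, the following is a minimal sketch and not PHLAT's actual implementation. It assumes candidate selection has already produced a short list of alleles, that each read base comes from either allele of the pair with equal probability, and that bases are miscalled at a fixed rate. The function names, the error model, and the four-base toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations_with_replacement


def pair_log_likelihood(allele_a, allele_b, observations, error_rate=0.01):
    """Log-likelihood of observed bases under a diploid pair of alleles.

    Assumed toy model (not taken from the paper): each observed base is drawn
    from either allele with probability 1/2 and is miscalled with probability
    error_rate, uniformly to one of the three other bases.
    """
    total = 0.0
    for pos, base in observations:
        p = 0.0
        for allele in (allele_a, allele_b):
            if allele[pos] == base:
                p += 0.5 * (1.0 - error_rate)
            else:
                p += 0.5 * (error_rate / 3.0)
        total += math.log(p)
    return total


def best_pair(candidates, observations, error_rate=0.01):
    """Return ((name1, name2), score) for the highest-scoring allele pair.

    Homozygous pairs are included by enumerating combinations with replacement.
    """
    best = None
    items = sorted(candidates.items())
    for (n1, s1), (n2, s2) in combinations_with_replacement(items, 2):
        score = pair_log_likelihood(s1, s2, observations, error_rate)
        if best is None or score > best[1]:
            best = ((n1, n2), score)
    return best


# Hypothetical candidates: the sequences are toy strings, not real HLA sequences.
candidates = {
    "A*01:01": "ACGT",
    "A*02:01": "ATGT",
    "A*03:01": "ACGA",
}

# Toy pileup as (position, base): position 1 is split between C and T,
# position 3 is consistently T.
observations = [(1, "C")] * 5 + [(1, "T")] * 5 + [(3, "T")] * 10

pair, score = best_pair(candidates, observations)
print(pair)
```

In this toy input the mixed signal at position 1 can only be explained well by a heterozygous pair, and the uniform T at position 3 penalizes any pair containing the third candidate, so the first two candidates score highest. A real typer would also need base qualities, read-level phasing, and a principled candidate filter, none of which are modeled here.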
PHLAT significantly improves the accuracy and flexibility of high-resolution HLA typing based on genome-wide sequencing data. It may benefit basic and applied research in immunology and related fields, as well as numerous clinical applications.