Cheng Chun-Pei, Lan Kuo-Lun, Liu Wen-Chun, Chang Ting-Tsung, Tseng Vincent S
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan.
Department of Internal Medicine, National Cheng Kung University Medical College and Hospital, Tainan 701, Taiwan.
Methods. 2016 Dec 1;111:56-63. doi: 10.1016/j.ymeth.2016.07.020. Epub 2016 Jul 30.
Hepatitis B virus (HBV) infection is strongly associated with an increased risk of liver diseases such as cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions in HBV genomic DNA are closely associated with HBV activity through the interplay between the release of aberrant viral proteins and the human immune system. Finding deletions in HBV whole-genome sequences is therefore an important problem, although mining such large and complex biological data is challenging. Several next-generation sequencing (NGS) tools have recently been designed to identify structural variations such as insertions and deletions, but they have generally been validated on human sequences, and this design may not be suitable for viruses because of differences between species. We propose a graphics processing unit (GPU)-based data mining method, DeF-GPU, to efficiently and precisely identify HBV deletions from large NGS datasets, which typically contain millions of reads. To fit the single instruction, multiple data (SIMD) model, the sequencing reads are treated as the multiple data and the deletion-finding procedure as the single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedure and validate DeF-GPU on five synthetic datasets and one real dataset. Our results suggest that DeF-GPU outperforms the commonly used method Pindel and identifies exactly the deletions in our ground truth within a few seconds. The source code and related materials are available at https://sourceforge.net/projects/defgpu/.
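The mapping of reads to "multiple data" and deletion finding to a "single instruction" can be illustrated with a minimal CPU sketch. This is not the DeF-GPU algorithm, which the abstract does not describe. The per-read routine below is an assumed, simplified split-read search with exact matching only, and the names `find_deletion`, `call_deletions`, `min_anchor`, and `min_support`, as well as the toy reference, are hypothetical. On a GPU, each thread would run the same routine on its own read; here a sequential loop stands in for that.

```python
from collections import Counter

def find_deletion(read, reference, min_anchor=8):
    """Return (start, length) of a deletion implied by one read, or None.

    Assumed split-read heuristic (exact matches only, illustrative):
    start is the 0-based reference index of the first deleted base.
    """
    if len(read) < 2 * min_anchor:
        return None
    # Anchor the first min_anchor bases of the read on the reference.
    pos = reference.find(read[:min_anchor])
    if pos < 0:
        return None
    # Extend the left match while read and reference agree.
    i = 0
    while (i < len(read) and pos + i < len(reference)
           and read[i] == reference[pos + i]):
        i += 1
    if i == len(read):
        return None  # the read matches contiguously, so no deletion
    rest = read[i:]
    if len(rest) < min_anchor:
        return None
    # The remainder must reappear downstream of the break point.
    nxt = reference.find(rest, pos + i + 1)
    if nxt < 0:
        return None
    return (pos + i, nxt - (pos + i))

def call_deletions(reads, reference, min_support=2, min_anchor=8):
    """Apply the same routine to every read and keep supported deletions."""
    # Each iteration is independent, which is what makes the work data-parallel.
    hits = (find_deletion(r, reference, min_anchor) for r in reads)
    counts = Counter(h for h in hits if h is not None)
    return {d: n for d, n in counts.items() if n >= min_support}

if __name__ == "__main__":
    # Toy reference; bases 8..13 ("GGGCCC") are missing from the first two reads.
    reference = "ACGTTGCAGGGCCCTATAGCGA"
    reads = [
        "ACGTTGCATATAGCGA",  # spans the deletion
        "GTTGCATATAGCG",     # spans the same deletion from another offset
        "GTTGCAGGGCCC",      # matches the reference contiguously
        "TTTTTTTTTTTT",      # does not anchor anywhere
    ]
    print(call_deletions(reads, reference, min_support=2, min_anchor=4))
    # prints {(8, 6): 2}
```

Because no read depends on the result of any other, the loop body is the natural unit to assign to one GPU thread, with the per-deletion counts merged afterwards.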