Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
PLoS Genet. 2018 Feb 23;14(2):e1007206. doi: 10.1371/journal.pgen.1007206. eCollection 2018 Feb.
Hepatitis B virus (HBV) infection is common worldwide, especially in China. In regions of high HBV prevalence, 60-80% or more of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection. Although traditional Sanger sequencing has been used extensively to investigate HBV sequences, next-generation sequencing (NGS) is becoming more common. It is unknown, however, whether word pattern frequencies of HBV reads obtained by NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV genome in 94 HCC patients and 45 individuals with chronic HBV (CHB) infection. Word pattern frequencies were calculated from each individual's sequence data and compared between individuals using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status with both K-nearest neighbors (KNN) and support vector machine (SVM) classifiers. Word patterns proved highly informative for analyzing HBV sequences. The first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences, and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering grouped the individuals first by their major genotype and then by their HCC status. In cross-validation, the area under the receiver operating characteristic curve (AUC) was around 0.88 for KNN and 0.92 for SVM. In an independent data set of 46 HCC patients and 31 CHB individuals, SVM achieved an AUC of 0.77. We further showed that 3000 reads per individual yield stable prediction results for SVM. Thus, word patterns can predict HCC status with high accuracy. Our study shows that word pattern frequencies of HBV sequences contain much information about the composition of HBV genotypes and the HCC status of an individual.
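To make the first step of this pipeline concrete, the following is a minimal sketch of how word pattern frequencies could be computed for each individual and compared with the Manhattan distance. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the word length, how reads are pooled, or how counts are normalized, so the word length `k`, the pooling over all reads of an individual, the normalization to relative frequencies, and the function names `word_frequencies` and `manhattan` are all hypothetical choices. The reads are toy data.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k):
    """Relative frequency of each length-k DNA word, pooled over all reads.

    Assumption: counts are pooled across an individual's reads and then
    normalized to sum to 1; the abstract does not state the exact scheme.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip words containing ambiguous bases such as N.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    # Fixed word order so that vectors from different individuals align.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Toy reads for two hypothetical individuals (not real HBV data).
individual_a = ["ACGT", "ACGA"]
individual_b = ["ACGT"]

freq_a = word_frequencies(individual_a, k=2)
freq_b = word_frequencies(individual_b, k=2)
distance_ab = manhattan(freq_a, freq_b)
print(distance_ab)
```

Computing this distance for every pair of individuals gives a distance matrix, which is the kind of input that PCoA, hierarchical clustering, and KNN operate on. An SVM could instead be trained on the frequency vectors themselves.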