Quan Chenxu, Liu Fenghui, Qi Lin, Tie Yun
School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou, China.
Department of Respiratory and Sleep Medicine, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China.
Interdiscip Sci. 2023 Jun;15(2):217-230. doi: 10.1007/s12539-023-00554-2. Epub 2023 Feb 27.
Somatic mutations often occur at highly recurrent sites in protein sequences, which indicates that the positional clustering of somatic missense mutations can be used to identify driver genes. However, traditional clustering algorithms over-fit the background signal, are poorly suited to mutation data, and leave room for improvement in identifying genes with low-frequency mutations. In this paper, we propose a linear clustering algorithm based on likelihood ratio test knowledge to identify driver genes. First, the polynucleotide mutation rate is calculated based on the prior knowledge of the likelihood ratio test. Then, a simulated data set is obtained through the background mutation rate model. Finally, an unsupervised peak clustering algorithm is used to evaluate the somatic mutation data and the simulated data separately in order to identify driver genes. The experimental results show that our method achieves a better balance of precision and sensitivity. It can also identify driver genes missed by other methods, making it an effective supplement to them. We also discover some potential linkages between genes, and between genes and mutation sites, which are of great value to targeted drug therapy research.

Method framework: Our proposed model framework is as follows.
a. The mutation sites and the number of mutations in tumor gene elements are counted.
b. The nucleotide context mutation frequency is counted based on the likelihood ratio test knowledge, and the background mutation rate model is obtained.
c. Using Monte Carlo simulation, a data set with the same number of mutations as the gene element is randomly sampled to obtain simulated mutation data; the sampling frequency of each mutation site depends on the polynucleotide mutation rate.
d. The original mutation data and the randomly reconstructed simulated mutation data are each clustered by peak density, and the corresponding clustering scores are obtained.
e. From the original single nucleotide mutation data, step d yields the clustering statistics and the score of each gene segment.
f. The p-value of each gene segment is calculated from its observed score and its simulated clustering scores.
g. From the simulated single nucleotide mutation data, step d yields the clustering statistics and the score of each gene segment.
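The framework steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the uniform background weights, the cutoff distance, and the scoring rule are all assumptions made for illustration. In particular, the score used here is only the cutoff-kernel local density from density peak clustering (the largest fraction of mutations lying within `cutoff` residues of any single mutation), which stands in for the full peak clustering score described in the abstract. The p-value is the usual empirical Monte Carlo estimate, (1 + number of simulated scores at least as large as the observed score) / (1 + number of simulations).

```python
import numpy as np


def simulate_positions(site_weights, n_mutations, rng):
    """Step c (sketch): sample mutation positions from a background model.

    site_weights is a hypothetical per-site background mutation rate;
    each site is drawn with probability proportional to its weight.
    """
    probs = np.asarray(site_weights, dtype=float)
    probs = probs / probs.sum()
    return rng.choice(probs.size, size=n_mutations, replace=True, p=probs)


def clustering_score(positions, cutoff=2):
    """Step d (simplified stand-in): largest local density, as a fraction.

    For every mutation, count the mutations (itself included) that lie
    within `cutoff` positions, then return the largest count divided by
    the total number of mutations.
    """
    pos = np.asarray(positions)
    dist = np.abs(pos[:, None] - pos[None, :])
    local_density = (dist <= cutoff).sum(axis=1)
    return local_density.max() / pos.size


def empirical_p_value(observed_positions, site_weights,
                      n_sim=999, cutoff=2, seed=0):
    """Steps e to g (sketch): compare the observed score with simulated ones."""
    rng = np.random.default_rng(seed)
    observed = clustering_score(observed_positions, cutoff)
    n_mutations = len(observed_positions)
    simulated = np.array([
        clustering_score(
            simulate_positions(site_weights, n_mutations, rng), cutoff)
        for _ in range(n_sim)
    ])
    # Add-one correction keeps the estimate strictly above zero.
    return (1 + np.sum(simulated >= observed)) / (1 + n_sim)


# Toy gene segment of 200 sites with a flat (assumed) background rate.
weights = np.ones(200)
# 20 mutations, 15 of them stacked on one hotspot at position 50.
observed = np.array([50] * 15 + [10, 60, 120, 150, 190])
p_value = empirical_p_value(observed, weights)
```

In this toy input most mutations fall on one site, so the observed score is far above what the flat background produces and the resulting p-value is small. With a real background model, `weights` would instead come from the nucleotide context mutation frequencies of step b.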