Intelligent Information Retrieval Lab, Department of Computer Science and Information Engineering, National Cheng Kung University.
Internal Medicine Department.
Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa281.
Several studies have proposed interpreters that measure the degree of pathogenicity of individual variants. However, predicting disease types and disease-gene associations from such variant-level output faces two essential challenges: the vast number of existing variants, and the many variants classified as variants of uncertain significance (VUS). To tackle these challenges, we propose algorithms that assign a significance score to each gene rather than to each variant, describing the gene's degree of pathogenicity. Because the interpreters classify most variants as VUS, most gene scores are likewise of uncertain significance. To predict these uncertain significance scores, we design two matrix factorization-based models: the common latent space model uses genomic variant data together with heterogeneous clinical data, while the single-matrix factorization model can be used when heterogeneous clinical data are unavailable. We show that the models predict the uncertain significance scores with low error and high accuracy. Moreover, to evaluate the effectiveness of our novel input features, we train five different multi-label classifiers, including a feedforward neural network, on the same feature set and show that all of them achieve high accuracy, which indicates that the main impact of our approach comes from the features. Availability: The source code is freely available at https://github.com/sabdollahi/CoLaSpSMFM.
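The general idea of predicting uncertain scores with a single matrix factorization can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the repository above); it is a generic masked, L2-regularized factorization trained by gradient descent. The sample-by-gene layout, the toy values, the function name `factorize_masked`, and all hyperparameters are assumptions made only for illustration.

```python
import numpy as np

def factorize_masked(R, mask, rank=2, lr=0.05, reg=0.01, epochs=5000, seed=0):
    """Fit R ~ P @ Q.T using only the entries where mask is True.

    Generic regularized matrix factorization; hypothetical helper,
    not the model described in the paper.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    P = 0.1 * rng.standard_normal((n_rows, rank))  # row (sample) factors
    Q = 0.1 * rng.standard_normal((n_cols, rank))  # column (gene) factors
    R_obs = np.where(mask, R, 0.0)                 # neutralize unknown cells
    for _ in range(epochs):
        # Residual on observed cells only; unknown cells contribute zero.
        err = mask * (R_obs - P @ Q.T)
        grad_P = err @ Q - reg * P
        grad_Q = err.T @ P - reg * Q
        P += lr * grad_P
        Q += lr * grad_Q
    return P, Q

# Toy data (assumed): 4 samples x 3 genes, scores in [0, 1].
row_effect = np.array([1.0, 0.5, 0.8, 0.2])
col_effect = np.array([0.9, 0.4, 0.7])
scores = np.outer(row_effect, col_effect)

# Mark three cells as "uncertain": they are hidden from training.
mask = np.ones_like(scores, dtype=bool)
mask[0, 1] = mask[2, 2] = mask[3, 0] = False

P, Q = factorize_masked(scores, mask)
pred = P @ Q.T  # dense matrix: includes predictions for the hidden cells

rmse_observed = np.sqrt(np.mean((pred[mask] - scores[mask]) ** 2))
print("RMSE on observed cells:", rmse_observed)
print("Predicted values for hidden cells:", pred[~mask])
```

The mask is what lets the factorization skip uncertain entries during training while still producing a value for them afterwards. A model that also uses clinical data would need additional matrices sharing the same latent factors, which this sketch does not attempt.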