IEEE J Biomed Health Inform. 2022 Oct;26(10):5075-5084. doi: 10.1109/JBHI.2022.3199462. Epub 2022 Oct 4.
Increasing evidence suggests that circRNA, one of the most promising emerging biomarkers, is closely related to disease. Exploring the relationship between circRNAs and diseases can provide novel perspectives on disease diagnosis and pathogenesis. Existing circRNA-disease association (CDA) prediction models, however, generally treat all data attributes equally: they pay no special attention to the attributes with more significant influence, and they do not make full use of the correlation and symbiosis between attributes to uncover the latent semantic information in the data. In response to these problems, this paper proposes a natural semantic enhancement method, NSECDA, to predict CDAs. Specifically, we first treat the circRNA sequence as a biological language and analyze its natural semantic properties using natural language understanding theory; we then integrate these properties with disease attributes and with the circRNA and disease Gaussian Interaction Profile (GIP) kernel attributes, and use a Graph Attention Network (GAT) to focus on the influential attributes and mine deeply hidden features; finally, we use the Rotation Forest (RoF) classifier to determine CDAs. On the gold-standard dataset CircR2Disease, NSECDA achieved 92.49% accuracy with an AUC score of 0.9225. NSECDA also shows competitive performance in comparison with the model without natural semantic enhancement and with models built on other classifiers. Additionally, 25 of the 30 top-scoring CDA pairs with previously unknown associations have been confirmed by newly reported studies. These results suggest that NSECDA is an effective model for predicting CDAs that can provide credible candidates for subsequent wet-lab experiments, thus significantly narrowing the scope of investigation.
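To make the GIP kernel attributes mentioned above more concrete, the following is a minimal sketch of how a Gaussian Interaction Profile kernel similarity is commonly computed from a binary association matrix. It is not taken from the paper: the function name `gip_kernel`, the bandwidth parameter `gamma_prime` (set to 1.0 here), and the toy matrix are assumptions for illustration, and the abstract does not state which settings NSECDA uses.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """Return the GIP kernel similarity between the rows of `profiles`.

    Each row is the interaction profile of one entity (e.g. one circRNA's
    0/1 associations with every disease). `gamma_prime` is an assumed
    bandwidth parameter, not a value reported in the abstract.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("association matrix contains no known associations")
    # Normalise the bandwidth by the mean squared norm of the profiles.
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)


# Hypothetical toy data: 3 circRNAs (rows) x 4 diseases (columns).
A = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
])
circ_sim = gip_kernel(A)    # circRNA-circRNA similarity, shape (3, 3)
dis_sim = gip_kernel(A.T)   # disease-disease similarity, shape (4, 4)
```

In this toy example the first two circRNAs share an identical interaction profile, so their similarity is 1, while the third, which shares no disease with them, receives a much lower score. Similarity matrices of this kind could then serve as one of the attribute sources that are combined with the sequence-derived features.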