Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, United States.
Center for Innovation to Implementation, VA Palo Alto Health Care System, Sacramento, CA, United States.
J Med Internet Res. 2022 Jul 6;24(7):e38584. doi: 10.2196/38584.
Multiple types of biomedical association knowledge graphs, including COVID-19-related ones, are constructed from co-occurring biomedical entities retrieved from recent literature. However, applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) are prone to false-positive predictions, because co-occurrence in the literature does not always indicate a true biomedical association between two entities.
Data quality plays an important role in training deep neural network models; however, most current work in this area has focused on improving model performance under the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs when only limited labeled information is available.
The proposed framework used generative deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for edge classification (ie, link prediction), leveraging unlabeled link information from a real knowledge graph built from LitCovid and PubTator.
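To make the low-rank-logits idea concrete, the following is a minimal sketch in the spirit of CELL, not the authors' implementation. It fits a low-rank logit matrix whose row-wise softmax approximates the random-walk transition structure of an adjacency matrix by minimizing cross-entropy, then symmetrizes the learned probabilities into edge scores. The function names, the toy graph, the plain gradient descent optimizer, and the simple symmetrization step are all assumptions made for illustration.

```python
import numpy as np

def log_softmax_rows(W):
    """Numerically stable row-wise log-softmax."""
    W = W - W.max(axis=1, keepdims=True)
    return W - np.log(np.exp(W).sum(axis=1, keepdims=True))

def fit_low_rank_logits(A, rank=2, steps=300, lr=0.05, seed=0):
    """Fit logits X @ Y.T so that softmax rows match the edges of A.

    Hypothetical helper: plain gradient descent on the cross-entropy
    loss -sum(A * log_softmax(X @ Y.T)).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    X = 0.1 * rng.standard_normal((n, rank))
    Y = 0.1 * rng.standard_normal((n, rank))
    deg = A.sum(axis=1, keepdims=True)
    losses = []
    for _ in range(steps):
        log_p = log_softmax_rows(X @ Y.T)
        losses.append(float(-(A * log_p).sum()))
        G = deg * np.exp(log_p) - A  # gradient of the loss w.r.t. the logits
        X, Y = X - lr * (G @ Y), Y - lr * (G.T @ X)
    return X, Y, losses

def edge_scores(X, Y):
    """Symmetrized transition probabilities, used here as edge scores (a simplification)."""
    P = np.exp(log_softmax_rows(X @ Y.T))
    return (P + P.T) / 2.0

# Toy graph (assumed): two triangles joined by a single bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

X, Y, losses = fit_low_rank_logits(A)
scores = edge_scores(X, Y)
```

In this sketch, a higher score for a node pair suggests the pair is more consistent with the low-rank structure of the observed graph; a denoising step could then threshold or rank candidate edges by score.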
The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and >0.7 for the real data set), despite the limited amount of training data available.
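The evaluation setup described above can be sketched as follows. This is an illustrative example only: the helper names `split_edges` and `auroc`, the toy edge list, and the scores are hypothetical and do not come from the study. The split keeps 10% of the labeled edges for training and 90% for testing, and the area under the receiver operating characteristic curve is computed as the probability that a randomly chosen true edge is scored above a randomly chosen false edge, with ties counted as one half.

```python
import random

def split_edges(edges, train_fraction=0.1, seed=0):
    """Shuffle labeled edges and split them into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy labeled edges (assumed): (source, target, label), label 1 = true association.
edges = [(i, i + 1, i % 2) for i in range(100)]
train, test = split_edges(edges, train_fraction=0.1)  # 1:9 train/test ratio

# Toy model scores (assumed) for four test edges.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

The pairwise formulation is quadratic in the number of edges, which is fine for a small illustration; a rank-based computation would be preferable for large graphs.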
Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.