Department of Computer Science, University of California, Davis, CA, 95616, USA.
Genome Center, University of California, Davis, CA, 95616, USA.
Nat Commun. 2022 Apr 29;13(1):2360. doi: 10.1038/s41467-022-29993-z.
We present a machine learning framework to automate knowledge discovery through knowledge graph construction, inconsistency resolution, and iterative link prediction. By incorporating knowledge from 10 publicly available sources, we construct an Escherichia coli antibiotic resistance knowledge graph with 651,758 triples from 23 triple types after resolving 236 sets of inconsistencies. Iteratively applying link prediction to this graph, followed by wet-lab validation of the generated hypotheses, reveals 15 E. coli genes that confer antibiotic resistance, 6 of which had never been associated with antibiotic resistance in any microbe. Iterative link prediction improves performance and yields more findings. The predicted probability of a positive finding correlates strongly with the experimentally validated findings (R = 0.94). We also identify 5 homologs in Salmonella enterica, all of which are validated to confer resistance to antibiotics. This work demonstrates how evidence-driven decisions are a step toward automating knowledge discovery with high confidence and at an accelerated pace, thereby replacing traditional time-consuming and expensive methods.
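The two computational steps named above, inconsistency resolution and iterative link prediction, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the relation names `POS` and `NEG`, majority voting as the resolution rule, the `toy_score` heuristic standing in for a trained link-prediction model, and the `validate` callback standing in for a wet-lab experiment.

```python
from collections import Counter, defaultdict

# Hypothetical relation names; the abstract does not list its 23 triple types.
POS = "confers resistance to antibiotic"
NEG = "confers no resistance to antibiotic"


def resolve_inconsistencies(sourced_triples):
    """Resolve POS/NEG conflicts per (head, tail) pair by majority vote.

    sourced_triples: iterable of (head, relation, tail, source).
    Majority voting is a stand-in, not the paper's resolution method.
    Returns (set of resolved triples, number of conflicting pairs).
    """
    votes = defaultdict(Counter)
    resolved = set()
    for head, relation, tail, _source in sourced_triples:
        if relation in (POS, NEG):
            votes[(head, tail)][relation] += 1
        else:
            resolved.add((head, relation, tail))  # other relations pass through

    conflicts = 0
    for (head, tail), counter in votes.items():
        ranked = counter.most_common()
        if len(ranked) > 1:
            conflicts += 1
            if ranked[0][1] == ranked[1][1]:
                continue  # exact tie: drop the pair rather than guess
        resolved.add((head, ranked[0][0], tail))
    return resolved, conflicts


def toy_score(graph, candidate):
    """Toy heuristic: count links that known resistant genes share with the gene."""
    gene, antibiotic = candidate
    neighbors = {t for h, r, t in graph if h == gene and r not in (POS, NEG)}
    resistant = {h for h, r, t in graph if r == POS and t == antibiotic}
    return sum(
        1
        for h, r, t in graph
        if h in resistant and h != gene and r not in (POS, NEG) and t in neighbors
    )


def iterative_link_prediction(graph, candidates, score, validate, rounds=2, top_k=1):
    """Each round: rank untested candidates, test the top_k, add results to the graph."""
    graph = set(graph)
    untested = set(candidates)
    confirmed = []
    for _ in range(rounds):
        # Highest score first; ties broken by name so the order is deterministic.
        ranked = sorted(untested, key=lambda c: (-score(graph, c), c))
        for gene, antibiotic in ranked[:top_k]:
            untested.discard((gene, antibiotic))
            relation = POS if validate(gene, antibiotic) else NEG
            graph.add((gene, relation, antibiotic))  # negatives are kept as knowledge too
            if relation == POS:
                confirmed.append((gene, antibiotic))
    return graph, confirmed
```

Because both positive and negative outcomes are written back into the graph, candidates in later rounds are scored against more evidence than those in the first round, which is the intuition behind applying link prediction iteratively. In practice `score` would be replaced by a trained model and `validate` by the experimental result for that gene and antibiotic pair.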