Vazquez Miguel, Krallinger Martin, Leitner Florian, Kuiper Martin, Valencia Alfonso, Laegreid Astrid
Barcelona Supercomputing Center, Barcelona, Spain.
Biochim Biophys Acta Gene Regul Mech. 2022 Jan;1865(1):194778. doi: 10.1016/j.bbagrm.2021.194778. Epub 2021 Dec 5.
The regulation of gene transcription by transcription factors is a fundamental biological process, yet the relations between transcription factors (TFs) and their target genes (TGs) are still only sparsely covered in databases. Text-mining tools can offer broad and complementary solutions to help locate and extract mentions of these biological relationships in articles. We have generated ExTRI, a knowledge graph of TF-TG relationships, by applying a high-recall text-mining pipeline to MedLine abstracts, identifying over 100,000 candidate sentences with TF-TG relations. Validation procedures indicated that about half of the candidate sentences contain true TF-TG relationships. Post-processing identified 53,000 high-confidence sentences containing TF-TG relationships, with a cross-validation F1-score close to 75%. The resulting collection of TF-TG relationships covers 80% of the relations annotated in existing databases. It adds 11,000 other potential interactions, including relationships for ~100 TFs currently not in public TF-TG relation databases. The high-confidence abstract sentences contribute 25,000 literature references not available from other resources and offer a wealth of direct pointers to functional aspects of the TF-TG interactions. Our compiled resource, encompassing ExTRI together with publicly available resources, delivers literature-derived TF-TG interactions for more than 900 of the 1500-1600 proteins considered to function as specific DNA-binding TFs. The resulting resource can be used by curators, for network analysis and modelling, for causal reasoning or knowledge graph mining approaches, or to benchmark text-mining strategies.