Yea Sangjun, Jang Ho, Kim Soyoung, Lee Sanghun, Kim Jaeuk U
Korean medicine data division, Korea Institute of Oriental Medicine, Daejeon, 34054, Republic of Korea.
Korean convergence medical science, University of Science and Technology, Daejeon, 34113, Republic of Korea.
Sci Data. 2025 Jan 7;12(1):26. doi: 10.1038/s41597-025-04377-2.
The Traditional Formula (TF), a combination of herbs prepared in accordance with traditional medicine principles, is attracting growing global attention as an alternative to modern medicine. In particular, interest in TF's therapeutic effects across various diseases is increasing. Much of the state-of-the-art knowledge about the relationship between TF and disease resides in scientific publications, where manual knowledge extraction is impractical. Natural Language Processing (NLP) is therefore being employed to search and extract key knowledge from unstructured literature efficiently and accurately. However, the absence of a high-quality, manually annotated corpus focusing on TF-disease relationships hampers the use of NLP in traditional medicine and modern biomedical science. This article introduces the Traditional Formula-Disease Relationship (TFDR) corpus, a manually annotated corpus designed to facilitate the automatic extraction of TF-disease relationships from biomedical literature. The TFDR corpus draws on 740 PubMed abstracts and contains 6,211 TF mentions, 7,166 disease mentions, and 1,109 relationships between them, captured in 744 key sentences.