Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-Ku, Tokyo, 135-0064, Japan.
Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-Ku, Nagoya, 468-8511, Japan.
Sci Rep. 2023 Apr 12;13(1):5986. doi: 10.1038/s41598-023-32915-8.
Idiopathic pulmonary fibrosis (IPF) is a severe, progressive, chronic fibrosing interstitial lung disease whose causes remain unclear. Developing effective treatments will require elucidating the detailed pathogenetic mechanisms of IPF at both the molecular and cellular levels. Text-mining systems trained on a biomedical corpus that includes IPF-related entities and events can efficiently extract such mechanism-related information from the large body of literature on the disease. To clarify IPF-related pathogenetic mechanisms, we constructed a novel corpus of 150 abstracts with 9297 entities, intended for training a text-mining system. The corpus is annotated with entity, relation, and event information. To support the construction of IPF-related networks, we also normalized entities by assigning IDs to them, which allowed us to identify mentions of the same entity that are expressed differently. Moreover, unlike existing corpora, this corpus defines IPF-related events. The corpus will be useful for extracting IPF-related information from scientific texts. Because many entities and events are related to lung diseases, this freely available corpus can also be used to extract information on other lung diseases, such as lung cancer and interstitial pneumonia caused by COVID-19.
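The entity normalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple dictionary lookup from surface forms to IDs; the abstract does not describe the actual procedure, and the lexicon entries, the ID strings, and the function names `normalize` and `group_by_id` are hypothetical placeholders rather than anything taken from the corpus.

```python
from collections import defaultdict

# Hypothetical lexicon: lowercase surface form -> placeholder ID.
# These IDs are illustrative only and are NOT taken from the corpus.
LEXICON = {
    "tgf-beta": "GENE:0001",
    "tgf-β": "GENE:0001",
    "transforming growth factor beta": "GENE:0001",
    "ipf": "DISEASE:0001",
    "idiopathic pulmonary fibrosis": "DISEASE:0001",
}


def normalize(mention):
    """Return the placeholder ID for a mention, or None if it is unknown."""
    return LEXICON.get(mention.strip().lower())


def group_by_id(mentions):
    """Group mentions by normalized ID; unknown mentions fall under None."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


if __name__ == "__main__":
    found = ["IPF", "TGF-β", "idiopathic pulmonary fibrosis", "collagen"]
    for entity_id, forms in group_by_id(found).items():
        print(entity_id, forms)
```

Under this assumption, differently written mentions that share an ID collapse into a single node, which is what makes it possible to build a network over entities rather than over raw strings.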